By Vasudev Ram
python-docx is a Python library that can be used to extract the text content from Microsoft Word files that are in the .docx format.
Here is a program (modified a bit from the python-docx examples) that shows how to do it:
# extract_docx_text.py

import sys
from docx import opendocx, getdocumenttext

def extract_docx_text(infil, outfil):
    # Extract the text from the DOCX document object infil and write it to
    # the text file object outfil.
    paragraphs = getdocumenttext(infil)
    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))
    outfil.write('\n'.join(new_paragraphs))

def usage():
    return "Usage: python extract_docx_text.py infile.docx outfile.txt\n"

def main():
    if len(sys.argv) != 3:
        print usage()
        sys.exit(1)
    try:
        infil = opendocx(sys.argv[1])
        outfil = open(sys.argv[2], 'w')
    except Exception, e:
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)
    extract_docx_text(infil, outfil)
    # Close the output file so the text is flushed to disk.
    outfil.close()

if __name__ == '__main__':
    main()

# EOF
Save the program as extract_docx_text.py and run it with:
python extract_docx_text.py input_file.docx output_file.txt
That should extract the text of the .docx file and write it, one paragraph per line and encoded as UTF-8, to the .txt file.
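To see roughly what such an extraction involves, here is a minimal sketch that uses only the Python 3 standard library. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml, with paragraphs as w:p elements and their text in w:t elements. The names docx_paragraphs, write_text and W_NS are my own for illustration and are not part of python-docx, and this is not necessarily how python-docx does it internally. The sketch skips empty paragraphs and ignores tabs, line breaks, headers and footers, so treat it as a rough illustration rather than a replacement for the library.

```python
# docx_text_stdlib.py
# Sketch: read paragraph text straight from the DOCX container (Python 3).
# Function names here are illustrative, not part of python-docx.
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return a list of non-empty paragraph strings from a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def write_text(paragraphs, out_path):
    """Write the paragraphs to out_path as UTF-8, one per line."""
    with io.open(out_path, "w", encoding="utf-8") as outfil:
        outfil.write("\n".join(paragraphs))
```

With those two functions, write_text(docx_paragraphs("input_file.docx"), "output_file.txt") should do much the same job as the program above, assuming the input is a well-formed .docx file.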
- Vasudev Ram - Dancing Bison Enterprises