By Vasudev Ram
python-docx is a Python library that can be used to extract the text content from Microsoft Word files that are in the .docx format.
Here is a program (modified a bit from the python-docx examples) that shows how to do it:
# extract_docx_text.py
# Note: this uses Python 2 syntax (print statement, "except Exception, e").
import sys
from docx import opendocx, getdocumenttext

def extract_docx_text(infil, outfil):
    # Extract the text from the DOCX document object infil and write it to
    # the text file object outfil.
    paragraphs = getdocumenttext(infil)
    # For Unicode handling: encode each paragraph to UTF-8 before writing.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))
    outfil.write('\n'.join(new_paragraphs))

def usage():
    return "Usage: python extract_docx_text.py infile.docx outfile.txt\n"

def main():
    # Expect exactly two arguments: the input .docx and the output .txt.
    if len(sys.argv) != 3:
        print usage()
        sys.exit(1)
    try:
        infil = opendocx(sys.argv[1])
        outfil = open(sys.argv[2], 'w')
    except Exception, e:
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)
    extract_docx_text(infil, outfil)
    # Close the output file so the text is flushed to disk.
    outfil.close()

if __name__ == '__main__':
    main()

# EOF
Save the program as extract_docx_text.py and run it with:
python extract_docx_text.py input_file.docx output_file.txt
That should result in the text of the .docx file being extracted and written to the .txt file.
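For readers curious about what such an extraction involves, here is a minimal sketch that uses only the Python 3 standard library. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml, with paragraphs stored as w:p elements and text runs as w:t elements. The function names docx_paragraphs and extract_docx_text_stdlib are my own, made up for this illustration. This is not necessarily how python-docx does it internally, and the sketch ignores tabs, line breaks, headers, footers and other parts of the document.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path):
    """Return a list with the text of each paragraph in the .docx at path."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs

def extract_docx_text_stdlib(in_path, out_path):
    """Write the paragraphs of in_path to out_path as UTF-8, one per line."""
    with open(out_path, "w", encoding="utf-8") as outfil:
        outfil.write("\n".join(docx_paragraphs(in_path)))
```

Calling extract_docx_text_stdlib("input_file.docx", "output_file.txt") writes one paragraph per line. Opening the output file with encoding="utf-8" takes the place of the explicit encode() step in the Python 2 program above.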
- Vasudev Ram - Dancing Bison Enterprises