Friday, September 27, 2013

Extract text from Word .docx files with python-docx

By Vasudev Ram

python-docx is a Python library that can be used to extract the text content from Microsoft Word files that are in the .docx format.

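Here is a program (modified a bit from the python-docx examples) that shows how to do it. It is written for Python 2 and uses the opendocx and getdocumenttext functions provided by the python-docx release current at the time of writing: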


import sys
from docx import opendocx, getdocumenttext

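def extract_docx_text(infil, outfil):

    # Extract the text from the DOCX file object infil and write it to 
    # the text file object outfil.

    paragraphs = getdocumenttext(infil)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    # Write one paragraph per line, then close the output file.
    outfil.write('\n'.join(new_paragraphs))
    outfil.close()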

def usage():

    # sys.argv[0] is the name the script was invoked with.
    return "Usage: python %s infile.docx outfile.txt\n" % sys.argv[0]

def main():

    if len(sys.argv) != 3:
        print usage()
        sys.exit(1)

    try:
        # Open the input .docx file and the output text file.
        infil = opendocx(sys.argv[1])
        outfil = open(sys.argv[2], 'w')
    except Exception as e:
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    extract_docx_text(infil, outfil)

if __name__ == '__main__':
    main()

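Save the program to a file (any name ending in .py will do; <script>.py below is a placeholder for that name) and run it with:

python <script>.py input_file.docx output_file.txt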

That should result in the text of the .docx file being extracted and written to the .txt file.
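
For a sense of what this kind of extraction involves under the hood, here is a minimal sketch that uses only the Python standard library. It is not python-docx's own implementation, and the function name docx_paragraphs is made up for this illustration. It relies on two facts about the format: a .docx file is a ZIP archive whose main body is stored in word/document.xml, and in that XML each paragraph is a w:p element whose text sits in one or more w:t elements. Unlike the program above, the sketch is written for Python 3, and it only reads the main document body (no headers, footers or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as a list of strings.

    path may be a file name or a binary file-like object.
    (Hypothetical helper for illustration; not part of python-docx.)
    """
    # A .docx file is a ZIP archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text can be split across several w:t elements,
        # so join them to rebuild the full paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return paragraphs
```

Calling "\n".join(docx_paragraphs("input_file.docx")) would then give the paragraphs one per line, much like the output file written by the program above.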

