Wednesday, October 2, 2013

Convert Microsoft Word files to PDF with DOCXtoPDF


By Vasudev Ram



Building upon my recent post, here:

Extract text from Word .docx files with python-docx,

I came up with the idea of combining the DOCX text-extraction functionality of python-docx with my xtopdf toolkit, to create a program that converts the text in Microsoft Word DOCX files to PDF format.

[ Note: The conversion has some limitations. For example, fonts, tables, and other formatting in the input are not preserved in the output; only the text is converted. ]

Here is the program, called DOCXtoPDF.py. It will become a part of my xtopdf toolkit.

# DOCXtoPDF.py

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

import sys
import os
import os.path
import string
from textwrap import TextWrapper
from docx import opendocx, getdocumenttext
from PDFWriter import PDFWriter

def docx_to_pdf(infilename, outfilename):

    # Extract the text from the DOCX file named by infilename and write
    # it to the PDF file named by outfilename.

    try:
        infil = opendocx(infilename)
    except Exception, e:
        print "Error opening " + infilename
        print "Exception: " + repr(e) + "\n"
        sys.exit(1)

    paragraphs = getdocumenttext(infil)

    pw = PDFWriter(outfilename)
    pw.setFont("Courier", 12)
    pw.setHeader("DOCXtoPDF - convert text in DOCX file to PDF")
    pw.setFooter("Generated by xtopdf and python-docx")
    wrapper = TextWrapper(width=70, drop_whitespace=False)

    # For Unicode handling.
    new_paragraphs = []
    for paragraph in paragraphs:
        new_paragraphs.append(paragraph.encode("utf-8"))

    for paragraph in new_paragraphs:
        lines = wrapper.wrap(paragraph)
        for line in lines:
            pw.writeLine(line)
        pw.writeLine("")

    pw.savePage()
    pw.close()
    
def usage():

    return "Usage: python DOCXtoPDF.py infile.docx outfile.txt\n"

def main():

    try:
        # Check for correct number of command-line arguments.
        if len(sys.argv) != 3:
            print "Wrong number of arguments"
            print usage()
            sys.exit(1)
        infilename = sys.argv[1]
        outfilename = sys.argv[2]

        # Check for right infilename extension.
        infile_ext = os.path.splitext(infilename)[1]
        if infile_ext.upper() != ".DOCX":
            print "Input filename extension should be .DOCX"
            print usage()
            sys.exit(1)

        # Check for right outfilename extension.
        outfile_ext = os.path.splitext(outfilename)[1]
        if outfile_ext.upper() != ".PDF":
            print "Output filename extension should be .PDF"
            print usage()
            sys.exit(1)

        docx_to_pdf(infilename, outfilename)

    except Exception, e:
        sys.stderr.write("Error: " + repr(e) + "\n")
        sys.exit(1)

if __name__ == '__main__':
    main()

# EOF
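
The wrapping step in docx_to_pdf() can be tried on its own, without python-docx or xtopdf installed. Here is a minimal sketch, assuming the same TextWrapper settings as the program (width=70, drop_whitespace=False). The function name wrap_paragraphs is a hypothetical helper for illustration and is not part of xtopdf. Unlike the listing above, which uses Python 2 syntax, the sketch is written so that it also runs under Python 3.

```python
from textwrap import TextWrapper


def wrap_paragraphs(paragraphs, width=70):
    """Wrap each paragraph to at most `width` columns.

    Returns a flat list of lines, with an empty string after each
    paragraph, mirroring the pw.writeLine("") call in docx_to_pdf().
    (wrap_paragraphs is a hypothetical helper, not part of xtopdf.)
    """
    wrapper = TextWrapper(width=width, drop_whitespace=False)
    lines = []
    for paragraph in paragraphs:
        # wrap() returns a list of lines, each at most `width` characters.
        lines.extend(wrapper.wrap(paragraph))
        # Blank line to separate this paragraph from the next one.
        lines.append("")
    return lines


if __name__ == "__main__":
    sample = ["A short paragraph.", "word " * 30]
    for line in wrap_paragraphs(sample, width=40):
        print(line)
```

Each line returned here is what the program would pass to pw.writeLine(). Note that an empty paragraph produces only the blank separator line, because wrap() returns an empty list for an empty string.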


To run DOCXtoPDF, give a command of the form:

python DOCXtoPDF.py infilename.docx outfilename.pdf

After this, the text content of the DOCX file will be in the PDF file.
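
The program exits with an error message if the input filename does not end in .docx or the output filename does not end in .pdf, ignoring case. The two checks in main() follow the same pattern, so they could be factored into one function. The following is only a sketch of that idea; has_extension is a hypothetical name, not something in xtopdf or python-docx.

```python
import os.path


def has_extension(filename, expected_ext):
    """Return True if filename ends with expected_ext, ignoring case.

    expected_ext includes the leading dot, e.g. ".docx".
    (has_extension is a hypothetical helper for illustration.)
    """
    # splitext() splits off the last extension, including its dot.
    actual_ext = os.path.splitext(filename)[1]
    return actual_ext.upper() == expected_ext.upper()


if __name__ == "__main__":
    for name in ("report.DOCX", "report.txt", "noextension"):
        print(name, has_extension(name, ".docx"))
```

Like the original checks, this looks only at the filename, not at the contents of the file.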

- Enjoy.




- Vasudev Ram - Dancing Bison Enterprises


2 comments:

Anonymous said...

Hi!

Thanks for this post! I think you have a typo in usage: you should change outfile.txt to outfile.pdf

Vasudev Ram said...


Hi,

>Thanks for this post!

You're welcome.

>I think you have a typo in usage: you should change outfile.txt to outfile.pdf

You're right, the message in the usage() function should say outfile.pdf. Thanks. I won't change it in the post, though, since it is minor, and to do so would mean the post would go to the feeds again. Will just leave the comment here. Developers would spot the typo anyway.