Wednesday, July 17, 2013

PDFBuilder can now handle unlimited input files


By Vasudev Ram

I have blogged about PDFBuilder a couple of times before, here:

PDFBuilder can create composite PDFs

and here:

PDFBuilder can now take multiple input files from command line

I have now modified PDFBuilder so that it can read the list of input files from a file whose name is given on the command line with the -f option. So it can now handle an unlimited (*) number of input files.

(*) Well, strictly speaking, still not unlimited, but limited only by the available memory and hard disk space, and by the maximum size of a single file (the output PDF file). For practical purposes, though, that can be considered unlimited.

Here is the updated PDFBuilder program:
# Filename: PDFBuilder.py
# Description: To create composite PDF files containing the content from 
# a variety of input sources, such as CSV files, TDV (Tab Delimited 
# Values) files, XLS files, etc.

# Author: Vasudev Ram - http://www.dancingbison.com
# Copyright 2012 Vasudev Ram, http://www.dancingbison.com

# This is open source code, released under the New BSD License -
# see http://www.opensource.org/licenses/bsd-license.php .

# ------------------------- imports -------------------------

import sys
import os
import os.path
import string
import csv
from CSVReader import CSVReader
from TDVReader import TDVReader
from PDFWriter import PDFWriter

# ------------------------ class PDFBuilder ----------------

class PDFBuilder:
 """
 Class to build a composite PDF out of multiple input sources.
 """

 def __init__(self, pdf_filename, font, font_size, 
    header, footer, input_filenames):
  """
  PDFBuilder __init__ method.
  """
  self._pdf_filename = pdf_filename
  self._input_filenames = input_filenames

  # Create a PDFWriter instance
  self._pw = PDFWriter(pdf_filename)
  debug("PDFBuilder.__init__(): Created PDFWriter instance")

  # Set its font
  self._pw.setFont(font, font_size)

  # Set the header and footer for the PDFWriter instance
  self._pw.setHeader(header)
  self._pw.setFooter(footer)
  
 def build_pdf(self, input_filenames):
  """
  PDFBuilder.build_pdf method.
  Builds the PDF using contents of the given input_filenames.
  """
  for input_filename in input_filenames:
   # Check if name ends in ".csv", ignoring upper/lower case
   if input_filename[-4:].lower() == ".csv":
    reader = CSVReader(input_filename)
    debug("Created a CSVReader from " + input_filename)
   # Check if name ends in ".tdv", ignoring upper/lower case
   elif input_filename[-4:].lower() == ".tdv":
    reader = TDVReader(input_filename)
    debug("Created a TDVReader from " + input_filename)
   else:
    sys.stderr.write("Error: Invalid input file %r. Exiting\n" % input_filename)
    # Exit with a non-zero status, since this is an error
    sys.exit(1)

   debug("Reading from %r" % reader.get_description())
   hdr_str = "Data from reader: " + \
    reader.get_description()
   self._pw.writeLine(hdr_str)
   self._pw.writeLine('-' * len(hdr_str))

   reader.open()
   try:
    while True:
     row = reader.next_row()
     debug("row", row)
     s = ""
     for item in row:
      s = s + item + " "
     debug("s", s)
     self._pw.writeLine(s)
   except StopIteration:
    # Close this reader, save this PDF page, and 
    # start a new one for next reader.
    reader.close()
    self._pw.savePage()
    #continue

 def close(self):
  self._pw.close()

# ------------------------- main() --------------------------

def main():

 # global variables

 # program name for error messages
 global prog_name
 # debug flag - if true, print debug messages, else don't
 global DEBUGGING
 
 # Set the debug flag based on environment variable
 debug_env_var = os.getenv("DEBUG")
 if debug_env_var == "1":
  DEBUGGING = True

 sysargv = sys.argv
 lsa = len(sysargv)

 # Save program filename for error messages
 prog_name = sysargv[0]
 debug("Entered " + prog_name + ":main()")

 # check for right args
 debug("lsa =", lsa)
 # Need at least the PDF filename and one more argument
 if lsa < 3:
  usage()
  debug(prog_name + ": Incorrect number of args, exiting.")
  sys.exit(1)

 # Get output PDF filename from the command line.
 pdf_filename = sys.argv[1]
 debug("PDF filename = ", pdf_filename)

 # Check if -f option given
 if lsa == 4 and sysargv[2] == '-f':
  # If so, read the input filenames from the file given as 
  # sysargv[3] (the input filenames list)
  input_filenames = []
  with open(sysargv[3], "r") as ifl:
   for fn in ifl:
    fn = fn.rstrip('\r\n')
    # Skip blank lines, e.g. an empty line at the end of the list
    if fn:
     input_filenames.append(fn)
 else:
  # Get the input filenames from the command line.
  input_filenames = sys.argv[2:]

 # Create a PDFBuilder instance.
 pdf_builder = PDFBuilder(pdf_filename, "Courier", 10, 
  "Composite PDF", "Composite PDF", input_filenames)

 # Build the PDF using the inputs.
 pdf_builder.build_pdf(input_filenames)

 pdf_builder.close()

 sys.exit(0)

#------------------------- debug ----------------------------

def debug(msg, *args):

 global DEBUGGING
 if not DEBUGGING:
  return
 sys.stderr.write(msg + ": ")
 sys.stderr.write(repr(args) + "\n")

#------------------------- usage ----------------------------

def usage():
 
 global prog_name
 sys.stderr.write("Usage: python " + prog_name + \
  " pdf_filename input_filename(s)\n" + \
  " OR python " + prog_name + " pdf_filename -f input_filename_list\n" + \
  " where input_filename_list is a file containing input filenames\n")

#------------------------- call main ------------------------

if __name__ == "__main__":
 # Set default value for DEBUGGING, override later in main() 
 # based on value of env. var. DEBUG.
 try:
  DEBUGGING = False
  main()
 except Exception as e:
  # Convert the exception to a string before concatenating
  sys.stderr.write("Caught an exception: " + str(e) + "\n")
  sys.exit(1)


#------------------------- EOF: PDFBuilder.py -----------------

You can run it like this:

python PDFBuilder.py PDFBuilder11.pdf -f input_filename_list.txt

where input_filename_list.txt contains the same input filenames used in the earlier posts, one per line (with no leading or trailing spaces).

This will create the composite PDF file PDFBuilder11.pdf, generated from the contents of all those files, as in the earlier posts.
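
If there are many input files, typing the list by hand gets tedious. Here is a rough sketch of one way to generate it. This is not part of xtopdf; the function name write_filename_list and its parameters are my own assumptions for illustration. It collects the .csv and .tdv files in a directory and writes their paths to the list file, one per line:

```python
import os

# Hypothetical helper, not part of xtopdf or PDFBuilder.
def write_filename_list(directory, list_path, extensions=(".csv", ".tdv")):
    """Write the paths of files in `directory` whose names end in one of
    `extensions` (ignoring case) to `list_path`, one path per line.
    Returns the number of filenames written."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(extensions)
    )
    with open(list_path, "w") as out:
        for name in names:
            # No leading or trailing spaces, just the path and a newline
            out.write(os.path.join(directory, name) + "\n")
    return len(names)
```

For example, write_filename_list("data", "input_filename_list.txt") would list the matching files under a directory called data (an assumed name), and the resulting file can then be passed to PDFBuilder after the -f option.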

The difference is that in the previous post about PDFBuilder, the input filenames were specified on the command line, which is subject to a length limit. (In earlier UNIX versions the limit was typically 512 or 1024 bytes, which sometimes led to errors or core dumps; it has been increased in more recent UNIX and Linux versions.)

But this version of PDFBuilder can handle a very large number of input files, since they are not specified on the command line but in another text file, which is given after the -f option in the above command.
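
To see what the command-line length limit mentioned above is on your own system, here is a small sketch using os.sysconf, which is available on Unix-like systems only. The function name max_command_line_bytes is just an illustrative choice. Note that the value reported (ARG_MAX) covers the arguments and the environment variables together, so the space actually available for filenames is somewhat less:

```python
import os

# Hypothetical helper name; os.sysconf itself is a standard library call.
def max_command_line_bytes():
    """Return the system's ARG_MAX value in bytes, or None if it
    cannot be queried on this platform."""
    if not hasattr(os, "sysconf"):
        return None  # os.sysconf is not available, e.g. on Windows
    try:
        return os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError):
        # The name is unknown or not supported on this system
        return None

limit = max_command_line_bytes()
if limit is None:
    print("Could not determine the limit on this platform")
else:
    print("Arguments plus environment may use up to %d bytes" % limit)
```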

I will upload this new PDFBuilder version to the Bitbucket repository for xtopdf shortly.

To read all my posts about xtopdf, you can use this search:

jugad2.blogspot.com/search/label/xtopdf

and similarly, to read all my posts about Python, use this search:

jugad2.blogspot.com/search/label/python

This is a Blogger feature that I got to know about, thanks to Michael Foord.

- Vasudev Ram - Dancing Bison Enterprises


4 comments:

Craig Maloney said...

Hello,

This looks pretty cool. Would you consider putting it into a github repo so others can download / modify / submit pull requests?

Thanks!

Vasudev Ram said...


Hi there,

Thanks, I'm glad you like PDFBuilder / xtopdf.

I am somewhat busy for a few days, but will put the updated version up on Bitbucket, in a couple of days, after some code cleanup and testing - PDFBuilder was already a part of my xtopdf toolkit, which is at:

https://bitbucket.org/vasudevram/xtopdf

I'll definitely consider putting it up on Github too, but need to think about that for a bit, and check out how much extra overhead there will be to maintain xtopdf on two repository sites, and keep them in sync. There may be some workaround for that.

Meanwhile, if you want to try it out sooner (obvious, I know, but worth mentioning), you could just copy/paste the PDFBuilder code from this blog post. It's released under the BSD license, after all.

Also, please feel free to suggest new features for xtopdf. I'll try to implement what I can.

Thanks,
Vasudev

Vasudev Ram said...


Apropos, interesting post and comments about Bitbucket vs. Github:

http://eli.thegreenplace.net/2013/06/09/switching-my-open-source-projects-from-bitbucket-to-github/

Vasudev Ram said...


@Craig Maloney: The updated PDFBuilder code is now on Bitbucket at:

https://bitbucket.org/vasudevram/xtopdf

- Vasudev.