Thursday, January 7, 2016

Generate PDF from a Python-controlled Unix pipeline


By Vasudev Ram


This post is about a new xtopdf app I wrote, called PopenToPDF.py.

(xtopdf is my PDF generation toolkit, written in Python. The toolkit consists of a core library and multiple applications built using it.)

This program, PopenToPDF, shows how to use xtopdf to generate PDF output from any Python-controlled Unix pipeline. It uses the subprocess Python module.

I had written a few posts earlier about the uses of StdinToPDF.py, another xtopdf app [1].

(There are many kinds of pipelines; it is a powerful concept.)

StdinToPDF is an application of xtopdf that can be used at the end of a Unix or Windows pipeline, to publish the text output of the pipeline to PDF.

[1] Here are some of those posts about StdinToPDF:

a) PDFWriter can create PDF from standard input

b) Print selected text pages to PDF with Python, selpg and xtopdf on Linux

c) Generate Windows Task List to PDF with xtopdf

PopenToPDF has the same general goal as StdinToPDF (to allow creation of a pipeline whose final output is PDF), but works somewhat differently.

Instead of just being used passively (like StdinToPDF) as the last component in a pipeline run from the command line, PopenToPDF is a Python program that itself sets up and runs a pipeline (of all the preceding commands, excepting itself), using subprocess.Popen. It then reads the output of that pipeline programmatically and converts the text it reads to PDF. So it is a different approach, one that may allow for other possibilities for customization.
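To make the idea concrete, here is a minimal sketch of chaining commands with subprocess.Popen and reading the final output from Python. It is not the actual PopenToPDF code (that is in a separate post); the function name run_pipeline and its parameters are assumptions made up for this illustration, and it only returns the text lines rather than writing a PDF.

```python
import subprocess


def run_pipeline(commands, input_path):
    """Run `commands` (a list of argv lists) as a pipeline fed from
    input_path, and return the final output as a list of text lines.

    Hypothetical helper for illustration; not part of xtopdf.
    """
    if not commands:
        raise ValueError("need at least one command")
    procs = []
    with open(input_path, "rb") as infile:
        prev_stdout = infile
        for cmd in commands:
            proc = subprocess.Popen(
                cmd, stdin=prev_stdout, stdout=subprocess.PIPE
            )
            # Close our copy of the upstream pipe, so the upstream
            # process sees a broken pipe if this one exits early.
            if prev_stdout is not infile:
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        # Read everything the last command writes, then wait for it.
        output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    return output.decode("utf-8", errors="replace").splitlines()
```

For the pipeline in this post, the call would look something like run_pipeline([["nl", "-ba"], ["selpg", "-s3", "-e5"]], "1000-lines.txt"), assuming both nl and selpg are on your PATH.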

For the example, I created an input text file of 1000 lines, via a small one-off script. The file is called 1000-lines.txt.

The pipeline (created by PopenToPDF) runs "nl -ba" to add sequential line numbers to each line of the input file. (nl is a Unix command to number lines.) Then the output is passed to my selpg utility (a command-line utility in C), which is a filter that reads its input and selects only a specified range of pages to pass on to the output. (Full details of the selpg utility, including explanation of its logic, source code, and the build steps, are at the URL in the previous sentence, or at links accessible from that URL.)

(This page on sites.harvard.edu is a good resource for Linux command line utility development, and also references my IBM dW article about selpg.)

PopenToPDF sets up the above pipeline (nl -ba piped to selpg), and then reads all the lines from it, adds its own line numbers to the input, and writes it all to a PDF file.

Thus we end up with two sets of line numbers prefixed to each line (in the PDF): the original line numbers added by the nl command, which represent the position of each extracted line in the original file, and the serial numbers (starting from 0) for the subset of lines that PopenToPDF sees.
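The second set of numbers is simple to produce. A rough sketch, assuming the pipeline output is already available as a list of lines (the function name add_serial_numbers and the width parameter are made up for this example, and the exact formatting in PopenToPDF may differ):

```python
def add_serial_numbers(lines, start=0, width=6):
    """Prefix each line with a right-aligned serial number.

    Numbering starts at `start` (0 here, matching the relative
    numbers described in the post). Hypothetical helper.
    """
    return [f"{n:>{width}} {line}" for n, line in enumerate(lines, start)]


# Lines as they might arrive from "nl -ba" piped to selpg:
# each already carries its absolute line number.
sample = ["   145\tfirst extracted line", "   146\tsecond extracted line"]
for numbered in add_serial_numbers(sample):
    print(numbered)
```

Each printed line then carries the relative number first, followed by the absolute number that nl added.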

I did this so that we could verify that the pipeline is extracting the right lines that we specified, by looking at the relative and absolute line numbers in the output (screenshots below).

Here is a screenshot of the first page of the PDF output:


And here is a screenshot of the last page, page 4, of the PDF output:


You can see that the last relative line number (added by PopenToPDF, in the extreme left number column) is 215, and the first was 0 (on the first page), so the number of lines extracted by selpg is 216. That corresponds to what we asked selpg for by specifying a start page of 3 (-s3) and an end page of 5 (-e5), since there are 72 lines per page (the default) and 72 * (5 - 3 + 1) = 72 * 3 = 216. You can do a similar calculation for the absolute line numbers shown, to verify that we have extracted not only the right number of pages, but also the right pages.
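The calculation for the absolute line numbers can be sketched in a few lines of Python. This assumes fixed-length pages of 72 lines (the default mentioned above) and an input file long enough to contain the whole range; the function name page_line_range is just for this example.

```python
def page_line_range(start_page, end_page, lines_per_page=72):
    """Return (first_line, last_line, line_count) for a 1-based,
    inclusive page range, assuming fixed-length pages."""
    first = (start_page - 1) * lines_per_page + 1
    last = end_page * lines_per_page
    return first, last, last - first + 1


first, last, count = page_line_range(3, 5)
print(first, last, count)  # 145 360 216
```

So with -s3 and -e5, the absolute numbers added by nl should run from 145 to 360, alongside the relative numbers 0 to 215.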

So this approach (using Popen) can be used to run a pipeline under control of a Python program, read the output of the pipeline, and do some further processing on it. Obviously, it is a generic approach, not limited to producing PDF. It could be used for any other purpose where you want to run a pipeline under program control and do something with the output of the pipeline, in your own Python code.

I'll end with a few points about related topics:

This program is actually an example of a category of data processing operations commonly used in organizations, which can be broadly described as starting with some data source, and passing it through a series of transformations until we have the final output we want.

Often, but not always, the input for these transformations is downloaded from some database or application (of the organization), and/or the output is uploaded to another database or application (also of the organization).

In some of these cases, the process is called ETL, meaning Extract, Transform, Load. This operation is also related to IT system integration.

In general, these tasks can consist of a combination of the use of existing components (programs) and purpose-written code in a compiled or interpreted language. The operation can also consist of a combination of manual and automated steps.

When there is enough uniformity in the data and needed processing rules across runs, using more automation leads to more time and cost savings. Some amount of variation in the data or rules can be handled by parameterization of input and output filenames, database connections, table names, use of conditional logic, etc.
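As a small illustration of such parameterization, here is a sketch that takes the input file, output file, and page range from the command line and builds the pipeline commands from them. All the option names and both function names are assumptions invented for this example; PopenToPDF itself may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command-line parameters for a pipeline run."""
    parser = argparse.ArgumentParser(
        description="Run nl | selpg on a text file (illustrative only)."
    )
    parser.add_argument("input_file", help="text file to read")
    parser.add_argument("output_pdf", help="PDF file to write")
    parser.add_argument("--start-page", type=int, default=1)
    parser.add_argument("--end-page", type=int, default=1)
    return parser.parse_args(argv)


def build_commands(args):
    """Build the argv lists for the pipeline from the parsed parameters."""
    return [
        ["nl", "-ba"],
        ["selpg", f"-s{args.start_page}", f"-e{args.end_page}"],
    ]


args = parse_args(["1000-lines.txt", "out.pdf",
                   "--start-page", "3", "--end-page", "5"])
print(build_commands(args))
```

The same run can then be repeated on different files or page ranges without editing the program.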

Finally, in the process of writing this program and post (across a couple of sessions), I came across mentions of microservices in tech forums. Microservices have been in the news for a while. So I looked up definitions of microservices and realized that they are in some ways similar to Unix pipelines and the Unix philosophy of creating small tools that do one thing well, and then combining them to achieve bigger tasks.

If you're interested in pipes and Python and their intersection, also check out this HN comment by me, which lists multiple other Python pipe-like tools, including one (pipe_controller) by me:

Yes, pyp is interesting. So are some other roughly similar Python tools

- Enjoy.

- Vasudev Ram - Online Python training and programming


2 comments:

Vasudev Ram said...


I forgot to include the code for PopenToPDF in this post:

So here it is, in this other post:

Code for recent post about PDF from a Python pipeline:

http://jugad2.blogspot.in/2016/01/code-for-recent-post-about-pdf-from.html

That post also includes the code for the one-off script, gen-file.py, which generates the input file 1000-lines.txt used in this post.

Vasudev Ram said...

Referring to my above comment:

Here's the other post with the code again, this time as a clickable link:

Code for recent post about PDF from a Python pipeline