Showing posts with label Python-pipes. Show all posts

Thursday, January 7, 2016

Code for recent post about PDF from a Python pipeline

By Vasudev Ram

In this recent post:

Generate PDF from a Python-controlled Unix pipeline,

I forgot to include the code for the program PopenToPDF.py. Here it is now:
# PopenToPDF.py
# Demo program to read text from a shell pipeline using 
# subprocess.Popen, and write the text to PDF using xtopdf.
# Author: Vasudev Ram
# Copyright (C) 2016 Vasudev Ram - http://jugad2.blogspot.com

import sys
import subprocess
from PDFWriter import PDFWriter

def error_exit(message):
    sys.stderr.write(message + '\n')
    sys.stderr.write("Terminating.\n")
    sys.exit(1)

def main():
    pw = None
    try:
        # Create and set up a PDFWriter instance.
        pw = PDFWriter("PopenTo.pdf")
        pw.setFont("Courier", 12)
        pw.setHeader("Use subprocess.Popen to read pipe and write to PDF.")
        pw.setFooter("Done using selpg, xtopdf, Python and ReportLab, on Linux.")

        # Set up a pipeline with nl and selpg such that we can read from its stdout.
        # nl numbers the lines of the input.
        # selpg extracts pages 3 to 5 from the input.
        # universal_newlines=True opens the pipe in text mode, so each line
        # is a str on Python 3 as well as on Python 2.
        pipe = subprocess.Popen("nl -ba 1000-lines.txt | selpg -s3 -e5",
            shell=True, bufsize=-1, stdout=subprocess.PIPE,
            stderr=sys.stderr, universal_newlines=True).stdout

        # Read from the pipeline and write the data to PDF, using the PDFWriter instance.
        for idx, line in enumerate(pipe):
            pw.writeLine(str(idx).zfill(8) + ": " + line)
    except IOError as ioe:
        error_exit("Caught IOError: {}".format(str(ioe)))
    except Exception as e:
        error_exit("Caught Exception: {}".format(str(e)))
    finally:
        # pw is still None if the PDFWriter constructor itself failed.
        if pw is not None:
            pw.close()

main()
I ran it in the usual way with:
$ python PopenToPDF.py
to get the output shown in the previous post describing PopenToPDF.

Also, this is the one-off script, gen-file.py, that created the 1000 line input file:
# The with statement closes the file on exit, so no explicit close() is needed.
with open("1000-lines.txt", "w") as fil:
    for i in range(1000):
        fil.write("This is a line of text.\n")

- Vasudev

- Vasudev Ram - Online Python training and programming

Signup to hear about new products and services I create.

Posts about Python  Posts about xtopdf

My ActiveState recipes

Generate PDF from a Python-controlled Unix pipeline


By Vasudev Ram


This post is about a new xtopdf app I wrote, called PopenToPDF.py.

(xtopdf is my PDF generation toolkit, written in Python. The toolkit consists of a core library and multiple applications built using it.)

This program, PopenToPDF, shows how to use xtopdf to generate PDF output from any Python-controlled Unix pipeline. It uses the subprocess Python module.

I had written a few posts earlier about the uses of StdinToPDF.py, another xtopdf app [1].

(There are many kinds of pipelines; it is a powerful concept.)

StdinToPDF is an application of xtopdf that can be used at the end of a Unix or Windows pipeline, to publish the text output of the pipeline to PDF.

[1] Here are some of those posts about StdinToPDF:

a) PDFWriter can create PDF from standard input

b) Print selected text pages to PDF with Python, selpg and xtopdf on Linux

c) Generate Windows Task List to PDF with xtopdf

PopenToPDF has the same general goal as StdinToPDF (to allow creation of a pipeline whose final output is PDF), but works somewhat differently.

Instead of just being used passively (like StdinToPDF) as the last component in a pipeline run from the command line, PopenToPDF is a Python program that itself sets up and runs a pipeline (of all the preceding commands, excepting itself), using subprocess.Popen. It then reads the output of that pipeline programmatically and converts the text it reads to PDF. So it is a different approach that may allow for other possibilities for customization.

For the example, I created an input text file of 1000 lines, via a small one-off script. The file is called 1000-lines.txt.

The pipeline (created by PopenToPDF) runs "nl -ba" to add sequential line numbers to each line of the input file. (nl is a Unix command to number lines.) Then the output is passed to my selpg utility (a command-line utility in C), which is a filter that reads its input and selects only a specified range of pages to pass on to the output. (Full details of the selpg utility, including explanation of its logic, source code, and the build steps, are at the URL in the previous sentence, or at links accessible from that URL.)

(This page on sites.harvard.edu is a good resource for Linux command line utility development, and also references my IBM dW article about selpg.)

PopenToPDF sets up the above pipeline (nl -ba piped to selpg), and then reads all the lines from it, adds its own line numbers to the input, and writes it all to a PDF file.

Thus we end up with two sets of line numbers prefixed to each line (in the PDF): the original line numbers added by the nl command, which represent the position of each extracted line in the original file, and the serial numbers (starting from 0) for the subset of lines that PopenToPDF sees.

I did this so that we could verify that the pipeline is extracting the right lines that we specified, by looking at the relative and absolute line numbers in the output (screenshots below).
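The two sets of numbers can be imitated in pure Python, without running nl or selpg. This is only a sketch: the helper names number_lines and select_pages are hypothetical, and select_pages assumes fixed-length pages of 72 lines. It is not how selpg itself is implemented.

```python
from itertools import islice

LINES_PER_PAGE = 72  # assumption: fixed-length pages of 72 lines


def number_lines(lines):
    """Prefix each line with a 1-based number, roughly like `nl -ba`."""
    for num, line in enumerate(lines, start=1):
        yield "{:6d}\t{}".format(num, line)


def select_pages(lines, start_page, end_page, lines_per_page=LINES_PER_PAGE):
    """Return only the lines on pages start_page..end_page (1-based, inclusive)."""
    first = (start_page - 1) * lines_per_page
    last = end_page * lines_per_page
    return islice(lines, first, last)


# Same content as the 1000-line input file, generated in memory.
source = ("This is a line of text.\n" for _ in range(1000))
selected = list(select_pages(number_lines(source), 3, 5))

# Relative numbers (from 0) are added here; absolute numbers came from number_lines.
for idx, line in enumerate(selected[:2]):
    print(str(idx).zfill(8) + ": " + line, end="")
```

The first selected line carries relative number 0 and absolute number 145, which is the first line of page 3 when pages are 72 lines long.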

Here is a screenshot of the first page of the PDF output:


And here is a screenshot of the last page, page 4, of the PDF output:


You can see that the last relative line number (added by PopenToPDF, in the leftmost number column) is 215, and the first was 0 (on the first page), so selpg extracted 216 lines. That matches what we asked selpg for by specifying a start page of 3 (-s3) and an end page of 5 (-e5): there are 72 lines per page (the default), and 72 * (5 - 3 + 1) = 72 * 3 = 216. You can do a similar calculation for the absolute line numbers shown, to verify that we have extracted not only the right number of pages, but also the right pages.
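The calculation for the absolute line numbers can be written out as a small function. The name page_line_range is hypothetical, and the sketch assumes fixed-length pages of 72 lines.

```python
def page_line_range(start_page, end_page, lines_per_page=72):
    """Return (first_line, last_line, line_count) for an inclusive page range."""
    first_line = (start_page - 1) * lines_per_page + 1
    last_line = end_page * lines_per_page
    return first_line, last_line, last_line - first_line + 1


first, last, count = page_line_range(3, 5)
print(first, last, count)  # 145 360 216
```

So the nl numbers in the PDF should run from 145 to 360, alongside relative numbers 0 to 215.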

So this approach (using Popen) can be used to run a pipeline under control of a Python program, read the output of the pipeline, and do some further processing on it. Obviously, it is a generic approach, not limited to producing PDF. It could be used for any other purpose where you want to run a pipeline under program control and do something with the output of the pipeline, in your own Python code.
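Here is a minimal sketch of the same idea without shell=True: each command gets its own Popen, and the stdout of one is connected to the stdin of the next. The function name run_pipeline is hypothetical, and the two child commands are small Python one-liners standing in for nl and selpg, so that the sketch does not depend on those tools being installed.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run commands (argument lists) as a pipeline; return the last one's output lines."""
    if not commands:
        raise ValueError("need at least one command")
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(
            cmd,
            stdin=prev_stdout,
            stdout=subprocess.PIPE,
            universal_newlines=True,  # text mode, so lines are str
        )
        if prev_stdout is not None:
            # The child has its own copy of the read end; close ours.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)

    lines = list(prev_stdout)
    prev_stdout.close()
    for proc in procs:
        proc.wait()
    return lines


# Stand-ins for real pipeline stages (assumptions for illustration only).
producer = [sys.executable, "-c", "for i in range(10): print('line', i)"]
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]

for idx, line in enumerate(run_pipeline([producer, upper])):
    print(str(idx).zfill(8) + ": " + line, end="")
```

Passing argument lists instead of a shell string avoids shell quoting issues, at the cost of a little more setup code. What you do with the returned lines (PDF, a report, a database load) is up to the calling program.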

I'll end with a few points about related topics:

This program is actually an example of a category of data processing operations commonly used in organizations, which can be broadly described as starting with some data source, and passing it through a series of transformations until we have the final output we want.

Often, but not always, the input for these transformations is downloaded from some database or application (of the organization), and/or the output is uploaded to another database or application (also of the organization).

In some of these cases, the process is called ETL, meaning Extract, Transform, Load. This operation is also related to IT system integration.

In general, these tasks can consist of a combination of the use of existing components (programs) and purpose-written code in a compiled or interpreted language. The operation can also consist of a combination of manual and automated steps.

When there is enough uniformity in the data and needed processing rules across runs, using more automation leads to more time and cost savings. Some amount of variation in the data or rules can be handled by parameterization of input and output filenames, database connections, table names, use of conditional logic, etc.

Finally, in the process of writing this program and post (across a couple of sessions), I came across mentions of microservices in tech forums. Microservices have been in the news for a while. So I looked up definitions of microservices and realized that they are in some ways similar to Unix pipelines and the Unix philosophy of creating small tools that do one thing well, and then combining them to achieve bigger tasks.

If you're interested in pipes and Python and their intersection, also check out this HN comment by me, which lists multiple other Python pipe-like tools, including one (pipe_controller) by me:

Yes, pyp is interesting. So are some other roughly similar Python tools

- Enjoy.

- Vasudev Ram - Online Python training and programming



Tuesday, October 16, 2012

Swapping pipe components at runtime with pipe_controller


By Vasudev Ram

In my previous post on pipe_controller, Using PipeController to run a pipe incrementally, I mentioned that it had some interesting properties. That post gave an example of one such property: the ability to run a pipe incrementally under program control, with successive outputs going to different files on each incremental run.

This post talks about another pipe_controller property that I discovered by experimentation: you can swap the order of components in a pipeline at runtime, programmatically. That is, you can do something like (using UNIX syntax, though pipe_controller is in Python and works differently):

foo | bar | baz # with output going to file 1

then swap the positions of foo and baz, then run the pipe again:

baz | bar | foo # with output going to file 2

and so on - any number of times, all in the same program run.

This feature lets you experiment with, and validate, your pipeline logic, to make sure that it does what you intended, e.g. you can check the output both before and after swapping components of the pipe, to decide which order you really need - or which order is optimal - see next paragraph.

The feature can also be used to time the execution of two or more different versions of the pipeline (with/without swapping of components), to see which runs faster, in cases where changing the order of those components makes no difference to the output, e.g. if those two components are commutative, in the mathematical sense (like a + b = b + a).

Obvious caveat: a timing test will only show you whether version A or B is faster for the given input, not for other inputs. But after studying the results of a few tests, you may be able to use logic or induction to figure out a rule (about the relative speeds) that applies to most or all of the data.
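A timing comparison of this kind could look roughly like the sketch below. It uses plain Python functions and the standard timeit module rather than pipe_controller's own classes; run_pipe, upcase and delspace here are stand-ins written for this example.

```python
import timeit


def run_pipe(processors, text):
    """Apply each processor function to the text, in order."""
    for processor in processors:
        text = processor(text)
    return text


def upcase(text):
    return text.upper()


def delspace(text):
    return text.replace(" ", "")


data = "some lowercase text " * 10000
order_a = [upcase, delspace]
order_b = [delspace, upcase]

# The two stages commute, so the outputs must match before timing means anything.
assert run_pipe(order_a, data) == run_pipe(order_b, data)

for name, order in (("upcase | delspace", order_a), ("delspace | upcase", order_b)):
    # Take the best of 5 repeats to reduce noise from other processes.
    seconds = min(timeit.repeat(lambda: run_pipe(order, data), number=20, repeat=5))
    print("{}: {:.4f} s for 20 runs".format(name, seconds))
```

Checking that both orderings give the same output first guards against timing two pipelines that are not actually equivalent.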

To enable the feature, I added this method, swap_processors(), to the PipeController class (in file pipe_controller.py):
def swap_processors(self, processor1, processor2):
    """
    PipeController method.
    It lets the caller swap the positions of two
    processors in the list.
    """
    debug("entered PipeController.swap_processors")
    pos1 = find_element(self._processors, processor1)
    pos2 = find_element(self._processors, processor2)
    if (pos1 == -1) or (pos2 == -1):
        # Either or both processors not found, exit.
        sys.stderr.write("Error: processor1 or 2 not found in list\n")
        sys.exit(1)
    else:
        # Found both, swap their positions.
        self._processors[pos1], self._processors[pos2] = \
            self._processors[pos2], self._processors[pos1]
    debug("exiting PipeController.swap_processors")
The method uses this function, find_element():
# Find index of given element in list lis.
# Return index (>=0) if found, else -1.

def find_element(lis, element):
    try:
        pos = lis.index(element)
    except ValueError:
        pos = -1
    return pos
With these additions, you can run this program, test_pipe_controller_04.py, which demos swapping pipe components at runtime. It uses the same input file, it1, as in the earlier post about pipe_controller:
$ cat it1
     1  some lowercase text
     2  more lowercase text
     3  even more lowercase text
     4  yet more lowercase text
Run the new test program like this:
$ python test_pipe_controller_04.py it1 ot04-
The last command-line argument, ot04-, ends with a hyphen because it is a prefix for the 3 output files created: ot04-001, ot04-002, and ot04-003.

The test program does these things:

1. Runs the pipe [ oto0, eto3, upcase, delspace ] on the input. The output is:
$ cat ot04-001
1       S0M3L0W3RCAS3T3XT
2       M0R3L0W3RCAS3T3XT
3       3V3NM0R3L0W3RCAS3T3XT
4       Y3TM0R3L0W3RCAS3T3XT

2. Swaps the positions of oto0 and upcase. Then runs the modified pipe [ upcase, eto3, oto0, delspace ] on the same input. The output is:
$ cat ot04-002
1       SOMELOWERCASETEXT
2       MORELOWERCASETEXT
3       EVENMORELOWERCASETEXT
4       YETMORELOWERCASETEXT
Due to the modified pipeline, all lowercase letters get converted to uppercase first, so the later-run functions eto3 and oto0 now have no effect on the input, but delspace still does.

3. Swaps the current positions of eto3 and upcase. Then runs the modified pipe [ eto3, upcase, oto0, delspace ] on the same input. The output is:
$ cat ot04-003
1       SOM3LOW3RCAS3T3XT
2       MOR3LOW3RCAS3T3XT
3       3V3NMOR3LOW3RCAS3T3XT
4       Y3TMOR3LOW3RCAS3T3XT
This time, due to the pipeline being modified again, all lowercase letters "e" get converted to "3" first, and then all remaining letters get converted to uppercase, so the later-run function oto0 now has no effect on the input, but delspace still does.

To reiterate, this ability to swap components at runtime, and re-run the pipe (with output going to a different file each time), allows you to experiment with / validate your pipeline logic, and/or to do performance comparison of different pipeline orderings.
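The three runs above can be reproduced with plain functions, outside pipe_controller. In this sketch the four processors are my guesses at their behavior, inferred from their names and the outputs shown (oto0 replaces "o" with "0", eto3 replaces "e" with "3", and so on). The swap and run_pipe helpers are stand-ins for the PipeController methods, and the leading line numbers of the input file are left out.

```python
def oto0(text):
    return text.replace("o", "0")


def eto3(text):
    return text.replace("e", "3")


def upcase(text):
    return text.upper()


def delspace(text):
    return text.replace(" ", "")


def swap(processors, first, second):
    """Swap two processors in place within the list."""
    i, j = processors.index(first), processors.index(second)
    processors[i], processors[j] = processors[j], processors[i]


def run_pipe(processors, text):
    """Apply each processor function to the text, in order."""
    for processor in processors:
        text = processor(text)
    return text


pipe = [oto0, eto3, upcase, delspace]
line = "some lowercase text"

results = [run_pipe(pipe, line)]      # run 1: original order
swap(pipe, oto0, upcase)              # now [upcase, eto3, oto0, delspace]
results.append(run_pipe(pipe, line))  # run 2
swap(pipe, eto3, upcase)              # now [eto3, upcase, oto0, delspace]
results.append(run_pipe(pipe, line))  # run 3

for result in results:
    print(result)
```

For the first input line this prints S0M3L0W3RCAS3T3XT, SOMELOWERCASETEXT and SOM3LOW3RCAS3T3XT, matching the text of the first line in each of the three output files.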

This updated version of pipe_controller is available on Bitbucket, here:

Python pipe_controller module.

- Vasudev Ram - Dancing Bison Enterprises

Wednesday, October 10, 2012

Created a Bitbucket repository for pipe_controller

By Vasudev Ram


I created a Bitbucket repository for my pipe_controller Python project. (My earlier posts referred to it as PipeController, but that is the name of the Python class; from now on I'll use the module name, pipe_controller, since that is more appropriate.)

The pipe_controller repository is here.

pipe_controller is released under the BSD 3-Clause license.

- Vasudev Ram - Dancing Bison Enterprises