jugad2 - Vasudev Ram on software innovation

- Vasudev Ram - Online Python training and programming


Thursday, January 7, 2016

Code for recent post about PDF from a Python pipeline

By Vasudev Ram

In this recent post:

Generate PDF from a Python-controlled Unix pipeline,

I forgot to include the code for the program PopenToPDF.py. Here it is now:

# PopenToPDF.py
# Demo program to read text from a shell pipeline using 
# subprocess.Popen, and write the text to PDF using xtopdf.
# Author: Vasudev Ram
# Copyright (C) 2016 Vasudev Ram - http://jugad2.blogspot.com

import sys
import subprocess
from PDFWriter import PDFWriter

def error_exit(message):
    sys.stderr.write(message + '\n')
    sys.stderr.write("Terminating.\n")
    sys.exit(1)

def main():
    # Start with no PDFWriter, so the finally clause can tell whether
    # one was actually created before trying to close it.
    pw = None
    try:
        # Create and set up a PDFWriter instance.
        pw = PDFWriter("PopenTo.pdf")
        pw.setFont("Courier", 12)
        pw.setHeader("Use subprocess.Popen to read pipe and write to PDF.")
        pw.setFooter("Done using selpg, xtopdf, Python and ReportLab, on Linux.")

        # Set up a pipeline with nl and selpg such that we can read from its stdout.
        # nl numbers the lines of the input.
        # selpg extracts pages 3 to 5 from the input.
        pipe = subprocess.Popen("nl -ba 1000-lines.txt | selpg -s3 -e5",
            shell=True, bufsize=-1, stdout=subprocess.PIPE,
            stderr=sys.stderr).stdout

        # Read from the pipeline and write the data to PDF, using the PDFWriter instance.
        for idx, line in enumerate(pipe):
            pw.writeLine(str(idx).zfill(8) + ": " + line)
    except IOError as ioe:
        error_exit("Caught IOError: {}".format(str(ioe)))
    except Exception as e:
        error_exit("Caught Exception: {}".format(str(e)))
    finally:
        # Close the PDF only if the PDFWriter was created successfully.
        if pw is not None:
            pw.close()

main()

I ran it in the usual way with:

$ python PopenToPDF.py

to get the output shown in the previous post describing PopenToPDF.

Also, this is the one-off script, gen-file.py, that created the 1000 line input file:

with open("1000-lines.txt", "w") as fil:
    for i in range(1000):
        fil.write("This is a line of text.\n")

- Vasudev

Signup to hear about new products and services I create.


Generate PDF from a Python-controlled Unix pipeline

By Vasudev Ram

This post is about a new xtopdf app I wrote, called PopenToPDF.py.

(xtopdf is my PDF generation toolkit, written in Python. The toolkit consists of a core library and multiple applications built using it.)

This program, PopenToPDF, shows how to use xtopdf to generate PDF output from any Python-controlled Unix pipeline. It uses the subprocess Python module.

I had written a few posts earlier about the uses of StdinToPDF.py, another xtopdf app [1].

(There are many kinds of pipelines; it is a powerful concept.)

StdinToPDF is an application of xtopdf that can be used at the end of a Unix or Windows pipeline, to publish the text output of the pipeline to PDF.

[1] Here are some of those posts about StdinToPDF:

a) PDFWriter can create PDF from standard input

b) Print selected text pages to PDF with Python, selpg and xtopdf on Linux

c) Generate Windows Task List to PDF with xtopdf

PopenToPDF has the same general goal as StdinToPDF (to allow creation of a pipeline whose final output is PDF), but works somewhat differently.

Instead of just being used passively (like StdinToPDF) as the last component in a pipeline run from the command line, PopenToPDF is a Python program that itself sets up and runs a pipeline (of all the preceding commands, excepting itself), using subprocess.Popen. It then reads the output of that pipeline programmatically and converts the text it reads to PDF. So it is a different approach that may allow for other possibilities for customization.

For the example, I created an input text file of 1000 lines, via a small one-off script. The file is called 1000-lines.txt.

The pipeline (created by PopenToPDF) runs "nl -ba" to add sequential line numbers to each line of the input file. (nl is a Unix command to number lines.) Then the output is passed to my selpg utility (a command-line utility in C), which is a filter that reads its input and selects only a specified range of pages to pass on to the output. (Full details of the selpg utility, including explanation of its logic, source code, and the build steps, are at the URL in the previous sentence, or at links accessible from that URL.)

(This page on sites.harvard.edu is a good resource for Linux command line utility development, and also references my IBM dW article about selpg.)

PopenToPDF sets up the above pipeline (nl -ba piped to selpg), and then reads all the lines from it, adds its own line numbers to the input, and writes it all to a PDF file.

Thus we end up with two sets of line numbers prefixed to each line (in the PDF): the original line numbers added by the nl command, which represent the position of each line extracted from the original file, and the serial numbers (starting from 0) for the subset of lines that PopenToPDF sees.

I did this so that we could verify that the pipeline is extracting the right lines that we specified, by looking at the relative and absolute line numbers in the output (screenshots below).

Here is a screenshot of the first page of the PDF output:

And here is a screenshot of the last page, page 4, of the PDF output:

You can see that the last relative line number (added by PopenToPDF, in the extreme left number column) is 215, and the first was 0 (on the first page), so the number of lines extracted by selpg is 216. That corresponds to what we asked selpg for by specifying a start page of 3 (-s3) and an end page of 5 (-e5), since there are 72 lines per page (the default) and 72 * (5 - 3 + 1) = 72 * 3 = 216. You can do a similar calculation for the absolute line numbers shown, to verify that we have extracted not only the right number of pages, but also the right pages.
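The calculation for the absolute line numbers can be written as a few lines of Python. This is only a sketch of the arithmetic; the function name page_line_range is made up for illustration, and it assumes fixed-length pages of 72 lines (the selpg default mentioned above) with line numbers starting at 1, as nl produces.

```python
def page_line_range(start_page, end_page, lines_per_page=72):
    """Return (first_line, last_line, line_count) for a range of pages.

    Pages and lines are both numbered from 1, and every page is
    assumed to hold exactly lines_per_page lines.
    """
    first_line = (start_page - 1) * lines_per_page + 1
    last_line = end_page * lines_per_page
    return first_line, last_line, last_line - first_line + 1


# Pages 3 to 5 at 72 lines per page.
print(page_line_range(3, 5))  # prints (145, 360, 216)
```

So the nl line numbers in the PDF should run from 145 to 360 if the right pages were selected.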

So this approach (using Popen) can be used to run a pipeline under control of a Python program, read the output of the pipeline, and do some further processing on it. Obviously, it is a generic approach, not limited to producing PDF. It could be used for any other purpose where you want to run a pipeline under program control and do something with the output of the pipeline, in your own Python code.
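As an illustration of how generic the approach is, here is a minimal sketch that runs a pipeline and hands its output lines to the caller, with no PDF involved. Note that it differs from PopenToPDF: instead of passing one shell command string with shell=True, it chains one subprocess.Popen object per stage. The function name read_pipeline is hypothetical, and the two stages in the demonstration are small Python one-liners chosen only so the example runs anywhere.

```python
import subprocess
import sys


def read_pipeline(commands):
    """Run commands (a list of argv lists) as a pipeline.

    Yields the text lines written by the last stage.
    """
    procs = []
    prev_stdout = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=prev_stdout,
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        if prev_stdout is not None:
            # The child now holds its own copy of this pipe end.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    for line in prev_stdout:
        yield line
    prev_stdout.close()
    for proc in procs:
        proc.wait()


if __name__ == "__main__":
    stages = [
        [sys.executable, "-c", "for i in range(3): print(i)"],
        [sys.executable, "-c",
         "import sys\nfor l in sys.stdin: sys.stdout.write('n:' + l)"],
    ]
    for line in read_pipeline(stages):
        sys.stdout.write(line)
```

On a Unix system, the stages could instead be something like ["nl", "-ba", "1000-lines.txt"] followed by ["selpg", "-s3", "-e5"], and the loop body could do whatever processing is needed.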

I'll end with a few points about related topics:

This program is actually an example of a category of data processing operations commonly used in organizations, which can be broadly described as starting with some data source, and passing it through a series of transformations until we have the final output we want.

Often, but not always, the input for these transformations is downloaded from some database or application (of the organization), and/or the output is uploaded to another database or application (also of the organization).

In some of these cases, the process is called ETL, meaning Extract, Transform, Load. This operation is also related to IT system integration.

In general, these tasks can consist of a combination of the use of existing components (programs) and purpose-written code in a compiled or interpreted language. The operation can also consist of a combination of manual and automated steps.

When there is enough uniformity in the data and needed processing rules across runs, using more automation leads to more time and cost savings. Some amount of variation in the data or rules can be handled by parameterization of input and output filenames, database connections, table names, use of conditional logic, etc.

Finally, in the process of writing this program and post (across a couple of sessions), I came across mentions of microservices in tech forums. Microservices have been in the news for a while. So I looked up definitions of microservices and realized that they are in some ways similar to Unix pipelines and the Unix philosophy of creating small tools that do one thing well, and then combining them to achieve bigger tasks.

If you're interested in pipes and Python and their intersection, also check out this HN comment by me, which lists multiple other Python pipe-like tools, including one (pipe_controller) by me:

Yes, pyp is interesting. So are some other roughly similar Python tools

- Enjoy.

- Vasudev Ram - Online Python training and programming


Wednesday, December 23, 2015

Generate Windows Task List to PDF with xtopdf

By Vasudev Ram

While working at the DOS command line in Windows, I had the idea of using the DOS TASKLIST command along with xtopdf, my PDF generation toolkit, to generate a list of currently running Windows tasks to a PDF file, along with some other info, such as whether a task is a service or a console process, the process id, the memory usage, etc. The TASKLIST command shows all that information, by default.

I also sorted the output in ascending order by the Mem Usage field, by passing it through the DOS SORT command. (I could have sorted it by any other field such as the Image Name or the PID, of course.) I starred out some of the fields in the output.

Here are the steps to generate a Windows task list as a PDF, using xtopdf:

( I use $ as the prompt, even in DOS :)

1: Run TASKLIST and redirect its output to a text file.

$ tasklist > tasklist.out

2: Sort the file into another file.

$ sort /+65 tasklist.out > tasklist.srt

(Sort the output of TASKLIST by the character position of the Mem Usage field.)

3: Go edit tasklist.srt to put the header lines back at the top :)
[ They get dislodged by the sort. ]

[ This is not Unix, so you can't easily do the fast, fluid command-line data munging that you can on Unix, unless you use something like Cygwin or UWin.

UWin was developed by David Korn, creator of the Korn Shell, for Windows. You can get UWin from the AT&T site here (after doing a convoluted license agreement dance, last time I checked). But IMO, the dance is not too long, and is worth it, to get a suite of Unix tools that work well on Windows, and UWin is also smaller & lighter than Cygwin, though not so comprehensive.
Be sure to read the section "Korn shell and Microsoft" at the David Korn link above :-) ]

4: Pipe the sorted task list to StdinToPDF, to generate the PDF output.

$ type tasklist.srt | python StdinToPDF.py tasklist.pdf

We just pipe the sorted output of TASKLIST to StdinToPDF.py (an xtopdf app), which can be used at the end of any arbitrary command pipeline that generates text (on Unix / Windows / Linux / Mac OS X), to convert that text to PDF.
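The manual edit in step 3 could be avoided by doing the sort in Python and leaving the header lines where they are. The following is a rough sketch, not part of xtopdf: the function name sort_keep_header is made up, the default of 3 header lines is an assumption about the TASKLIST layout (a blank line, the column titles, and a row of = signs), and the sort key is plain Python string comparison from column 65 onward, which may not order lines exactly as the DOS SORT command does.

```python
def sort_keep_header(lines, header_lines=3, start_col=65):
    """Sort lines by the text from start_col (1-based) onward.

    The first header_lines lines are kept in place at the top.
    """
    header = lines[:header_lines]
    body = sorted(lines[header_lines:],
                  key=lambda line: line[start_col - 1:])
    return header + body


# Tiny made-up sample, sorted on the text starting at column 7.
sample = ["", "Name  Mem", "====  ===",
          "b     300", "a     100", "c     200"]
for line in sort_keep_header(sample, header_lines=3, start_col=7):
    print(line)
```

To use it on real data, the lines of tasklist.out would be read into a list, passed through sort_keep_header, and written to tasklist.srt, which can then be piped to StdinToPDF.py as in step 4.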

A screenshot of the PDF output I got (viewed in Foxit PDF Reader), is shown at the top of this post.

- Enjoy.

- Vasudev Ram - Online Python training and programming


Monday, December 21, 2015

Microsoft to acquire Linux ...

... skills. :-)

So says:

The Linux Foundation (https://mobile.twitter.com/linuxfoundation/status/678665931434815490)

- Vasudev Ram
jugad2.blogspot.com

Sunday, November 1, 2015

data_dump, a Python tool like Unix od (octal dump)

- Vasudev Ram - Online Python training and programming

The Unix od command, which stands for octal dump, should be known to regular Unix users. Though the name includes the word octal (for historical reasons) [1], it supports other numeric systems as well; see below.

[1] See:

The Wikipedia page for od, which says that "od is one of the earliest Unix programs, having appeared in version 1 AT&T Unix."

od is a handy tool. It dumps the contents of a file (or standard input) to standard output, in "unambiguous" ways, such as the ability to show the file contents as numeric values (ASCII codes), interpreted as bytes / two-byte words / etc. It can do this in octal, decimal, binary or hexadecimal format. It can also show the content as characters. But the Unix cat command does that already, so the od command is more often used to show characters along with their numeric codes. It also shows the byte offset (from the start of the file) of every, say, 10th character in the file, in the left column of its output, so the user can keep track of where any content occurs in the file.

All this is useful because it allows Unix users (programmers and system administrators as well as end users) to inspect the contents of files in different ways (hex, binary, character, etc.). The files thus inspected could be text files or binary files of any kind. Often, programmers use the output of od to debug their application, by viewing a file that their program is either reading from or writing to, to verify that it contains what they expect, or to find that it contains something that they do not expect - which could be due either to invalid input or to a bug in their program causing incorrect output.

I needed to use od recently. Doing so made me think of writing a simple version of it in Python, for fun and practice. So I did it. I named it data_dump.py. Here is the code for it:

'''
Program name: data_dump.py
Author: Vasudev Ram.
Copyright 2015 Vasudev Ram.
Purpose: To dump the contents of a specified file or standard input, 
to the standard output, in one or more formats, such as:
    - as characters
    - as decimal numbers
    - as hexadecimal numbers
    - as octal numbers
    
Inspired by the od (octal dump) command of Unix, and intended to work,
very roughly, like it. Will not attempt to replicate od exactly or even 
closely. May diverge from od's way of doing things, as desired.
'''

# Imports:

from __future__ import print_function
import sys

# Global constants:

# Maximum number of characters (from the input) to output per line.
MAX_CHARS_PER_LINE = 16

# Global variables:

# Functions:

def data_dump(infil, line_len=MAX_CHARS_PER_LINE, options=None):
    '''
    Dumps the data from the input source infil to the standard output.
    '''
    byte_addr = 0
    buf = infil.read(line_len)
    # While not EOF.
    while buf != '':
        # Print the offset of the first character to be output on this line.
        # The offset refers to the offset of that character in the input,
        # not in the output. The offset is 0-based.
        sys.stdout.write("{:>08s}: ".format(str(byte_addr)))

        # Print buf in character form, with . for control characters.
        # TODO: Change to use \n for line feed, \t for tab, etc., for 
        # those control characters which have unambiguous C escape 
        # sequences.
        byte_addr += len(buf)
        for c in buf:
            sys.stdout.write('  ') # Left padding before c as char.
            if (0 <= ord(c) <= 31) or (ord(c) == 127):
                sys.stdout.write('.')
            else:
                sys.stdout.write(c)
        sys.stdout.write('\n')

        # Now print buf in hex form.
        sys.stdout.write(' ' * 10) # Padding to match that of byte_addr above.
        for c in buf:
            sys.stdout.write(' ') # Left padding before c in hex.
            sys.stdout.write('{:>02s}'.format((hex(ord(c))[2:].upper())))
        sys.stdout.write('\n')
        buf = infil.read(line_len)
    infil.close()


def main():
    '''
    Checks the arguments, sets option flags, sets input source.
    Then calls data_dump() function with the input source and options.
    '''
    try:
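        lsa = len(sys.argv)
        if lsa == 1:
            # Input from standard input.
            infil = sys.stdin
        elif lsa == 2:
            # Input from a file.
            infil = open(sys.argv[1], "rb")
        else:
            # Too many arguments: show usage instead of failing later
            # with an undefined infil.
            sys.stderr.write("Usage: data_dump.py [filename]\n")
            sys.exit(1)
        data_dump(infil)
        sys.exit(0)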
    except IOError as ioe:
        print("Error: IOError: " + str(ioe))
        sys.exit(1)

if __name__ == '__main__':
    main()

And here is the output of a sample run, on a small text file:

$ data_dump.py t3
00000000:   T  h  e     q  u  i  c  k     b  r  o  w  n
           54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20
00000016:   f  o  x     j  u  m  p  e  d     o  v  e  r
           66 6F 78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20
00000032:   t  h  e     l  a  z  y     d  o  g  .  .  .  T
           74 68 65 20 6C 61 7A 79 20 64 6F 67 2E 0D 0A 54
00000048:   h  e     q  u  i  c  k     b  r  o  w  n     f
           68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 66
00000064:   o  x     j  u  m  p  e  d     o  v  e  r     t
           6F 78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20 74
00000080:   h  e     l  a  z  y     d  o  g  .  .  .  T  h
           68 65 20 6C 61 7A 79 20 64 6F 67 2E 0D 0A 54 68
00000096:   e     q  u  i  c  k     b  r  o  w  n     f  o
           65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20 66 6F
00000112:   x     j  u  m  p  e  d     o  v  e  r     t  h
           78 20 6A 75 6D 70 65 64 20 6F 76 65 72 20 74 68
00000128:   e     l  a  z  y     d  o  g  .
           65 20 6C 61 7A 79 20 64 6F 67 2E

$

Note that I currently replace control / non-printable characters by a dot, in the output. Another option could be to replace (at least some of) them with C escape sequences, such as \r (carriage return, ASCII 13), \n (line feed, ASCII 10), etc. That is the way the original od does it.
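Here is a small sketch of that other option. It is not the code of data_dump.py or of od; the names C_ESCAPES, show_byte and dump_line are invented for illustration, and the set of escapes handled is just the common single-character C ones. Control characters without such an escape still come out as a dot.

```python
# Control characters that have a well-known C escape sequence.
C_ESCAPES = {
    0: r"\0", 7: r"\a", 8: r"\b", 9: r"\t",
    10: r"\n", 11: r"\v", 12: r"\f", 13: r"\r",
}


def show_byte(b):
    """Return a short printable form of the byte value b (0 to 255)."""
    if b in C_ESCAPES:
        return C_ESCAPES[b]
    if b < 32 or b >= 127:
        return "."
    return chr(b)


def dump_line(buf):
    """Return (chars, hexes): two aligned strings for the bytes in buf."""
    values = bytearray(buf)
    chars = " ".join("{:>2s}".format(show_byte(b)) for b in values)
    hexes = " ".join("{:02X}".format(b) for b in values)
    return chars, hexes


chars, hexes = dump_line(b"g.\r\nT")
print(chars)
print(hexes)
```

Each character cell is two columns wide, so a two-character escape such as \r lines up with its hex code on the row below.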

In a future post, I'll make some improvements, and also show and discuss some interesting and possibly anomalous results that I got when testing data_dump.py with different inputs.

Happy dumping! :)

Details of the above image are available here:

Truck image credits


CHIP – The World’s First $9 Computer | Ardevon


Sunday, July 5, 2015

Nine dollar Linux computer

The game's afoot, Watson!

- Vasudev Ram
jugad2.blogspot.com
dancingbison.com (site down a while due to changing host taking time)

Friday, May 8, 2015

tabtospaces, utility to change tabs to spaces in Python files

- Vasudev Ram - Online Python training and programming

Near the end of a recent blog post:

asciiflow.com: Draw flowcharts online, in ASCII

, I showed how this small snippet of Python code can be used to make a Python program usable as a component in a Unix pipeline:

for lin in sys.stdin:
    sys.stdout.write(process(lin))

Today I saw Raymond Hettinger (@raymondh)'s tweet about the -t and -tt command line options of Python:

#python tip: In Python 2, the -tt option raises an error when you foolishly mix spaces and tabs. In Python 3, that is always an error.

That made me think of writing a simple Python 2 tool to change the tabs in a Python file to spaces. Yes, I know it can be easily done in Unix or Windix [1] with any of sed / awk / tr etc. That's not the point. So here is tabtospaces.py:

import sys
for lin in sys.stdin:
    sys.stdout.write(lin.replace("\t", "    "))

[ Note: this code converts each tab into 4 spaces. It can be parameterized by passing a command-line option that specifies the number of spaces, such as 4 or 8, and then replacing each tab with that many spaces. Also note that I have not tested the program on many sets of data, just one for now. ]
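The parameterization mentioned in the note could look something like the sketch below. The function name tabs_to_spaces and its width parameter are made up for this example; the demonstration uses in-memory streams so that it runs without any input file.

```python
import io


def tabs_to_spaces(in_stream, out_stream, width=4):
    """Copy in_stream to out_stream, replacing each tab with width spaces."""
    spaces = " " * width
    for lin in in_stream:
        out_stream.write(lin.replace("\t", spaces))


# In-memory demonstration with 8 spaces per tab.
src = io.StringIO(u"def f():\n\treturn 1\n")
dst = io.StringIO()
tabs_to_spaces(src, dst, width=8)
print(dst.getvalue())
```

To use it as a pipeline component like tabtospaces.py, one would call tabs_to_spaces(sys.stdin, sys.stdout, int(sys.argv[1])) after importing sys, with suitable checking of the argument.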

I created a simple Python file, test1.py, that has mixed tabs and spaces to use as input to tabtospaces.py. Then I ran the following commands:

$ py -tt test1.py
  File "test1.py", line 4
    print arg,
              ^
TabError: inconsistent use of tabs and spaces in indentation

$ py tabtospaces.py < test1.py > test2.py

$ py -tt test2.py
0 1 2 3 4 5 6 7 8 9

which shows that tabtospaces.py does convert the tabs to spaces.

And you can see from this diff that the original test1.py and the test2.py generated by tabtospaces.py, differ only in the use of tabs vs. spaces:

$ fc /l test1.py test2.py
Comparing files test1.py and TEST2.PY
***** test1.py
    for arg in args:
                print arg,

***** TEST2.PY
    for arg in args:
        print arg,

*****

[1] Windix is the latest upcoming Unix-compatible OS from M$, due Real Soon Now. You heard it here first - TM.

Dancing Bison Enterprises

Signup to hear about new software products or info-products that I create.


Sunday, April 19, 2015

asciiflow.com: Draw flowcharts online, in ASCII

- Vasudev Ram - Online Python and Linux training;
freelance Python programming

Saw this today: asciiflow.com

asciiflow.com is a site that allows you to draw flowcharts online, on their site, using the metaphor of a drag-and-drop paint program like MS Paint, but the flowcharts are drawn entirely using ASCII characters.

I tried it out a bit. Innovative.

One point is that to save the flowchart, it requires access to your Google Drive account.

The image at the top of this page, is of a flowchart that I created with asciiflow.com. I did not use the Save feature, but instead took a screenshot and saved it as a PNG file (using MS Paint, ha ha). The flowchart shows a diagram that illustrates the concept of a UNIX command pipeline, where the standard output of a preceding program becomes the standard input of a succeeding one (in the pipeline). (How's that for using web-based and Windows software to illustrate something about UNIX? :)

For another example of the innovative use of ASCII characters, check out this post I wrote somewhat recently, about the Python library called PrettyTable, which lets you generate visually appealing tables of data, bordered and boxed by ASCII characters:

PrettyTable to PDF is pretty easy with xtopdf

Also, since we're talking about standard input and output and UNIX pipelines, these two posts may be of interest:

1) [xtopdf] PDFWriter can create PDF from standard input

(The post at the above link also has an example of eating your own dog food.)

2) Print selected text pages to PDF with Python, selpg and xtopdf on Linux

Generalizing from a fragment of code in post 1) above, I'll also note that making a Python program usable as a component of a UNIX pipeline, can, in some cases, be as simple as having something like this in your code:

import sys
# ...
    for lin in sys.stdin:
        lin = process(lin)
        sys.stdout.write(lin)

which could be shortened to:

for lin in sys.stdin:
    sys.stdout.write(process(lin))

Due to this (being able to easily make a Python program into a component of a UNIX pipeline), you can do things like this (and more):

$ foo | bar | baz

where foo may be a built-in UNIX command (a filter) or a shell script, bar may be (for example) a Perl program that leverages some powerful Perl features, and baz may be a Python program that leverages some powerful Python features. This applies the UNIX philosophy of writing small programs, each of which does one thing well. In this case, it also uses the strengths of different languages (each of which may do some things better than others), by writing individual components in those respective languages. The possibilities are limitless ...

- Enjoy.

Dancing Bison Enterprises

Signup to hear about new software products that I create.


Friday, April 17, 2015

Linux skills in high demand in 2015, says Linux Foundation newsletter

- Vasudev Ram - Online Python and Linux training and programming

Just saw this news via the newsletter that I get from the Linux Foundation:

Linux skills are going to be in high demand in 2015, according to a survey carried out by Dice.com and the Linux Foundation. This is the 4th year in a row that the survey has been done.

Excerpts from the report:
[
“Competition for Linux talent is accelerating, as the software becomes more ubiquitous,” said Shravan Goli, President of Dice.

“Demand for Linux talent continues apace, and it’s becoming more important for employers to be able to verify candidates have the skillsets they need,” said Jim Zemlin, executive director at The Linux Foundation.
]

Dancing Bison Enterprises

Signup to hear about new software or info products that I create.


Friday, March 20, 2015

A simple UNIX-like "which" command in Python

- Vasudev Ram - Online Python training and programming

UNIX users are familiar with the which command. Given an argument called name, it checks the system PATH environment variable, to see whether that name exists (as a file) in any of the directories specified in the PATH. (The directories in the PATH are colon-separated on UNIX and semicolon-separated on Windows.)

I'd written a Windows-specific version of the which command some time ago, in C.

Today I decided to write a simple version of the which command in Python. In the spirit of YAGNI and incremental development, I tried to resist the temptation to add more features too early; but I did give in once and add the exit code stuff near the end :)

Here is the code for which.py:

from __future__ import print_function

# which.py
# A minimal version of the UNIX which utility, in Python.
# Author: Vasudev Ram - www.dancingbison.com
# Copyright 2015 Vasudev Ram - http://www.dancingbison.com

import sys
import os
import os.path
import stat

def usage():
    sys.stderr.write("Usage: python which.py name\n") 
    sys.stderr.write("or: which.py name\n") 

def which(name):
    found = 0 
    for path in os.getenv("PATH").split(os.path.pathsep):
        full_path = path + os.sep + name
        if os.path.exists(full_path):
            """
            if os.stat(full_path).st_mode & stat.S_IXUSR:
                found = 1
                print(full_path)
            """
            found = 1
            print(full_path)
    # Return a UNIX-style exit code so it can be checked by calling scripts.
    # Programming shortcut to toggle the value of found: 1 => 0, 0 => 1.
    sys.exit(1 - found)

def main():
    if len(sys.argv) != 2:
        usage()
        sys.exit(1)
    which(sys.argv[1])

if "__main__" == __name__:
        main()

And here are a few examples of using the command:

(Note: the tests are done on Windows, though the command prompt is a $ sign (UNIX default); I just set it to that because I like $'s and UNIX :)

$ which vim
\vim

$ which vim.exe
C:\Ch\bin\vim.exe

$ set PATH | grep -i vim73

$ addpath c:\vim\vim73

$ which.py vim.exe
C:\Ch\bin\vim.exe
c:\vim\vim73\vim.exe

$ which metapad.exe
C:\util\metapad.exe

$ which pscp.exe
C:\util\pscp.exe
C:\Ch\bin\pscp.exe

$ which dostounix.exe
C:\util\dostounix.exe

$ which pythonw.exe
C:\Python278\pythonw.exe
D:\Anaconda-2.1.0-64\pythonw.exe

# Which which is which? All four combinations:

$ which which
.\which

$ which.py which
.\which

$ which which.py
.\which.py

$ which.py which.py
.\which.py

As you can see, calling the which Python command with different arguments, gives various results, including sometimes finding one instance of vim.exe and sometimes two instances, depending on the values in the PATH variable (which I changed, using my addpath.bat script, to add the \vim\vim73 directory to it).

Also, it works when invoked either as which.py or just which.

I'll discuss my interpretation of these variations in an upcoming post - including a variation that uses os.stat(full_path).st_mode - see the commented part of the code under the line:

if os.path.exists(full_path):
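Until that post, here is one possible shape for such a variation, as a sketch only. It is not the code of which.py: the function name find_in_path and its parameters are invented for this example, and it uses os.access with os.X_OK for the executable check rather than os.stat(full_path).st_mode. It also skips empty PATH entries and returns the matches instead of printing them.

```python
import os


def find_in_path(name, path=None, require_executable=True):
    """Return a list of full paths to name found in the PATH directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Skip empty entries rather than searching the root directory.
            continue
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue
        if require_executable and not os.access(full_path, os.X_OK):
            continue
        matches.append(full_path)
    return matches


for found in find_in_path("python"):
    print(found)
```

A caller that wants the UNIX-style exit code of which.py can use sys.exit(0 if matches else 1) on the returned list.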

Meanwhile, did you know that YAGNI was written about long before agile was a thing? IIRC, I've seen it described in either Kernighan and Ritchie (The C Programming Language) or in Kernighan and Pike (The UNIX Programming Environment). It could possibly be older than that, say from the mainframe era.

Finally, as I was adding labels to this blog post, Blogger showed me "pywhich" as a label, after I typed "which" in the labels box. That reminded me that I had written another post earlier about a Python which utility (not by me), so I found it on my blog by typing in this URL:

http://jugad2.blogspot.in/search/label/pywhich

which finds all posts on my blog with the label 'pywhich' (and the same approach works for any other label); the resulting post is:

pywhich, like the Unix which tool, for Python modules.

- Enjoy.

Dancing Bison Enterprises

Signup to hear about new products that I create.

Click here to get email about new products from Vasudev Ram.


Friday, October 24, 2014

Print selected text pages to PDF with Python, selpg and xtopdf on Linux

By Vasudev Ram

In a recent blog post, titled My IBM developerWorks article, I talked about a tutorial that I had written for IBM developerWorks a while ago. The tutorial showed some of the recommended techniques and practices to follow when writing a Linux command-line utility that is intended for production use, and how to write it in such a way that it can easily cooperate with existing UNIX command-line tools, when used in a UNIX command pipeline.

This ability of properly written command-line tools to cooperate with each other when used in a pipeline, is, as I said in that IBM article, one of the keys to the power of Linux (and UNIX) as a development environment. (See the classic book The UNIX Programming Environment, for much more on this topic.)

The utility I wrote and discussed (in that IBM article), called selpg (for SELect PaGes), allows the user to select a specified range of pages from a text file. At the end of the aforementioned blog post, I had said that I would show some practical uses of the selpg utility later. I describe one such use case below, involving a combination of selpg and my xtopdf toolkit, which is a Python library for PDF creation.

(The xtopdf toolkit contains a PDF creation library, and also includes some sample applications that show how to use the library to create PDF output in various ways, and from various input sources, which is why I tend to call xtopdf a toolkit instead of just a library.)

I had written one such application of xtopdf a while ago, called StdinToPDF(.py) (for standard input to PDF). I blogged about it at the time, here:

[xtopdf] PDFWriter can create PDF from standard input. (PDFWriter is a module of xtopdf, which provides the core PDF creation functionality.)

The selpg utility can be used with StdinToPDF, in a pipeline, to select a range of pages (by starting and ending page numbers) from a (possibly large) text file, and write only those selected pages to a PDF file. Here is an example of how to do that:

First, build the selpg utility from source, for your Linux OS. selpg is only meant to work on Linux, since it uses some Linux C standard library functions, such as from stdio.h, and popen(); but you can try to run it on Windows (at your own risk), since Windows does have (had?) a POSIX subsystem, from Windows NT onward. I have used it in the past. (Update: I checked - according to this section of the Wikipedia article about POSIX, Windows may have had POSIX support only from Windows NT up to Windows 2000.) Anyway, to build selpg on Linux, follow the steps below (the $ sign is the shell prompt and not to be typed):

1. Download the source code from the sources section of the selpg project repository on Bitbucket.

Download all of these files: makefile, mk, selpg.c and showsyserr.c .

2. Make the (shell script) file mk executable, with the command:

$ chmod u+x mk

3. Then run the file mk, with:

$ ./mk

That will run the makefile that builds the selpg executable using the C compiler on your Linux box. The C compiler (invoked as cc or gcc) is installed on most mainstream Linux distributions. If it is not, you will need to install it from the repository for your Linux distribution. Sometimes only a minimal version of a C compiler is installed, which is only enough to (re)compile the kernel after making kernel parameter changes, such as for performance tuning. Consult your local Linux expert for help if such is the case.

4. Now make the file selpg executable, with the command:

$ chmod u+x selpg

5. (Optional) You can check the usage of selpg by reading the IBM tutorial article and/or running selpg without any command-line arguments:

$ ./selpg

which will show a usage message.

6. (Optional) You can run selpg a few times with some text file(s) as input, and different values for the -s and -e command-line options, to get a feel for how it works.

Now download xtopdf (which includes StdinToPDF) from here:

xtopdf on Bitbucket.

To install it, follow the steps given in this post:

Guide to installing and using xtopdf, including creating simple PDF e-books

That post was written a while ago, when xtopdf was hosted on SourceForge. So you need to make one change to the instructions given in that guide: instead of downloading xtopdf from SourceForge, as stated in Step 5 of the guide, get it from the xtopdf Bitbucket link I gave above.

(To make xtopdf work, you also have to install ReportLab, which xtopdf uses internally; the steps for that are given in my xtopdf installation guide linked above, or you can also look at the instructions in the ReportLab distribution. It is easy, just a couple of steps - download, unzip, configure a setting or two.)

Once you have both selpg and xtopdf installed, you can use selpg and StdinToPDF together. Here is an example run, to select only pages 2 through 4 from an input text file:

I wrote a simple Python program, gen_selpg_test_file.py, to create a text file that can be used to test the selpg and StdinToPDF programs together.

Here is an excerpt of the core logic of gen_selpg_test_file.py, omitting argument and error handling for brevity (I have those in the actual code):

# Generate the test file with the given filename and number of lines of text.
    try:
        out_fil = open(out_filename, "w")
    except IOError as ioe:
        sys.stderr.write("Error: Could not open output file {}.\n".format(out_filename))
        sys.exit(1)
    for line_num in range(1, num_lines + 1):
        line = "Line #" + str(line_num).zfill(10) + "\n"
        out_fil.write(line)
    out_fil.close()

I ran it like this:

$ python gen_selpg_test_file.py selpg_test_file_1000.txt 1000

to generate a text file with 1000 lines, in the file selpg_test_file_1000.txt .

Then I could run the pipeline using selpg and StdinToPDF, as described above:

$ ./selpg -s2 -e4 selpg_test_file_1000.txt | python StdinToPDF.py p2-p4.pdf

This command extracts only the specified pages (2 to 4) from the input file, and pipes them to StdinToPDF, which converts those pages only, to PDF, in the filename specified at the end of the command.

After doing the above, you can open the file p2-p4.pdf in your favorite PDF reader (Evince is one PDF reader for Linux), to confirm that it contains all (and only) the lines from page 2 to 4 of the input file selpg_test_file_1000.txt (considering 72 lines per page, which is the default that selpg uses).

Read the IBM article to see how that default can be changed - to either another number of lines per page, e.g. 66 or 80 or whatever, or to specify form feeds (ASCII code 12) as the page delimiter. Form feeds are often used as a page delimiter in text file reports generated by programs, when the reports are destined for a printer, since the form feed character causes the printer to advance the print head to the top of the next page/form (that's how the character got its name).
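To make the page selection concrete, here is a rough Python sketch of the fixed-lines-per-page case. It is not how selpg itself is implemented (selpg is written in C, and its real logic is described in the IBM article); the function name select_pages is made up, and form-feed-delimited pages are not handled.

```python
from itertools import islice


def select_pages(lines, start_page, end_page, lines_per_page=72):
    """Yield the lines on pages start_page to end_page (both 1-based)."""
    first = (start_page - 1) * lines_per_page   # index of first wanted line
    last = end_page * lines_per_page            # one past the last wanted line
    return islice(lines, first, last)


# Same layout as the test file: "Line #" plus a 10-digit line number.
lines = ["Line #" + str(n).zfill(10) + "\n" for n in range(1, 1001)]
selected = list(select_pages(lines, 2, 4))
print(len(selected))       # prints 216
print(selected[0], end="")
print(selected[-1], end="")
```

With 72 lines per page, pages 2 to 4 start at line 73 and end at line 288, which is what the first and last selected lines should show.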

Though this post seemed long, note that a lot of it was either background information or instructions on how to build selpg and install xtopdf. Those are both one time jobs. Once those are done, you can select the needed pages from any text file and print them to PDF with a single command-line, as shown in the last command above.

This is useful when you printed the entire file earlier, and some pages didn't print properly because the printer jammed. Just use selpg with xtopdf to print only the needed pages again.

The image above is from the Wikipedia article on Printing, and titled:

Jikji, "Selected Teachings of Buddhist Sages and Son Masters" from Korea, the earliest known book printed with movable metal type, 1377. Bibliothèque Nationale de France, Paris

- Enjoy.

- Vasudev Ram - Dancing Bison Enterprises