Friday, October 24, 2014

Print selected text pages to PDF with Python, selpg and xtopdf on Linux

By Vasudev Ram



In a recent blog post, titled My IBM developerWorks article, I talked about a tutorial that I had written for IBM developerWorks a while ago. The tutorial showed some of the recommended techniques and practices to follow when writing a Linux command-line utility that is intended for production use, and how to write it in such a way that it can easily cooperate with existing UNIX command-line tools, when used in a UNIX command pipeline.

This ability of properly written command-line tools to cooperate with each other when used in a pipeline is, as I said in that IBM article, one of the keys to the power of Linux (and UNIX) as a development environment. (See the classic book The UNIX Programming Environment, for much more on this topic.)

The utility I wrote and discussed (in that IBM article), called selpg (for SELect PaGes), allows the user to select a specified range of pages from a text file. At the end of the aforementioned blog post, I had said that I would show some practical uses of the selpg utility later. I describe one such use case below, involving a combination of selpg and my xtopdf toolkit, which is a Python library for PDF creation.

(The xtopdf toolkit contains a PDF creation library, and also includes some sample applications that show how to use the library to create PDF output in various ways, and from various input sources, which is why I tend to call xtopdf a toolkit instead of just a library.)

I had written one such application of xtopdf a while ago, called StdinToPDF(.py) (for standard input to PDF). I blogged about it at the time, here:

[xtopdf] PDFWriter can create PDF from standard input. (PDFWriter is a module of xtopdf, which provides the core PDF creation functionality.)

The selpg utility can be used with StdinToPDF, in a pipeline, to select a range of pages (by starting and ending page numbers) from a (possibly large) text file, and write only those selected pages to a PDF file. Here is an example of how to do that:

First, build the selpg utility from source, for your Linux OS. selpg is only meant to work on Linux, since it uses some Linux C standard library functions, such as from stdio.h, and popen(); but you can try to run it on Windows (at your own risk), since Windows does have (had?) a POSIX subsystem, from Windows NT onward. I have used it in the past. (Update: I checked - according to this section of the Wikipedia article about POSIX, Windows may have had POSIX support only from Windows NT up to Windows 2000.) Anyway, to build selpg on Linux, follow the steps below (the $ sign is the shell prompt and not to be typed):

1. Download the source code from the sources section of the selpg project repository on Bitbucket.

Download all of these files: makefile, mk, selpg.c and showsyserr.c .

2. Make the (shell script) file mk executable, with the command:
$ chmod u+x mk
3. Then run the file mk, with:
$ ./mk
That will run the makefile that builds the selpg executable using the C compiler on your Linux box. The C compiler (invoked as cc or gcc) is installed on most mainstream Linux distributions. If it is not, you will need to install it from the repository for your Linux distribution. Sometimes only a minimal version of a C compiler is installed, which is only enough to (re)compile the kernel after making kernel parameter changes, such as for performance tuning. Consult your local Linux expert for help if such is the case.

4. Now make the file selpg executable, with the command:
$ chmod u+x selpg
5. (Optional) You can check the usage of selpg by reading the IBM tutorial article and/or running selpg without any command-line arguments:
$ ./selpg
which will show a usage message.

6. (Optional) You can run selpg a few times with some text file(s) as input, and different values for the -s and -e command-line options, to get a feel for how it works.

Now download xtopdf (which includes StdinToPDF) from here:

xtopdf on Bitbucket.

To install it, follow the steps given in this post:

Guide to installing and using xtopdf, including creating simple PDF e-books

That post was written a while ago, when xtopdf was hosted on SourceForge. So you need to make one change to the instructions given in that guide: instead of downloading xtopdf from SourceForge, as stated in Step 5 of the guide, get it from the xtopdf Bitbucket link I gave above.

(To make xtopdf work, you also have to install ReportLab, which xtopdf uses internally; the steps for that are given in my xtopdf installation guide linked above, or you can also look at the instructions in the ReportLab distribution. It is easy, just a couple of steps - download, unzip, configure a setting or two.)

Once you have both selpg and xtopdf installed, you can use selpg and StdinToPDF together. Here is an example run, to select only pages 2 through 4 from an input text file:

I wrote a simple Python program, gen_selpg_test_file.py, to create a text file that can be used to test the selpg and StdinToPDF programs together.

Here is an excerpt of the core logic of gen_selpg_test_file.py, omitting argument and error handling for brevity (I have those in the actual code):

# Generate the test file with the given filename and number of lines of text.
    try:
        out_fil = open(out_filename, "w")
    except IOError as ioe:
        sys.stderr.write("Error: Could not open output file {}.\n".format(out_filename))
        sys.exit(1)
    for line_num in range(1, num_lines + 1):
        line = "Line #" + str(line_num).zfill(10) + "\n"
        out_fil.write(line)
    out_fil.close()
I ran it like this:
$ python gen_selpg_test_file.py selpg_test_file_1000.txt 1000
to generate a text file with 1000 lines, in the file selpg_test_file_1000.txt .

Then I could run the pipeline using selpg and StdinToPDF, as described above:
$ ./selpg -s2 -e4 selpg_test_file_1000.txt | python StdinToPDF.py p2-p4.pdf
This command extracts only the specified pages (2 to 4) from the input file, and pipes them to StdinToPDF, which converts only those pages to PDF, writing them to the file named at the end of the command.

After doing the above, you can open the file p2-p4.pdf in your favorite PDF reader (Evince is one PDF reader for Linux), to confirm that it contains all (and only) the lines from page 2 to 4 of the input file selpg_test_file_1000.txt (considering 72 lines per page, which is the default that selpg uses).
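To know what to look for in the PDF, it helps to work out which lines pages 2 to 4 should contain. The following is a minimal Python sketch of page selection by line count, assuming 72 lines per page as stated above. It is not selpg's actual C code, and the function name select_pages is made up for this illustration.

```python
def select_pages(lines, start_page, end_page, lines_per_page=72):
    """Return the lines on pages start_page..end_page (1-based, inclusive)."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("need 1 <= start_page <= end_page")
    first = (start_page - 1) * lines_per_page  # index of the first selected line
    last = end_page * lines_per_page           # one past the last selected line
    return lines[first:last]

# Rebuild the contents of the 1000-line test file in memory,
# using the same line format as gen_selpg_test_file.py.
test_lines = ["Line #" + str(n).zfill(10) + "\n" for n in range(1, 1001)]

selected = select_pages(test_lines, 2, 4)
print(selected[0].strip(), "...", selected[-1].strip(), len(selected))
```

With 72 lines per page, page 2 starts at line 73 and page 4 ends at line 288, so the PDF should contain 216 lines, from "Line #0000000073" to "Line #0000000288".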

Read the IBM article to see how that default can be changed - to either another number of lines per page, e.g. 66 or 80 or whatever, or to specify form feeds (ASCII code 12) as the page delimiter. Form feeds are often used as a page delimiter in text file reports generated by programs, when the reports are destined for a printer, since the form feed character causes the printer to advance the print head to the top of the next page/form (that's how the character got its name).
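As a rough illustration of the form feed approach, here is a small Python sketch that treats the form feed character as the page delimiter and keeps only a range of pages. It is only a sketch of the idea, not how selpg itself is implemented, and the function name and sample report text are invented for the example.

```python
FORM_FEED = "\f"  # ASCII code 12

def select_pages_by_form_feed(text, start_page, end_page):
    """Return pages start_page..end_page (1-based, inclusive) of text,
    where pages are separated by form feed characters."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("need 1 <= start_page <= end_page")
    pages = text.split(FORM_FEED)
    # Re-join the selected pages with form feeds so page breaks are preserved.
    return FORM_FEED.join(pages[start_page - 1:end_page])

# A made-up four-page report, with a form feed between pages.
report = "page one\n\fpage two\n\fpage three\n\fpage four\n"
print(repr(select_pages_by_form_feed(report, 2, 3)))
```

Unlike the fixed lines-per-page mode, pages here can have any number of lines, which suits reports whose page breaks were decided by the program that generated them.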

Though this post seemed long, note that a lot of it was either background information or instructions on how to build selpg and install xtopdf. Those are both one-time jobs. Once those are done, you can select the needed pages from any text file and print them to PDF with a single command line, as shown in the last command above.

This is useful when you printed the entire file earlier, and some pages didn't print properly because the printer jammed. Just use selpg with xtopdf to print only the needed pages again.



The image above is from the Wikipedia article on Printing, and titled:

Jikji, "Selected Teachings of Buddhist Sages and Son Masters" from Korea, the earliest known book printed with movable metal type, 1377. Bibliothèque Nationale de France, Paris

- Enjoy.

- Vasudev Ram - Dancing Bison Enterprises

Click here to get email about new products from Vasudev Ram.

Contact Page

Thursday, October 23, 2014

Google Inbox launched, successor to Gmail

By Vasudev Ram


Google has launched a new email product called Google Inbox.

Saw this via Hacker News:

A post about Google Inbox on the official Google blog:

An inbox that works for you.

Hacker News thread about Google Inbox.

Google is going to roll out Inbox in stages to various sets of people. If you want to get an invitation to it, you can email them at inbox@google.com. I did it. Once I get invited, if I find Google Inbox useful or interesting, I will write a post about it.

Meanwhile, here are a few features of Google Inbox mentioned in the official Google blog post:

Bundles (of emails) - like categories that they had before in Gmail.

Highlights - key information from important messages.

Reminders, Assists, and Snoozes.

Assist - if you write a reminder to call the hardware store, Assist will show you the store's phone number and whether it is open.

Snooze lets you snooze away emails and reminders, until a later time or until you reach another place, like your office.

Interestingly, Google seems to have made a somewhat poor choice of name for the product, again (after doing it with "Go" for the Go language), since in both cases, the word is very common and generic ("inbox" and "Go"), so it will be difficult to search for (even using Google, ironically).

Of course, there are workarounds, like using "golang" instead of "Go", and I'm guessing "Google Inbox" instead of just "Inbox", but those won't work as well as having a more unique name. I just did a Google search for the word "inbox", though, and www.google.com/inbox/ was the first hit.

- Vasudev Ram - Dancing Bison Enterprises


Monday, October 20, 2014

Published my first presentation on SpeakerDeck - using Python

By Vasudev Ram



SpeakerDeck is an online presentation service roughly like SlideShare. SpeakerDeck seems to have been created by GitHub Inc.

I just published my first presentation on SpeakerDeck. It is a quickstart tutorial for the vi editor. Note: vi, not vim. I had written the tutorial some years ago, when vim was not so widely used, and vi was the most common text editor on Unix systems.

About the tutorial:

I first wrote this vi quickstart tutorial for some friends at a company where I worked. They were Windows and network system administrators without prior Unix experience, and had been tasked with managing some Unix servers that the company had bought for client work. Since I had a Unix background, they asked me to create a quick tutorial on vi for them, which I did.

Later on, after learning the basics of vi from it, and spending some days using vi to edit Unix configuration files, write small shell scripts, etc., they told me that they had found the tutorial useful in getting up to speed on vi quickly.

So, some time later, I thought of publishing it, and sent an article proposal to Linux For You magazine (an Indian print magazine about Linux and open source software). The proposal was accepted and the article was published.

About generating the tutorial as PDF and uploading it to SpeakerDeck:

The original vi quickstart tutorial was in text format. Last year I wrote XMLtoPDFBook (as an application of xtopdf, my Python toolkit for PDF creation), which allows the user to create simple PDF e-books from XML files. So I converted the vi tutorial to XML format (*) and used it to test XMLtoPDFBook. I therefore had the tutorial available in PDF format.

(*) All you have to do for that - i.e. to convert a text file to the XML format supported by XMLtoPDFBook - is to insert each chapter's text as a <chapter> element in the XML file. Then give the XML file as the input to XMLtoPDFBook, and you're done.
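The conversion step described in the note above can be sketched in Python with the standard library's xml.etree.ElementTree module. This is only an illustration: the root element name "book" is an assumption on my part (the note only mentions the <chapter> element), and the function name and sample chapter texts are invented.

```python
import xml.etree.ElementTree as ET

def chapters_to_xml(chapter_texts, root_tag="book"):
    """Wrap each chapter's text in a <chapter> element under one root element.

    root_tag is an assumed name; check the XMLtoPDFBook documentation
    for the root element it actually expects.
    """
    root = ET.Element(root_tag)
    for text in chapter_texts:
        chapter = ET.SubElement(root, "chapter")
        chapter.text = text  # special characters such as & are escaped on output
    return ET.tostring(root, encoding="unicode")

xml_text = chapters_to_xml(["Chapter 1: starting & quitting vi", "Chapter 2: moving around"])
print(xml_text)
```

Building the XML with a library rather than by string concatenation means characters like & and < in the chapter text are escaped properly, so the resulting file stays well-formed.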

SpeakerDeck requires that presentations be uploaded in PDF format. It then converts them to slides. So I thought it would be a good test of SpeakerDeck and/or xtopdf, to upload this PDF generated by xtopdf to SpeakerDeck, and see how the result turned out. I did that today. Then I viewed the resulting SpeakerDeck presentation. It was good to see that the conversion turned out well, AFAICT. All pages seem to have got converted correctly into slides.

The presentation can be viewed here:

A vi quickstart tutorial

If you prefer plain text to presentations, you can read the vi quickstart tutorial here.

- Vasudev Ram - Dancing Bison Enterprises


Wednesday, October 15, 2014

Let's do some magic with Python

By Vasudev Ram



python-magic is a Python wrapper for the libmagic C library which allows you to detect the type of a file by reading and deciphering the initial part of its contents, and/or by using the magic number database for file types. The Unix command called file uses libmagic internally. When you give the command:
$ file *
at a Unix command prompt, it gives you output showing its guess (using libmagic) as to the type of each file in the current directory (because the * is a wildcard that matches all the filenames in the current directory).

For example, if there are 10 files in the directory, it may detect and tell you that the 1st file is a text file, the 2nd is the source code of a C program, the 3rd is the object (compiled) code of that C program, the 4th is a PDF file, the 5th is an HTML file, the 6th is a Linux executable (which may be the end result of linking the object code mentioned earlier with some standard libraries), and so on.
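To give a feel for the magic number idea, here is a toy Python sketch that guesses a file type from the first few bytes of its contents. It checks only three well-known signatures, whereas libmagic uses a large database with much more elaborate rules, so treat this as an illustration of the principle rather than of how libmagic works internally. The names SIGNATURES and guess_type are made up for the example.

```python
# (signature bytes at the start of the file, description)
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x1f\x8b", "gzip compressed data"),
    (b"\x7fELF", "ELF binary"),
]

def guess_type(data):
    """Guess a file type from the leading bytes of its contents."""
    for signature, description in SIGNATURES:
        if data.startswith(signature):
            return description
    return "unknown"

def guess_file_type(filename):
    # Read in binary mode: magic numbers are raw bytes, not text.
    with open(filename, "rb") as f:
        return guess_type(f.read(16))

print(guess_type(b"%PDF-1.3\n..."))
```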

Here is a simple example showing the use of the python-magic library:
>>> import magic
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
Here is an example program that reads the list of files in the current directory, and for each file, prints the filename, the file type and the file's MIME type.
(I used the term MIME type loosely; it should really be called Internet media type.)

import os
from magic import from_file

def do_magic(filename):
    # Ask libmagic for both the descriptive file type and the MIME type.
    file_type = from_file(filename)
    mime_type = from_file(filename, mime=True)
    print("{}: {} | {}".format(filename, file_type, mime_type))

print("filename: file_type | mime_type")
for filename in os.listdir('.'):
    do_magic(filename)
Example program output:
filename: file_type | mime_type
awk: directory | inode/directory
awk.tar: POSIX tar archive (GNU) | application/x-tar
echoer: ASCII text | text/plain
echoer.sh: ASCII text | text/plain
pdf_cherry.py.pdf: PDF document, version 1.3 | application/pdf
prog1.c: ASCII C program text | text/x-c
prog1.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped | application/x-object
prog2.c: ASCII C program text | text/x-c
prog2.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), stripped | application/x-object
reportlab-1.21.1: directory | inode/directory
selpg: directory | inode/directory
test1.tar.gz: gzip compressed data, was "test1.tar", from Unix, last modified: Mon Oct 13 19:50:01 2014 | application/x-gzip
test_python_magic.py: Python script, ASCII text executable | text/x-python
test_python_magic2.py: Python script, ASCII text executable | text/x-python
text_file.txt: ASCII text | text/plain
tpm.out: ASCII text | text/plain
tpm2.out: empty | inode/x-empty
xtopdf: directory | inode/directory
So the python-magic library can be useful, since it allows us to detect the type of a file (correctly most of the time) from within our Python code, and then do something meaningful with that information.

For example, a program that reads all the files under a directory tree, can be made to do the right kind of processing with each type of file, based on the file type it detects using python-magic.
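Here is a minimal sketch of that idea: walk a directory tree, detect each file's MIME type, and call a handler chosen by that type. It assumes python-magic is installed for real use, and relies only on the from_file(filename, mime=True) call shown earlier. The function names, the handlers dictionary and the detect parameter are all invented for this illustration. The detector is passed in as a parameter so that the dispatch logic can be tried out without libmagic.

```python
import os

def detect_mime(path):
    # Imported here so that the rest of the sketch works without python-magic.
    import magic
    return magic.from_file(path, mime=True)

def walk_and_dispatch(root, handlers, detect=detect_mime, default=None):
    """Call handlers[mime_type](path) for every file under root.

    handlers maps MIME type strings to functions taking a file path.
    Files whose type has no handler go to default, or are skipped if
    default is None. Returns a list of (path, handler result) tuples.
    """
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            handler = handlers.get(detect(path), default)
            if handler is not None:
                results.append((path, handler(path)))
    return results
```

For example, walk_and_dispatch('.', {'application/pdf': os.path.getsize}) would return the path and size of every file under the current directory that is detected as a PDF, and skip everything else.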

Abracadabra!

- Vasudev Ram - Dancing Bison Enterprises


Monday, October 13, 2014

Hacker News thread on PDF reporting tools

By Vasudev Ram

I saw this thread about PDF reporting tools on Hacker News (HN) today:

Ask HN: What do you use for PDF reports these days?

It was interesting to see that multiple HN users commented saying that they use ReportLab for PDF report creation in Python and like it a lot. I also commented, mentioning my xtopdf PDF generation library, which is also written in Python and builds on top of ReportLab, and provides a subset of ReportLab's functionality, with a somewhat easier interface / API for that subset.


PrinceXML (*), Jasper (Java), JagPDF (C++, Python, Java, C), Flying Saucer (Java), PDFBox (Java), prawn (Ruby), wkhtmltopdf, FPDF/TCPDF (PHP) were some of the other interesting PDF creation tools or libraries mentioned. I have come across many of these tools in my explorations of the PDF creation field (which has been going on for some years, as it is a personal interest of mine, and I've also done some consulting projects that involved PDF generation and PDF text extraction), but still came across some tools new to me, in the HN thread.

(*) A possibly somewhat less-known fact is that Håkon Wium Lie, one of the board members of YesLogic, the company behind PrinceXML, is also the original proposer of CSS and the CTO of Opera Software (yes, the company behind the Opera browser).

Wikipedia page about PDF - the Portable Document Format.

PDF became an ISO standard (ISO 32000-1) in 2008.

- Vasudev Ram - Dancing Bison Enterprises
