Tuesday, August 7, 2012

PDFTextStream now free for single-thread apps

By Vasudev Ram



PDFTextStream is a library for extraction of text from PDF files. According to their site, "PDFTextStream is used by companies and governments around the world to process billions of documents yearly".

UPDATE: I just saw this on the PDFTextStream site, which should make it of interest to a broader audience:

[ Being a Java library, PDFTextStream may be used by any JVM language that supports interoperability with Java APIs, including Clojure, Scala, Groovy, JRuby, Jython, and so on. ]

PDFTextStream is a product of Snowtide Informatics Systems, Inc., which was founded by Chas Emerick. Chas is also a co-author of Clojure Programming, an O'Reilly book about the Clojure programming language.

I first researched libraries for PDF text extraction some years ago, for a consulting project. I did not use PDFTextStream then, because I had not come across it. (I ended up using xpdf, which is also quite a good library, written in C and for use from C. It had good technical support; they even quickly fixed a bug or two that I found while trying it.)

But even after that project was over, out of interest, I would search for other such libraries once in a while, and so I came across PDFTextStream.

It was a somewhat expensive library then, IIRC, and still is. But it is now free for use in single-threaded applications.

It has versions for the JVM and .NET.

Download PDFTextStream here.

- Vasudev Ram - Dancing Bison Enterprises
