PDFTextStream is a library for extracting text from PDF files. According to its site, "PDFTextStream is used by companies and governments around the world to process billions of documents yearly".
UPDATE: Just saw this on the PDFTextStream site - which should make it of interest to a broader audience:
[ Being a Java library, PDFTextStream may be used by any JVM language that supports interoperability with Java APIs, including Clojure, Scala, Groovy, JRuby, Jython, and so on. ]
PDFTextStream is a product of Snowtide Informatics Systems, Inc., which was founded by Chas Emerick. Chas is also a co-author of Clojure Programming, an O'Reilly book about the Clojure programming language.
I first researched libraries for PDF text extraction some years ago, for a consulting project. I did not use PDFTextStream then, because I had not come across it. (I ended up using xpdf, which is also quite a good library, written in and for C, and which had good technical support; they even quickly fixed a bug or two that I found while trying it.)
But even after that project was over, out of interest, I would search for other such libraries once in a while, and so I came across PDFTextStream.
It was a somewhat expensive library then, IIRC, and still is. But it is now free for use in single-threaded applications.
It has versions for the JVM and .NET.
Download PDFTextStream here.
- Vasudev Ram - Dancing Bison Enterprises