Thursday, August 23, 2012

Fulltext, Python library to convert documents and media to text - for full-text indexing for search


Fulltext is a simple Python library for converting document and media files to text. Its main purpose is for use with full-text indexing systems.

See: https://github.com/btimby/fulltext

and

http://pypi.python.org/pypi/fulltext/0.1-1 (Site giving an error at present)

For example, to easily extract text from a PDF file:

$ python
>>> import fulltext
>>> fulltext.get('resume.pdf')
'Experience: ...'

Excerpt from the GitHub page for fulltext:

[ Fulltext is a library that makes converting various file formats to plain text simple. Mostly it is a wrapper around shell tools. It will execute the shell program, scrape its results and then post-process the results to pack as much text into as little space as possible.

Supported formats:
The following formats are supported using the command line apps listed.

application/pdf: pdftotext
application/msword: antiword
application/vnd.openxmlformats-officedocument.wordprocessingml.document: docx2txt
application/vnd.ms-excel: convertxls2csv
application/rtf: unrtf
application/vnd.oasis.opendocument.text: odt2txt
application/vnd.oasis.opendocument.spreadsheet: odt2txt
application/zip: funzip
application/x-tar, gzip: tar & gunzip
application/x-tar, bzip2: tar & bunzip2
application/rar: unrar
text/html: html2text
text/xml: html2text
image/jpeg: exiftool
video/mpeg: exiftool
audio/mpeg: exiftool
application/octet-stream: strings ]
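
To make the "wrapper around shell tools" idea in the excerpt concrete, here is a minimal sketch of that general approach. It is not fulltext's actual code: the names COMMANDS, squeeze, command_for and to_text are made up for illustration, the table covers only three of the formats listed, and it assumes Python 3 plus the pdftotext, antiword and strings programs being installed and on the PATH.

```python
import mimetypes
import re
import subprocess

# Hypothetical subset of the format list above: MIME type -> command template.
# "{path}" is replaced by the file name; the trailing "-" asks pdftotext
# to write its output to stdout instead of a .txt file.
COMMANDS = {
    "application/pdf": ["pdftotext", "{path}", "-"],
    "application/msword": ["antiword", "{path}"],
    "application/octet-stream": ["strings", "{path}"],
}


def squeeze(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def command_for(path):
    """Pick a command from the file extension, falling back to strings."""
    mime, _ = mimetypes.guess_type(path)
    template = COMMANDS.get(mime, COMMANDS["application/octet-stream"])
    return [part.format(path=path) for part in template]


def to_text(path):
    """Run the external tool, capture its stdout and compact the result."""
    result = subprocess.run(command_for(path), capture_output=True, check=True)
    return squeeze(result.stdout.decode("utf-8", errors="replace"))
```

With those tools installed, to_text('resume.pdf') would return the text of the PDF on a single line. Collapsing the whitespace is one simple way to "pack as much text into as little space as possible"; the real library may well post-process differently.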

Inspired by nature.
- dancingbison.com | @vasudevram | jugad2.blogspot.com
