Thursday, August 23, 2012

Fulltext, Python library to convert documents and media to text - for full-text indexing for search

Fulltext is a simple Python library for converting document and media files to text. Its main purpose is to supply text to full-text indexing systems.


and (Site giving an error at present)

For example, to easily extract text from a PDF file:

$ python
>>> import fulltext
>>> fulltext.get('resume.pdf')
'Experience: ...'

Excerpt from the GitHub site for fulltext:

[ Fulltext is a library that makes converting various file formats to plain text simple. Mostly it is a wrapper around shell tools. It will execute the shell program, scrape its results and then post-process the results to pack as much text into as little space as possible.

Supported formats:
The following formats are supported using the command line apps listed.

application/pdf: pdftotext
application/msword: antiword
application/vnd.ms-excel: convertxls2csv
application/rtf: unrtf
application/vnd.oasis.opendocument.text: odt2txt
application/vnd.oasis.opendocument.spreadsheet: odt2txt
application/zip: funzip
application/x-tar, gzip: tar & gunzip
application/x-tar, bzip2: tar & bunzip2
application/rar: unrar
text/html: html2text
text/xml: html2text
image/jpeg: exiftool
video/mpeg: exiftool
audio/mpeg: exiftool
application/octet-stream: strings ]
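
To make the "wrapper around shell tools" idea concrete, here is a minimal sketch of the same pattern: pick a command by MIME type, run it, capture its output, and squeeze the whitespace. This is not fulltext's actual code. The names COMMANDS, compact and extract_text are hypothetical, only three of the tools listed above are wired in, and the sketch assumes each tool writes its text to standard output when invoked as shown.

```python
import mimetypes
import re
import shutil
import subprocess

# Hypothetical mapping for illustration; it mirrors three entries from the
# list above. "{path}" is a placeholder that is swapped for the file name.
COMMANDS = {
    "application/pdf": ["pdftotext", "{path}", "-"],
    "application/msword": ["antiword", "{path}"],
    "application/rtf": ["unrtf", "--text", "{path}"],
}


def compact(text):
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def extract_text(path, default=""):
    """Return the text of `path`, or `default` if it cannot be extracted."""
    # Guess the MIME type from the file extension only.
    mime, _ = mimetypes.guess_type(path)
    template = COMMANDS.get(mime)
    # Give up if the type is unmapped or the tool is not installed.
    if template is None or shutil.which(template[0]) is None:
        return default
    argv = [path if arg == "{path}" else arg for arg in template]
    result = subprocess.run(
        argv, capture_output=True, text=True, errors="replace"
    )
    # A non-zero exit status (for example, a missing file) means no text.
    if result.returncode != 0:
        return default
    return compact(result.stdout)
```

With pdftotext installed, extract_text('resume.pdf') would return the PDF's text as one compact string. Passing a second argument supplies a fallback value for files that cannot be converted.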

Inspired by nature.
- | @vasudevram |
