Thursday, August 23, 2012
Fulltext, a Python library to convert documents and media to text for full-text search indexing
Fulltext is a simple Python library for converting document and media files to text. Its main purpose is for use with full-text indexing systems.
See: https://github.com/btimby/fulltext
and
http://pypi.python.org/pypi/fulltext/0.1-1 (Site giving an error at present)
For example, to easily extract text from a PDF file:
$ python
>>> import fulltext
>>> fulltext.get('resume.pdf')
'Experience: ...'
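Since the stated purpose is full-text indexing, here is a minimal sketch of how extracted text could feed a toy inverted index. This is my own illustration, not part of fulltext: build_index and its extract parameter are hypothetical names, and the only assumption about the library is that fulltext.get(path) returns a string, as in the example above.

```python
import re
from collections import defaultdict


def build_index(paths, extract):
    """Map each lowercase word to the set of paths whose text contains it.

    `extract` is any callable that takes a path and returns text.
    Passing fulltext.get here is an assumption based on the example above.
    """
    index = defaultdict(set)
    for path in paths:
        text = extract(path).lower()
        # \w+ splits on punctuation and whitespace; a real indexer would
        # likely also handle stemming and stop words.
        for word in re.findall(r"\w+", text):
            index[word].add(path)
    return index
```

With the library installed, something like build_index(['resume.pdf'], fulltext.get) should then let you look up which files mention a given word.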
Excerpt from the github site for fulltext:
[ Fulltext is a library that makes converting various file formats to plain text simple. Mostly it is a wrapper around shell tools. It will execute the shell program, scrape its results and then post-process the results to pack as much text into as little space as possible.
Supported formats:
The following formats are supported using the command line apps listed.
application/pdf: pdftotext
application/msword: antiword
application/vnd.openxmlformats-officedocument.wordprocessingml.document: docx2txt
application/vnd.ms-excel: convertxls2csv
application/rtf: unrtf
application/vnd.oasis.opendocument.text: odt2txt
application/vnd.oasis.opendocument.spreadsheet: odt2txt
application/zip: funzip
application/x-tar, gzip: tar & gunzip
application/x-tar, bzip2: tar & bunzip2
application/rar: unrar
text/html: html2text
text/xml: html2text
image/jpeg: exiftool
video/mpeg: exiftool
audio/mpeg: exiftool
application/octet-stream: strings ]
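The wrapper approach described in the excerpt can be sketched roughly as below. This is not fulltext's actual code: the names COMMANDS, FALLBACK, squeeze, command_for and to_text are made up for illustration, only two of the listed tools are shown, and the exact command-line arguments are my assumptions about how those tools are usually invoked.

```python
import mimetypes
import re
import subprocess

# Hypothetical MIME type -> command template table (a subset of the list above).
# "{path}" is replaced by the file name; "-" asks pdftotext to write to stdout.
COMMANDS = {
    "application/pdf": ["pdftotext", "{path}", "-"],
    "application/msword": ["antiword", "{path}"],
}
# Anything unrecognised falls back to strings, like application/octet-stream above.
FALLBACK = ["strings", "{path}"]


def squeeze(text):
    """Collapse every run of whitespace into one space to save room."""
    return re.sub(r"\s+", " ", text).strip()


def command_for(path):
    """Pick the command line for a file based on its guessed MIME type."""
    mime, _ = mimetypes.guess_type(path)
    template = COMMANDS.get(mime, FALLBACK)
    return [part.format(path=path) for part in template]


def to_text(path):
    """Run the external tool, capture its stdout and post-process it."""
    result = subprocess.run(command_for(path), capture_output=True, check=True)
    return squeeze(result.stdout.decode("utf-8", errors="replace"))
```

Calling to_text('resume.pdf') needs pdftotext on the PATH. The table-driven lookup is just one way to do the dispatch; the library itself may do it differently.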
Inspired by nature.
- dancingbison.com | @vasudevram | jugad2.blogspot.com