Wednesday, October 15, 2014

Let's do some magic with Python

By Vasudev Ram



python-magic is a Python wrapper for the libmagic C library, which detects the type of a file by reading and interpreting the initial part of its contents, using a database of magic numbers (characteristic byte signatures) for known file types. The Unix file command uses libmagic internally. When you give the command:
$ file *
at a Unix command prompt, it gives you output showing its guess (using libmagic) as to the type of each file in the current directory (because the * is a shell wildcard that matches all the filenames in the current directory, except those beginning with a dot).

For example, if there are 10 files in the directory, it may detect and tell you that the 1st file is a text file, the 2nd is the source code of a C program, the 3rd is the object (compiled) code of that C program, the 4th is a PDF file, the 5th is an HTML file, the 6th is a Linux executable (which may be the end result of linking the object code mentioned earlier with some standard libraries), and so on.
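
To make the idea of a magic number concrete, here is a minimal sketch in plain Python that checks the first few bytes of some data against a handful of well-known signatures. It is only an illustration, not how libmagic is implemented: the SIGNATURES table and the names guess_type and guess_file_type are made up for this example, and libmagic's real database is far larger and uses many other kinds of tests.

```python
# A few well-known magic numbers (leading bytes) and a label for each.
# This table is illustrative only; libmagic's database is much richer.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x7fELF", "ELF binary"),
    (b"\x1f\x8b", "gzip compressed data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def guess_type(data):
    """Return a rough type label for a bytes object, or 'unknown'."""
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return "unknown"


def guess_file_type(path, nbytes=16):
    """Read the first nbytes of a file (in binary mode) and guess its type."""
    with open(path, "rb") as f:
        return guess_type(f.read(nbytes))
```

A call such as guess_type(b"%PDF-1.2") returns 'PDF document', while data that matches none of the prefixes falls through to 'unknown'. Plain text has no magic number at all, which is one reason the real library needs more than a prefix table.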

Here is a simple example showing the use of the python-magic library:
>>> import magic
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'
Here is an example program that reads the list of files in the current directory, and for each file, prints the filename, the file type and the file's MIME type.
(I used the term MIME type loosely; it should really be called Internet media type.)

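import os
from magic import from_file

def do_magic(filename):
    # Ask libmagic for both the descriptive type and the MIME type.
    file_type = from_file(filename)
    mime_type = from_file(filename, mime=True)
    # print() with a single argument works in both Python 2 and Python 3.
    print("{}: {} | {}".format(filename, file_type, mime_type))

print("filename: file_type | mime_type")
# os.listdir('.') returns the names of all entries in the current directory.
for filename in os.listdir('.'):
    do_magic(filename)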
Example program output:
filename: file_type | mime_type
awk: directory | inode/directory
awk.tar: POSIX tar archive (GNU) | application/x-tar
echoer: ASCII text | text/plain
echoer.sh: ASCII text | text/plain
pdf_cherry.py.pdf: PDF document, version 1.3 | application/pdf
prog1.c: ASCII C program text | text/x-c
prog1.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped | application/x-object
prog2.c: ASCII C program text | text/x-c
prog2.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), stripped | application/x-object
reportlab-1.21.1: directory | inode/directory
selpg: directory | inode/directory
test1.tar.gz: gzip compressed data, was "test1.tar", from Unix, last modified: Mon Oct 13 19:50:01 2014 | application/x-gzip
test_python_magic.py: Python script, ASCII text executable | text/x-python
test_python_magic2.py: Python script, ASCII text executable | text/x-python
text_file.txt: ASCII text | text/plain
tpm.out: ASCII text | text/plain
tpm2.out: empty | inode/x-empty
xtopdf: directory | inode/directory
So the python-magic library can be useful, since it allows us to detect the type of a file (correctly most of the time) from within our Python code, and then do something meaningful with that information.

For example, a program that reads all the files under a directory tree can be made to do the right kind of processing for each type of file, based on the file type it detects using python-magic.
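
A rough sketch of that idea follows. It walks a directory tree with os.walk, gets each file's MIME type, and calls a handler chosen by that type. The only python-magic call used is from_file(path, mime=True), as shown earlier; the names process_tree, detect_mime and HANDLERS, and the placeholder handlers themselves, are assumptions made for this illustration. The detector is passed in as a parameter so that the traversal logic can be tried out even where libmagic is not installed.

```python
import os

# Hypothetical handlers keyed by MIME type; swap in real processing as needed.
HANDLERS = {
    "application/pdf": lambda path: "would extract text from " + path,
    "text/plain": lambda path: "would index " + path,
}


def detect_mime(path):
    """Default detector: ask python-magic for the file's MIME type."""
    import magic  # imported lazily so the rest works without libmagic
    return magic.from_file(path, mime=True)


def process_tree(root, detect=detect_mime, handlers=HANDLERS):
    """Walk the tree under root and run the handler matching each file's type.

    Returns a list of (path, mime_type, result) tuples. Files whose type
    has no handler are skipped.
    """
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mime_type = detect(path)
            handler = handlers.get(mime_type)
            if handler is not None:
                results.append((path, mime_type, handler(path)))
    return results
```

With python-magic installed, process_tree('.') would run the placeholder handlers over every PDF and plain text file under the current directory. Keeping the handlers in a dictionary means that supporting another file type is a matter of adding one more entry.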

Abracadabra!

- Vasudev Ram - Dancing Bison Enterprises


3 comments:

Álvaro Justen "Turicas" said...

Hello! I've uploaded this library to PyPI and created an easier API. Please take a look at my blog post about it: http://blog.justen.eng.br/2016/02/detecting-file-type-and-encoding-in-python.html

Vasudev Ram said...

Hi,

Thanks for letting me know about this. Interesting.

Will check out your improved version of the library.

Incidentally, reading about this reminded me of something, so I googled it, and found (as I remembered) that the creator of the Unix file command is Ian Darwin, who is also the author of the Java Cookbook by O'Reilly, which I've read (it's good), and some other books too. Just FYI.

Vasudev

Vasudev Ram said...

Also, your rows library looks interesting. Will check it out.