python-magic is a Python wrapper for the libmagic C library, which detects the type of a file mainly by reading the initial part of its contents and checking it against a database of magic numbers for known file types. The Unix command called file uses libmagic internally. When you give the command:
$ file *
at a Unix command prompt, it prints its guess (made using libmagic) as to the type of each file in the current directory (the * is a wildcard that matches all the filenames in the current directory, except hidden ones that start with a dot).
For example, if there are 10 files in the directory, it may detect and tell you that the 1st file is a text file, the 2nd is the source code of a C program, the 3rd is the object (compiled) code of that C program, the 4th is a PDF file, the 5th is an HTML file, the 6th is a Linux executable (which may be the end result of linking the object code mentioned earlier with some standard libraries), and so on.
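To make the idea of a magic number concrete, here is a toy sketch in plain Python that guesses a file's type from its first few bytes. It is not how libmagic is implemented; libmagic consults a far larger database of rules. The names SIGNATURES, guess_type and guess_file_type are made up for this illustration, and only three well-known signatures are included.

```python
# Toy illustration only: libmagic uses a much richer rule database.
# Each entry pairs a well-known leading byte sequence with a label.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x7fELF", "ELF binary"),
    (b"\x1f\x8b", "gzip compressed data"),
]


def guess_type(data):
    """Return a rough type label for the leading bytes of a file."""
    if not data:
        return "empty"
    for prefix, label in SIGNATURES:
        if data.startswith(prefix):
            return label
    return "unknown"


def guess_file_type(path, nbytes=16):
    """Read the first nbytes of the file at path and guess its type."""
    with open(path, "rb") as f:
        return guess_type(f.read(nbytes))
```

Anything that does not start with one of the three listed signatures is simply reported as unknown, which is where a real magic database earns its keep.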
Here is a simple example showing the use of the python-magic library:
>>> import magic
>>> magic.from_file("testdata/test.pdf")
'PDF document, version 1.2'
>>> magic.from_buffer(open("testdata/test.pdf", "rb").read(1024))
'PDF document, version 1.2'
>>> magic.from_file("testdata/test.pdf", mime=True)
'application/pdf'

Here is an example program that reads the list of files in the current directory and, for each file, prints the filename, the file type and the file's MIME type.
(I used the term MIME type loosely; it should really be called Internet media type.)
import os
from magic import from_file

def do_magic(filename):
    # Textual description of the file type, e.g. 'ASCII text'.
    file_type = from_file(filename)
    # The same detection, reported as a MIME type, e.g. 'text/plain'.
    mime_type = from_file(filename, mime=True)
    print("{}: {} | {}".format(filename, file_type, mime_type))

print("filename: file_type | mime_type")
for filename in os.listdir('.'):
    do_magic(filename)

Example program output:
filename: file_type | mime_type
awk: directory | inode/directory
awk.tar: POSIX tar archive (GNU) | application/x-tar
echoer: ASCII text | text/plain
echoer.sh: ASCII text | text/plain
pdf_cherry.py.pdf: PDF document, version 1.3 | application/pdf
prog1.c: ASCII C program text | text/x-c
prog1.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped | application/x-object
prog2.c: ASCII C program text | text/x-c
prog2.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), stripped | application/x-object
reportlab-1.21.1: directory | inode/directory
selpg: directory | inode/directory
test1.tar.gz: gzip compressed data, was "test1.tar", from Unix, last modified: Mon Oct 13 19:50:01 2014 | application/x-gzip
test_python_magic.py: Python script, ASCII text executable | text/x-python
test_python_magic2.py: Python script, ASCII text executable | text/x-python
text_file.txt: ASCII text | text/plain
tpm.out: ASCII text | text/plain
tpm2.out: empty | inode/x-empty
xtopdf: directory | inode/directory

So the python-magic library can be useful, since it allows us to detect the type of a file (correctly most of the time) from within our Python code, and then do something meaningful with that information.
For example, a program that reads all the files under a directory tree can be made to do the right kind of processing for each file, based on the file type it detects using python-magic.
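A minimal sketch of that idea follows, assuming python-magic is installed when the default detector is used. The handler functions, the HANDLERS table and process_tree are hypothetical names invented for this illustration, and the handlers only return a label instead of doing real work. The detector is passed in as a parameter so that the dispatch logic can be tried out with a stand-in function as well.

```python
import os


def detect_mime(path):
    # Imported here so the dispatch logic below can be tried
    # even on a machine where libmagic is not installed.
    from magic import from_file
    return from_file(path, mime=True)


# Hypothetical handlers: a real program would do actual work here.
def handle_text(path):
    return "index text"


def handle_pdf(path):
    return "extract PDF text"


def handle_other(path):
    return "skip"


# Map each MIME type of interest to the function that processes it.
HANDLERS = {
    "text/plain": handle_text,
    "application/pdf": handle_pdf,
}


def process_tree(root, detect=detect_mime):
    """Walk root and return (path, action) pairs chosen by MIME type."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Fall back to handle_other for types with no handler.
            handler = HANDLERS.get(detect(path), handle_other)
            results.append((path, handler(path)))
    return results
```

Calling process_tree('.') would use python-magic to classify every file under the current directory. Supporting another file type is then a matter of adding one more entry to HANDLERS.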
Abracadabra!
- Vasudev Ram - Dancing Bison Enterprises
3 comments:
Hello! I've uploaded this library to PyPI and created an easier API. Please see my blog post about it: http://blog.justen.eng.br/2016/02/detecting-file-type-and-encoding-in-python.html
Hi,
Thanks for letting me know about this. Interesting.
Will check out your improved version of the library.
Incidentally, reading about this reminded me of something, so I googled it and found (as I remembered) that the author of the widely used open-source implementation of the Unix file command is Ian Darwin, who is also the author of the Java Cookbook from O'Reilly, which I've read (it's good), and of some other books too. Just FYI.
Vasudev
Also, your rows library looks interesting. Will check it out.