Monday, August 13, 2012

Inferno on Disco, Python MapReduce library / daemon for structured text

By Vasudev Ram


Inferno is an open-source Python MapReduce library. It has (from the site):

[ A query language for large amounts of structured text (CSV, JSON, etc).

A continuous and scheduled MapReduce daemon with an HTTP interface that automatically launches MapReduce jobs to handle a constant stream of incoming data. ]

Overview of Inferno.

This overview page has a nice serial example: starting with a small set of test data, it shows how to query for a certain result in SQL and then in AWK (both are easy one-liners), and then goes on to show how to achieve the same result using Inferno.

The interesting point is that the Inferno code is also small (a "rule" of ~10 lines, presumably stored in a config file) and a one-line command, but the difference from the SQL and AWK examples is that this runs a Disco MapReduce job to distribute the work across the nodes on a cluster. There is almost nothing in the Inferno code to indicate that this is a distributed computing MapReduce job.
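To make the "group and count" idea concrete, here is a minimal single-process sketch in plain Python of the map, shuffle and reduce phases that such a query goes through. It is not Inferno's or Disco's API: the sample data, the column names ("first", "last") and all function names are assumptions made up for illustration, and nothing here is distributed across nodes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data for illustration; not taken from the Inferno site.
SAMPLE_CSV = """first,last
Ada,Lovelace
Alan,Turing
Grace,Hopper
Mary,Turing
"""


def map_phase(rows):
    """Emit a (key, 1) pair for every input record."""
    for row in rows:
        yield row["last"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_last_name(text):
    """Run the three phases in sequence over CSV text."""
    rows = csv.DictReader(io.StringIO(text))
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    for last, count in sorted(count_by_last_name(SAMPLE_CSV).items()):
        print(last, count)
```

In a real MapReduce system the map and reduce steps would run on different nodes over chunks of the input, and the shuffle would move data between them; a library like Inferno hides that plumbing behind the rule definition.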

Inferno uses Disco.

Disco is "a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data."

Some users of Disco: Chango, Nokia, Zemanta. Chango staff seem to be the developers of Inferno.

- Vasudev Ram - Dancing Bison Enterprises
