Thursday, January 23, 2014

Cloudera's Impala engine - SQL querying of Hadoop data

By Vasudev Ram


I had blogged a while ago about SQL coming to Hadoop, citing a GigaOm article. That article had also mentioned Cloudera's Impala product as one of the strong contenders in this area.

Cloudera Impala is an open source SQL query engine that can operate directly on Hadoop data, so there is no need to extract the data into an RDBMS first. Cloudera also plans to support Business Intelligence (BI) tools.
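To give a rough feel for what "SQL directly on the data" looks like from client code, here is a minimal Python sketch. It is only an illustration under stated assumptions, not something taken from Cloudera's material: the `page_views` table, its columns, and the host name in the comment are all made up, and an in-memory SQLite database stands in for Impala so that the snippet runs anywhere. The `run_query` helper only relies on the standard Python DB-API 2.0 cursor interface, so the same function should work with a connection obtained from an Impala client library such as impyla (an assumption on my part, I have not tried it against a real cluster here).

```python
import sqlite3


def run_query(conn, sql):
    """Run a SQL query on a DB-API 2.0 connection and return all rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    finally:
        cur.close()


def demo():
    # Stand-in for an Impala connection: an in-memory SQLite database.
    # With Impala, the connection would instead come from a client library
    # such as impyla (assumption, hypothetical host name), for example:
    #   from impala.dbapi import connect
    #   conn = connect(host="impalad-host.example.com", port=21050)
    conn = sqlite3.connect(":memory:")

    # Hypothetical table and data, created here only so the query can run.
    conn.execute("CREATE TABLE page_views (url TEXT, hits INTEGER)")
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [("/home", 10), ("/about", 3), ("/home", 5)],
    )

    # An ordinary aggregate query, the kind of SQL a BI tool might issue.
    sql = (
        "SELECT url, SUM(hits) AS total FROM page_views "
        "GROUP BY url ORDER BY total DESC"
    )
    rows = run_query(conn, sql)
    conn.close()
    return rows


if __name__ == "__main__":
    for url, total in demo():
        print(url, total)
```

The point of keeping `run_query` separate is that only the line that creates the connection would need to change when going from the SQLite stand-in to a real SQL-on-Hadoop engine.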

Here is the original announcement of Impala from Cloudera:

Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real

Interestingly, they mention in that announcement that Google's Dremel paper was one of the things that inspired them to create Impala.

I had blogged about Dremel earlier:

Drill by Apache, like Google Dremel

More on Google Dremel - Wired article

These are the key benefits and features of Cloudera Impala that I found interesting (excerpted from their page, emphasis mine):

[
Key Benefits of Impala

Speed to Insight
Perform interactive analytics directly on data stored in Hadoop. Get answers as quickly as you can ask questions, without the bottlenecks caused by data movement and jumping between data silos.

Cost Savings
Reduce data movement as well as duplicate storage with specialized systems by performing interactive analysis directly on full fidelity data.

Full Fidelity Analysis
Ask questions of all your data - without loss of fidelity from aggregations or conforming to fixed schemas.

Familiarity
Leverage existing BI tools and employee skill sets (SQL) to interact with data stored in Hadoop.

Discoverability
Enable more users to interact with more data by providing a single repository and metadata store from source to analysis.

Unification
Leverage the same file and data formats, metadata, security and resource management frameworks you use for the rest of the Hadoop system.


Key Features of Impala

SQL queries on CDH in seconds

Native MPP query engine

Integration with leading BI tools

Support for HDFS and HBase

Support for a wide variety of file formats including text, SequenceFiles, Avro, RCFile, LZO and Parquet

In-memory data transfers

Leverages metadata, ODBC driver, SQL syntax and Beeswax GUI (in Hue) from Apache Hive

Kerberos authentication

Fine-grained, role-based authorization with Sentry

100% open source (Apache licensed)

]

Here are a few related interesting posts:

Cloudera gets $65 mil more to grow Hadoop based Big Data offerings

On GigaOm: Cloudera makes SQL a first-class citizen in Hadoop

Cloudera Touts Near Linear Scalability with Impala

And here is a video of a technical deep dive into Cloudera Impala, on their site.

Read other posts about Big Data on my blog.

Check out this photo of an impala with cheetahs:


Excerpts from the Wikipedia page about the cheetah:

[ The cheetah is a large feline inhabiting most of Africa and parts of the Middle East. The cheetah can run faster than any other land animal— as fast as 112 to 120 km/h (70 to 75 mph) in short bursts, and has the ability to accelerate from 0 to 100 km/h (62 mph) in three seconds. ]

Maybe Cloudera should have named it Cheetah instead of Impala :)

- Vasudev Ram - Dancing Bison Enterprises


Contact Page
