A Short Bite into Apache Hive

Creativity and science work together best in the face of intense inconvenience or intense need. That is how Hive came into existence.

In 2008, Hadoop became a top-level Apache project, with a stable release running in production at Yahoo.

Hive was first developed at Facebook, where employees found it too hard to query data in Hadoop's file system by writing MapReduce jobs directly.

Hive was therefore built to let users query data with its own SQL-like language, HiveQL. By 2008, Hive had become an Apache open-source project.

Hive sits on top of MapReduce and the Hadoop Distributed File System (HDFS).

The user writes queries, which are converted into optimized MapReduce jobs and then executed by the Hive engine. Hive behaves like a data warehouse in which individual records cannot be updated or deleted.
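
To make the translation step concrete, here is a toy sketch in Python of how an aggregate query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept could be broken into map, shuffle, and reduce phases. The table, column names, and function names are hypothetical, and the code only imitates the shape of a MapReduce job; it is not Hive's actual compiler or execution engine.

```python
from collections import defaultdict

# Hypothetical rows standing in for an "employees" table stored in HDFS.
ROWS = [
    {"name": "ana", "dept": "sales"},
    {"name": "bo", "dept": "eng"},
    {"name": "cy", "dept": "sales"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["dept"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into one count (the COUNT(*) aggregate)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(rows):
    """Run the three phases in order for the illustrative query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

In a real cluster each phase would run in parallel across many machines, which is where both the scalability and the startup overhead of this model come from.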

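Even though HiveQL is very similar to SQL, it does not support the full set of ANSI SQL features, which limits compatibility with existing SQL workloads. Performance has also been an issue, since queries run as batch MapReduce jobs.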

Organizations want tools that not only deliver high performance but are also easy to use, interactive, less error-prone, and easy to integrate with BI tools like Tableau. This is another area where Hive was lacking.

Since Hive, a legion of open-source and enterprise tools has emerged for querying data in the Hadoop file system, in local memory, and in cloud storage.

Cloudera released Impala, which queries data directly and bypasses MapReduce.

Presto is another such open-source tool; it provides interactive querying over petabytes of data.

Apache Drill is another interesting open-source project used for accessing data in the file system. Most querying tools on Hadoop require a schema to be declared, whereas Apache Drill does not.
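
The difference between declaring a schema up front and discovering it at read time can be sketched in Python. The records, field names, and helper functions below are hypothetical, and the code only illustrates the general schema-on-read idea; it does not reflect how Apache Drill is implemented.

```python
import json

# Hypothetical newline-delimited JSON records whose fields vary from row to row.
RAW = """\
{"user": "ana", "clicks": 3}
{"user": "bo", "country": "NZ"}
{"user": "cy", "clicks": 7, "country": "PE"}
"""


def read_records(text):
    """Parse each non-empty line as a JSON object; no schema is declared first."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def discover_columns(records):
    """Infer the column set from the data itself, in first-seen order."""
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns


def select(records, wanted):
    """Project the requested columns, using None where a record lacks a field."""
    return [tuple(record.get(col) for col in wanted) for record in records]


if __name__ == "__main__":
    records = read_records(RAW)
    print(discover_columns(records))
    print(select(records, ["user", "clicks"]))
```

The trade-off is that missing or inconsistent fields surface at query time instead of at load time, so the query layer has to decide how to treat them (here, as None).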

Vertica Analytics Platform is not open source but is widely used in large organizations for interactive high-performance querying.

Apache Kylin, developed at eBay, is a distributed query engine designed to bring query latency down to the sub-second range, and it integrates with several BI tools. It also supports most ANSI SQL features, which helps with multidimensional analysis on large volumes of data.

Redshift, developed by Amazon and based on PostgreSQL, also provides analytics on big data.

There are many alternatives to Hive these days, and organizations are quick to adopt them because of Hive's incomplete SQL compatibility, high query latency, and lack of real-time queries.