What is Hive?
Apache Hive is a data warehousing package built on top of Hadoop. It provides an SQL dialect, called Hive Query Language (HQL), for querying and summarizing data stored in a Hadoop cluster. Hive does not support row-level inserts, updates, or deletes, nor does it support transactions. Hive allows extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch-processing workloads such as document indexing, text mining, customer-facing business intelligence, and predictive modeling. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.
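As a minimal sketch of what this looks like in practice (the table and column names here are hypothetical), an HQL query such as the following is compiled by the Hive service into one or more MapReduce jobs:

    -- Hypothetical table over delimited log files already stored in HDFS
    CREATE TABLE page_views (
      user_id   STRING,
      url       STRING,
      view_time TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- An aggregate query like this runs as a MapReduce job across the cluster
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url;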
As with any database management system, you can run Hive queries in several ways: from a command-line interface known as the Hive shell, or through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) applications. The Hive Thrift Client is much like any other database client that is installed on a user's machine: it communicates with the Hive services running on the server. You can use the Hive Thrift Client from software applications written in C++, Java, PHP, Python, or Ruby.
Hive looks like a traditional database with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is designed for long sequential scans, and because Hive is built on Hadoop, queries can have very high latency, on the order of minutes. This means Hive is not appropriate for applications that need very fast response times. Finally, Hive is read-oriented and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics. Hive's SQL can also be extended with user code via user-defined functions (UDFs) and user-defined aggregate functions (UDAFs). There is no single format ("Hive format") in which data must be stored: Hive comes with built-in connectors for comma-separated value (CSV) text files, Apache Parquet, Apache ORC, and other formats, and users can extend Hive with connectors for additional formats.
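For illustration, the sketch below shows both sides of this flexibility: storing a table in a columnar format, and registering a UDF for use in queries. The JAR path, function name, and UDF class here are hypothetical placeholders:

    -- Store an existing table in the columnar ORC format
    CREATE TABLE page_views_orc
    STORED AS ORC
    AS SELECT * FROM page_views;

    -- Register a user-defined function packaged in a JAR (hypothetical path/class)
    ADD JAR /tmp/my_udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_url
      AS 'com.example.hive.udf.NormalizeUrl';

    SELECT normalize_url(url), COUNT(*)
    FROM page_views_orc
    GROUP BY normalize_url(url);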
Apache Hive is not designed for online transaction processing (OLTP) workloads and does not offer real-time updates. It is best used for traditional data warehousing tasks and batch jobs over large sets of append-only data. Hive is designed to maximize scalability, performance, extensibility, and loose coupling with its input formats.
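The typical append-only pattern is to load or insert new batches of data rather than update rows in place; a minimal sketch (the HDFS path and staging table are hypothetical):

    -- Append a new batch of files from HDFS into an existing table
    LOAD DATA INPATH '/data/logs/2013-06-01' INTO TABLE page_views;

    -- Or append the results of a query over a staging table
    INSERT INTO TABLE page_views
    SELECT user_id, url, view_time FROM staging_page_views;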
Other features of Hive include:
- Indexing to provide acceleration; index types include compaction and bitmap indexes as of Hive 0.10, with more index types planned (see the sketch after this list).
- Support for different storage types such as plain text, RCFile, HBase, and others.
- Metadata storage in an RDBMS, reducing the time needed to perform semantic checks during query execution.
- Operation on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, Snappy, and others.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
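The sketch below illustrates several of these features together. Table and index names are hypothetical, and the CREATE INDEX syntax shown belongs to the older Hive releases described above:

    -- Bitmap index on a column (index types: 'COMPACT' or 'BITMAP')
    CREATE INDEX url_idx ON TABLE page_views (url)
    AS 'BITMAP' WITH DEFERRED REBUILD;
    ALTER INDEX url_idx ON page_views REBUILD;

    -- A table stored as RCFile instead of plain text
    CREATE TABLE page_views_rc
    STORED AS RCFILE
    AS SELECT * FROM page_views;

    -- Write query output compressed with the Snappy codec
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

    -- Choose the engine HiveQL is compiled to (mr, tez, or spark)
    SET hive.execution.engine=tez;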