Drill into Your Big Data Today with Apache Drill
Big data techniques are becoming mainstream in an increasing number of businesses, but how do people get self-service, interactive access to their big data? And how do they do this without having to train their SQL-literate employees to be advanced developers?
One solution is to take advantage of the rapidly maturing open source, open community software tool known as Apache Drill. Drill is not the first SQL-on-Hadoop tool. It is, however, a new, highly scalable SQL query engine that has been built from the ground up to be suitable even for production use. Drill extends query capabilities to a variety of new data sources and formats without the IT intervention that a SQL query engine would typically require. In short, Drill allows self-service exploration of data by providing flexibility along with performance.
Why Drill is compelling for customers
1) Drill provides SQL access on any type of data, with extreme flexibility and ease of use
With Drill, you can start querying data in files, a Hive data warehouse, HBase tables, or even non-Hadoop-based storage systems in just a few minutes, and you can combine data from these sources on the fly.
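As a rough illustration of combining sources on the fly, the sketch below builds a request for Drill's REST query endpoint that joins a JSON file with a Hive table in a single SQL statement. It is a minimal sketch under several assumptions: a Drillbit listening at localhost:8047, storage plugins named dfs and hive being enabled, and a file path, table, and columns that are purely hypothetical.

```python
import json
import urllib.request

# Assumed address of a local Drillbit's REST query endpoint.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical path, table, and column names. `dfs` and `hive` are assumed
# storage plugin names that must be enabled in your Drill installation.
QUERY = (
    "SELECT c.name, o.total "
    "FROM dfs.`/data/customers.json` AS c "
    "JOIN hive.`orders` AS o ON c.id = o.customer_id "
    "LIMIT 10"
)


def build_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request carrying one SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to a running Drillbit and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp)
```

With a Drillbit running and both plugins configured, calling run_query(QUERY) would submit the join and return the parsed JSON response. The request is built separately from sending it so the payload can be inspected without a live cluster.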
2) Drill provides low latency performance at scale
Drill is a distributed, columnar SQL query engine built from the ground up for complex data. It doesn’t use MapReduce, Tez, or Spark. Drill can be deployed on a single node or scaled horizontally to tens, hundreds, or thousands of nodes, depending on the number of users that need to be supported, the performance SLAs to be met, and the amount of data that needs processing. Along with scale, Drill is built for performance.
3) Drill provides a granular and de-centralized security model
Drill supports user impersonation, so data and views are accessed under the identity of the specific user who issued the query rather than under a system or process account, which is unacceptable in some environments. Drill also offers powerful ownership-chaining capabilities that control how many levels of nested views a given user can access, so organizations can strike a balance between self-service data exploration and controlled governance.
Drill’s architecture is made up of four components:
- Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery. It will also support full ANSI SQL:2003.
- Low-latency distributed execution engine: This is Drill’s heart. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines such as Dremel, Dryad, Hyracks, CIEL, Stratosphere, and columnar storage.
- Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni, and CSV, and schema-less formats such as JSON, BSON (Binary JSON), and YAML.
- Scalable data sources: This layer is responsible for supporting data sources. The initial focus is to leverage Hadoop as a data source.