What is Oozie Workflow?
Apache Oozie is an open source project, built as a Java web application, that simplifies creating workflows and coordinating jobs. In principle, Oozie can combine multiple jobs sequentially into a single logical unit of work. One advantage of the Oozie framework is that it is integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Apache Hive, and Apache Sqoop. It can also schedule system-specific jobs, such as Java programs or shell scripts. Using Oozie, Hadoop administrators can therefore build complex data transformations that combine individual tasks and even sub-workflows. This gives greater control over complex jobs and makes it easier to repeat those jobs at regular intervals.
In practice, there are different types of Oozie jobs:
- Oozie Workflow jobs are directed acyclic graphs (DAGs) that specify a sequence of actions to be executed.
- Oozie Coordinator jobs are recurrent Oozie workflow jobs that are triggered by time and data availability.
- Oozie Bundle packages multiple coordinator and workflow jobs and makes it easier to manage the lifecycle of those jobs.
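To make the "time and data availability" trigger of a coordinator job more concrete, here is a minimal Python sketch. It is only a model of the idea, not Oozie code: the function name `materialize`, the `data_available` callback, and the state labels used here are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def materialize(start, end, frequency, data_available):
    """Return (nominal_time, state) for every scheduled run in [start, end).

    A run is "READY" when its input data is available at the nominal time,
    otherwise it stays "WAITING". All names here are illustrative.
    """
    actions = []
    nominal = start
    while nominal < end:
        state = "READY" if data_available(nominal) else "WAITING"
        actions.append((nominal, state))
        nominal += frequency
    return actions


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1)
    # Hypothetical: input data has arrived for the first and third hour only.
    arrived = {t0, t0 + timedelta(hours=2)}
    for when, state in materialize(t0, t0 + timedelta(hours=3),
                                   timedelta(hours=1), lambda t: t in arrived):
        print(when.isoformat(), state)
```

Each scheduled time yields one workflow run, and a run only becomes eligible once its input data exists, which is the combination of time and data triggers described in the list above.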
An Oozie workflow is a collection of actions arranged in a directed acyclic graph (DAG). The graph contains two types of nodes: control nodes and action nodes. Control nodes define the job chronology, set the rules for beginning and ending a workflow, and control the execution path through decision, fork, and join nodes. Action nodes trigger the execution of tasks. An action can be, for example, a MapReduce job, a Pig application, a file system task, or a Java application.
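The DAG structure can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how Oozie represents workflows internally: the node names (`split`, `mr-clean`, `pig-stats`, `merge`) and the dictionary layout are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the acyclicity check.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: node name -> (node type, nodes it transitions to).
WORKFLOW = {
    "start":     ("control", ["split"]),
    "split":     ("control", ["mr-clean", "pig-stats"]),  # fork
    "mr-clean":  ("action",  ["merge"]),                  # e.g. a MapReduce job
    "pig-stats": ("action",  ["merge"]),                  # e.g. a Pig application
    "merge":     ("control", ["end"]),                    # join
    "end":       ("control", []),
}


def execution_order(workflow):
    """Return one valid node order; raise ValueError if the graph has a cycle."""
    # TopologicalSorter expects node -> predecessors, so invert the transitions.
    predecessors = {name: set() for name in workflow}
    for name, (_kind, targets) in workflow.items():
        for target in targets:
            predecessors.setdefault(target, set()).add(name)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"workflow is not acyclic: {exc.args[1]}") from exc


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
```

The two action nodes between the fork and the join have no ordering constraint between them, which is what allows them to run in parallel. A workflow whose transitions loop back on themselves is rejected.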
Oozie integrates natively with the Hadoop stack and supports all types of Hadoop jobs. Oozie is responsible for triggering the workflow actions, while the actual execution of the tasks is done by Hadoop MapReduce. Oozie can therefore leverage existing Hadoop machinery for load balancing and fail-over. Oozie detects the completion of tasks through callbacks and polling. When Oozie starts a task, it provides a unique callback HTTP URL to the task, and the task notifies that URL when it completes. If the task fails to invoke the callback URL, Oozie can poll the task for completion. The Oozie workflow definition language is XML-based and is known as hPDL (Hadoop Process Definition Language). Oozie also comes with a command-line program that interacts with the Oozie server over REST.
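The callback-and-polling approach to completion detection can be modeled roughly as follows. This is an in-memory sketch with no networking, and everything in it is an assumption made for illustration: the class `CompletionTracker`, its method names, the state labels, and the example base URL are not taken from Oozie itself.

```python
class CompletionTracker:
    """Tracks task completion via callbacks, with polling as a fallback."""

    def __init__(self, base_url="http://oozie.example.com:11000/oozie/callback"):
        self.base_url = base_url  # hypothetical server address
        self.status = {}          # task id -> "RUNNING" or "DONE"

    def start_task(self, task_id):
        """Register the task and return the unique callback URL handed to it."""
        self.status[task_id] = "RUNNING"
        return f"{self.base_url}?id={task_id}"

    def on_callback(self, task_id):
        """Invoked when a task notifies its callback URL on completion."""
        self.status[task_id] = "DONE"

    def poll(self, is_finished):
        """Fallback: check every task that is still marked as running."""
        for task_id, state in self.status.items():
            if state == "RUNNING" and is_finished(task_id):
                self.status[task_id] = "DONE"


if __name__ == "__main__":
    tracker = CompletionTracker()
    print(tracker.start_task("task-1"))
    tracker.start_task("task-2")
    tracker.on_callback("task-1")             # task-1 reported back itself
    tracker.poll(lambda tid: tid == "task-2")  # task-2 is discovered by polling
    print(tracker.status)
```

The callback keeps the common case cheap, since the server does not need to ask repeatedly, while polling covers tasks whose notification never arrives.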
Chaining and automating the execution of big data processing tasks in a defined workflow is a useful feature in real-world practice. Oozie workflows can be parameterized using variables within the workflow definition. When submitting a workflow job, values for those parameters must also be provided. If properly parameterized, for example with different output directories, several identical workflow jobs can run concurrently.
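As a rough illustration of parameterization, the Python sketch below substitutes `${name}` placeholders in a path template and refuses to proceed when a value is missing. It only mimics the idea and is not Oozie's own resolution logic. The template, the parameter names `nameNode` and `outputDir`, and the helper `resolve` are hypothetical.

```python
import re

_PARAM = re.compile(r"\$\{(\w+)\}")

# Hypothetical output path taken from a workflow definition.
OUTPUT_TEMPLATE = "${nameNode}/data/out/${outputDir}"


def resolve(template, params):
    """Replace each ${name} with params[name]; fail if any value is missing."""
    missing = {name for name in _PARAM.findall(template) if name not in params}
    if missing:
        raise ValueError(f"missing workflow parameters: {sorted(missing)}")
    return _PARAM.sub(lambda match: params[match.group(1)], template)


if __name__ == "__main__":
    common = {"nameNode": "hdfs://namenode.example.com:8020"}
    # Two submissions of the same workflow that differ only in outputDir.
    print(resolve(OUTPUT_TEMPLATE, {**common, "outputDir": "run-a"}))
    print(resolve(OUTPUT_TEMPLATE, {**common, "outputDir": "run-b"}))
```

Because each submission resolves to a different output directory, the two otherwise identical jobs do not write to the same location and can run at the same time.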