Apache Pig and How to Process Data Using Apache Pig
Pig is a high-level scripting platform that Apache Hadoop users rely on to process data. With it, data workers can process data without learning a lower-level programming language such as Java. If you are already familiar with scripting languages and SQL, the task is even more approachable, since Pig Latin, the language used by Apache Pig, is a simple scripting language.
Pig Latin is complete, so you can express an entire data-processing pipeline in it alone. Pig also offers a further advantage: you can extend it with user-defined functions (UDFs) written in major programming languages such as JRuby, Python, and even Java. Using Apache Pig, you can therefore build large, complex applications that address real business problems. But how do you process data using Apache Pig?
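As an illustration, here is a minimal sketch of extending Pig with a Python UDF. The file name udfs.py, the namespace myudfs, and the function to_upper are all hypothetical; udfs.py would need to define to_upper itself, with an @outputSchema decorator.

    -- Register a Python file as a UDF namespace (file and names are hypothetical)
    REGISTER 'udfs.py' USING jython AS myudfs;

    -- Load some sample data and apply the UDF to one field
    events = LOAD '/tmp/events.csv' USING PigStorage(',') AS (id:chararray, speed:int);
    shouted = FOREACH events GENERATE myudfs.to_upper(id) AS id, speed;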
Steps to Follow When Processing Data with Apache Pig
Step 1: Download the Sample Data
First, you need the path from which the data is to be downloaded. Once you have downloaded it, unzip it into a directory you can refer back to during processing. Make sure you note where the files end up after unzipping, because you will need that location in the next step.
Step 2: Upload the Data So That You Can Process It
Remember that the data must reside in HDFS, the Hadoop Distributed File System, which is separate from the local file system. Select the HDFS Files view using the menu on your screen, then follow the normal upload procedure: click the upload button, then the browse button, and locate the file you downloaded. Select the appropriate file. Note that the uploaded files now live in HDFS rather than on the local file system.
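If you prefer the command line to the Files view, Pig's Grunt shell can issue HDFS commands directly through its fs command. A minimal sketch, assuming the downloaded file is named truck_event_text_partition.csv and your HDFS home directory is /user/maria_dev (both names are assumptions):

    -- Copy the unzipped file from the local file system into HDFS
    fs -copyFromLocal truck_event_text_partition.csv /user/maria_dev/

    -- Verify that the file arrived
    fs -ls /user/maria_dev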
Step 3: Open Pig to Create Your Script
Click the Pig view in the menu, and the Pig interface will appear. On the left you will see any saved scripts and previously completed jobs; the pane on the right is where you compose and edit the script itself. Click the 'New Script' button to begin. Name the script (for example, 'truck-events'), leave the HDFS location blank, click the 'Create' button, and go to the next step.
Step 4: Define a Relation
After creating the script, you need to define the relation. This takes two statements: the first defines a relation named truck_events, and the second uses the DESCRIBE command to view that relation. Once both are in place, save the script; a minimal sketch follows below.
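Here is a minimal sketch of those two statements, assuming the file was uploaded to /user/maria_dev and contains comma-separated values; the path and the column names are assumptions:

    -- Define the relation by loading the uploaded file
    truck_events = LOAD '/user/maria_dev/truck_event_text_partition.csv'
        USING PigStorage(',')
        AS (driverId:int, truckId:int, eventTime:chararray, eventType:chararray);

    -- View the structure of the relation
    DESCRIBE truck_events;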
Step 5: Click Execute to Run the Script
Clicking the Execute button launches one or more MapReduce jobs, and you will notice the page change to reflect them. You can follow the job's progress on the progress bar, or end the process early by clicking the Kill button.
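If you are curious about what those jobs will do before running them, Pig's EXPLAIN operator prints the logical, physical, and MapReduce plans it compiles a relation into (truck_events here continues from the sketch in Step 4):

    -- Show the execution plans Pig will compile for this relation
    EXPLAIN truck_events;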
View the Data After the Process Is Complete
To view the data, add the DUMP command to the Pig script, then save and execute it again. You can also sort the output using the ORDER BY operator, as in the sketch below.
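A minimal sketch, reusing the truck_events relation from Step 4 (the sort field driverId is an assumption about the data's columns):

    -- Sort the relation, then print the result to the console
    ordered_events = ORDER truck_events BY driverId;
    DUMP ordered_events;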
Note that Apache Pig and Apache Hive are different tools, and the decision to use one or the other depends on your purpose and the type of data involved.