Beginner's Guide for Pig

Pig was initially developed by Yahoo! to let people using Apache Hadoop focus more on analyzing large datasets and spend less time writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig's simple SQL-like scripting language is called Pig Latin, and Pig Latin programs are executed in Pig's own runtime environment.

Pig translates a Pig Latin script into MapReduce jobs so that it can be executed within YARN (Yet Another Resource Negotiator, a cluster management technology) and operate on datasets stored in the Hadoop Distributed File System (HDFS).

Pig has two execution modes. They are:

  1. MapReduce Mode
  2. Local Mode

MapReduce Mode

When Pig runs in MapReduce mode, it operates on files stored in HDFS on the Hadoop cluster. MapReduce mode is the default mode.

The command for running Pig in MapReduce mode is 'pig'.

This opens the grunt shell, where the user can enter Pig Latin statements.

Note: all Hadoop daemons should be running before starting Pig in MapReduce mode.

Local Mode

When Pig runs in local mode, it runs on a single machine, using the local host and the local file system; access to a Hadoop cluster is not required.

The command for running Pig in local mode is as follows:

pig -x local

Before going further to the examples, let's summarize the steps of Pig execution.

  1. The first step in a Pig program is to load the data you want to manipulate from HDFS.
  2. The data is then run through a set of transformations (which are translated into a series of mapper and reducer tasks).
  3. Finally, the data is either dumped to the screen or stored in a file.
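Putting the three steps together, a minimal Pig Latin script might look like the following. This is a sketch for illustration; the input file 'student', its schema, and the filter condition are assumptions, not part of the dataset used later in this post:

A = LOAD 'student' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
B = FILTER A BY gpa >= 3.0;  -- transformation: keep only students with a gpa of 3.0 or above
DUMP B;                      -- print the result to the screen
-- or: STORE B INTO 'output' USING PigStorage('\t');  -- store the result in a file instead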

Load

By default, Pig assumes that the dataset it operates on contains rows whose columns are separated by a tab ('\t'). In general, not every dataset is in this format, so the delimiter has to be specified when loading the data.

The command for loading data is as follows:

A = LOAD 'student' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);

This command loads the data as structured text into a relation (alias) named 'A'. Datasets can also be comma (',') or semicolon (';') separated.
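For example, a comma-separated version of the same data could be loaded by changing the delimiter passed to PigStorage (the file name 'student.csv' is assumed here for illustration):

A = LOAD 'student.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);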

Pig Commands - Example

Let's look at some Pig commands and see how they work in MapReduce mode, using an example dataset.

Suppose we have a dataset whose columns and data are defined as follows:

Name      Department   Salary

mohan     IT           55000
raju      MEC          40200
manju     ECE          65400
kiran     CS           45600
prateek   EEE          52700
sanju     IT           57300
sam       CVL          92400
ashish    CS           35900
prateek   CS           56700
manju     IT           64300
kiran     MEC          41020

Let's see the outcome of some of these Pig operations on this dataset:

  1. Load - read the dataset into a relation.
  2. Describe - display the schema of the relation.
  3. Group - group the rows by a column, such as Department.
  4. Average - compute the average of a column with the AVG() function.
  5. Store - write the results to a file in HDFS.
  6. Cat - view the contents of the stored file from the grunt shell.
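These operations can be sketched together as one Pig Latin session on the employee dataset above. The input file name 'employee' and the output path 'avg_salary_out' are assumptions for illustration:

emp = LOAD 'employee' USING PigStorage('\t') AS (name:chararray, department:chararray, salary:int);
DESCRIBE emp;                   -- displays the schema of the relation emp
grp = GROUP emp BY department;  -- one group of rows per department
avg_sal = FOREACH grp GENERATE group AS department, AVG(emp.salary) AS avg_salary;
STORE avg_sal INTO 'avg_salary_out' USING PigStorage('\t');
-- cat avg_salary_out           (run from the grunt shell to view the stored result)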

There are several commands and functions to operate with in Pig. The more use cases you practice, the better you will become at performing data analysis.

Keep visiting our website Acadgild for more updates on Big Data and other technologies.


