23 September 2018
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Data flows through a pipeline step by step and can be stored at any point in the pipeline. Pig allows users to analyze large unstructured datasets by transforming them and applying functions to them.
Apache Pig can be used to process complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.
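A dataflow of this kind can be sketched in Pig Latin. The file names, schemas, and field names below are hypothetical and stand in for the log and geo-location sources described above:

```
-- Sketch: join web server logs with geo-location data and count hits per country
logs   = LOAD 'weblogs.txt' USING PigStorage('\t') AS (ip:chararray, url:chararray, ts:long);
geo    = LOAD 'geo_lookup.csv' USING PigStorage(',') AS (ip:chararray, country:chararray);
joined = JOIN logs BY ip, geo BY ip;            -- attach a country to each request
by_cty = GROUP joined BY geo::country;
counts = FOREACH by_cty GENERATE group AS country, COUNT(joined) AS hits;
STORE counts INTO 'hits_by_country';
```

Each statement defines a relation built from the previous one, which is what makes Pig a dataflow language rather than a query language.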
Pig has an intermediate layer that converts Pig Latin scripts into MapReduce jobs. The main phases in this layer are:
(a) Query Parsing: the Pig statement is parsed as a query.
(b) Semantic Checking: the syntax and semantics of the Pig script are checked.
(c) Optimization: Pig looks for the easiest and fastest way to get the data.
(d) Physical Planning: the optimized logical plan is translated into a physical plan of operators.
(e) MapReduce Processing: the physical plan is compiled into MapReduce jobs and executed.
Note that full logical validation is not possible in Pig, but semantic checking is.
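You can inspect the output of these phases yourself with the EXPLAIN operator in the Grunt shell, which prints the logical, physical, and MapReduce plans for a relation (the file name and schema here are illustrative):

```
grunt> A = LOAD 'data.txt' AS (f1:int, f2:int);
grunt> B = FILTER A BY f1 > 0;
grunt> EXPLAIN B;
```

The output shows how the script moves through the planning phases before any job is submitted, which is useful when debugging slow or unexpected plans.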
Apache Pig is a scripting language that can explore huge data sets with the help of an engine that executes data flows in parallel on Hadoop.
There are two main components in Apache Pig: the Pig Latin language and the Pig interpreter that executes it.
Pig Latin provides a number of features, some of which are mentioned below.
Below are some of the limitations that arise from Apache Pig's architecture and working mechanism.
The latest version of Pig has six execution modes or exectypes, some of which are experimental and may not be available in all versions.
Use the command below to run Pig in local mode.
pig -x local
Local mode is useful for debugging and catching syntax errors in a Pig script using a small subset of the data.
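For example, you might test a script against a small sample of the input on the local filesystem before running it on the cluster (the file and script names here are hypothetical):

```
head -1000 big_input.txt > sample.txt    # take a small subset of the data
pig -x local debug_script.pig            # run against the local filesystem, no cluster needed
```

Because local mode runs in a single JVM, syntax and logic errors surface in seconds rather than after a full cluster job submission.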
pig -x tez_local
Note: Tez local mode is experimental. There are some queries which just error out on bigger data in local mode.
pig -x spark_local
Note: Spark local mode is experimental. There are some queries which just error out on bigger data in local mode.
# Two ways to invoke Pig in MapReduce mode (the default)
pig
pig -x mapreduce
pig -x tez
pig -x spark
In Spark execution mode, it is necessary to set the SPARK_MASTER environment variable to an appropriate value (local for local mode, yarn-client for yarn-client mode, mesos://host:port for Spark on Mesos, or spark://host:port for a Spark cluster).
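For example, in a shell session this configuration might look like the following (the script name is illustrative):

```
export SPARK_MASTER=local      # or yarn-client, spark://host:port, mesos://host:port
pig -x spark my_script.pig
```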
We can run Pig commands in three ways.
Grunt Shell/Interactive Shell
Pig Script File
Embedded Program in Java or another language
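The embedded approach can be sketched with Pig's PigServer Java API; the input path, filter, and output path below are hypothetical, and running this requires the Pig libraries on the classpath:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
    public static void main(String[] args) throws Exception {
        // Start Pig in local mode from inside a Java program
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("B = FILTER A BY line MATCHES '.*error.*';");
        pig.store("B", "error_lines");   // materializes the filtered relation to disk
    }
}
```

Embedding is useful when a Pig dataflow needs to be driven programmatically, for example from a scheduler or a larger Java application.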
Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data – exactly the operations that MapReduce was originally designed for.
Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script.
Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
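As an illustration of this brevity, the classic word-count job, which takes dozens of lines of Java MapReduce, is only a few lines of Pig Latin (file names here are hypothetical):

```
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO 'wordcount_out';
```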
Hive provides data warehousing facilities on top of an existing Hadoop cluster. It also provides an SQL-like interface, which makes your work easier if you are coming from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.
Pig is basically a dataflow language that allows us to process enormous amounts of data easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them with the Pig interpreter. Pig makes life a lot easier, because writing MapReduce directly is not always easy; in some cases it can be a real pain.
Hive should be used for analytical querying of data collected over a period of time - for instance, calculating trends or summarizing website logs. Hive should not be used for real-time querying, since it can take a while before any results are returned.
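As a rough illustration of the two styles, the same aggregation might look like this in each (the table, file, and field names are hypothetical):

```
-- HiveQL: declarative and SQL-like, over a pre-defined table
SELECT country, COUNT(*) FROM visits GROUP BY country;

-- Pig Latin: a step-by-step dataflow over raw files
v = LOAD 'visits.csv' USING PigStorage(',') AS (ip:chararray, country:chararray);
g = GROUP v BY country;
c = FOREACH g GENERATE group AS country, COUNT(v);
```

Hive suits analysts who already think in SQL over warehouse tables, while Pig suits developers building multi-step transformation pipelines over raw data.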