HOME

TheInfoList



OR:

Apache Pig is a high-level platform for creating programs that run on
Apache Hadoop Apache Hadoop () is a collection of open-source software utilities for reliable, scalable, distributed computing. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop wa ...
. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in
MapReduce MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster. A MapReduce program is composed of a ''map'' procedure, which performs filte ...
, Apache Tez, or
Apache Spark Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of Californ ...
. Pig Latin abstracts the programming from the
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
for
relational database management system A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured for ...
s. Pig Latin can be extended using
user-defined function A user-defined function (UDF) is a function provided by the user of a program or environment, in a context where the usual assumption is that functions are built into the program or environment. UDFs are usually written for the requirement of its c ...
s (UDFs) which the user can write in
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
,
JavaScript JavaScript (), often abbreviated as JS, is a programming language and core technology of the World Wide Web, alongside HTML and CSS. Ninety-nine percent of websites use JavaScript on the client side for webpage behavior. Web browsers have ...
,
Ruby Ruby is a pinkish-red-to-blood-red-colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapph ...
or
Groovy ''Groovy'' (or, less commonly, ''groovie'' or ''groovey'') is a slang colloquialism popular during the 1960s and 1970s. It is roughly synonymous with words such as "excellent", "fashionable", or "amazing", depending on context. History The word ...
and then call directly from the language.


History

Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the
Apache Software Foundation The Apache Software Foundation ( ; ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open-source software projects. The ASF was formed from a group of developers of the ...
.


Naming

Regarding the naming of the Pig programming language, the name was chosen arbitrarily and stuck because it was memorable, easy to spell, and for novelty.


Example

Below is an example of a "
Word Count The word count is the number of words in a document or passage of text. Word counting may be needed when a text is required to stay within certain numbers of words. This may particularly be the case in academia, legal proceedings, journalism and ad ...
" program in Pig Latin: input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray); -- Extract words from each line and put them into a pig bag -- datatype, then flatten the bag to get one word on each row words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word; -- filter out any words that are just white spaces filtered_words = FILTER words BY word MATCHES '\\w+'; -- create a group for each word word_groups = GROUP filtered_words BY word; -- count the entries in each group word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word; -- order the records by count ordered_word_count = ORDER word_count BY count DESC; STORE ordered_word_count INTO '/tmp/number-of-words-on-internet'; The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages on the internet.


Pig vs SQL

In comparison to SQL, Pig # has a nested relational model, # uses
lazy evaluation In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an Expression (computer science), expression until its value is needed (non-strict evaluation) and which avoids repeated eva ...
, # uses
extract, transform, load Extract, transform, load (ETL) is a three-phase computing process where data is ''extracted'' from an input source, ''transformed'' (including cleaning), and ''loaded'' into an output data container. The data can be collected from one or mor ...
(ETL), # is able to store data at any point during a
pipeline A pipeline is a system of Pipe (fluid conveyance), pipes for long-distance transportation of a liquid or gas, typically to a market area for consumption. The latest data from 2014 gives a total of slightly less than of pipeline in 120 countries ...
, # declares
execution plan Capital punishment, also known as the death penalty and formerly called judicial homicide, is the state-sanctioned killing of a person as punishment for actual or supposed misconduct. The sentence ordering that an offender be punished in s ...
s, # supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines. On the other hand, it has been argued
DBMS In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and ana ...
s are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued
RDBMS A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured forma ...
s offer out of the box support for column-storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance. Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead
declarative Declarative may refer to: * Declarative learning, acquiring information that one can speak about * Declarative memory, one of two types of long term human memory * Declarative programming, a computer programming paradigm * Declarative sentence, a t ...
. In SQL users can specify that data from two tables must be joined, but not what join implementation to use (You can specify the implementation of JOIN in SQL, thus "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task. SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built in mechanism for splitting a data processing stream and applying different operators to each sub-stream. Pig Latin script describes a
directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one ...
(DAG) rather than a pipeline. Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.


See also

*
Apache Hive Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like Interface (computing), interface to query data stored in various databases and file systems that i ...
* Sawzall — similar tool from Google


References


External links

* {{DEFAULTSORT:Pig Cloud computing Query languages Data modeling languages Hadoop
Pig The pig (''Sus domesticus''), also called swine (: swine) or hog, is an omnivorous, domesticated, even-toed, hoofed mammal. It is named the domestic pig when distinguishing it from other members of the genus '' Sus''. Some authorities cons ...
Free software programmed in Java (programming language) Software using the Apache license