Apache Parquet is a
free and open-source
column-oriented data storage format in the
Apache Hadoop ecosystem. It is similar to
RCFile and
ORC, the other columnar-storage file formats in
Hadoop
, and is compatible with most of the data processing frameworks around
Hadoop
. It provides efficient
data compression
and
encoding
schemes with enhanced performance to handle complex data in bulk.
History
The
open-source
project to build Apache Parquet began as a joint effort between
Twitter
and
Cloudera
. Parquet was designed as an improvement on the Trevni columnar storage format created by
Doug Cutting
, the creator of Hadoop. The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project.
Features
Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex
data structures
that can be used to store data.
The values in each column are stored in contiguous memory locations, providing the following benefits:
* Column-wise compression is efficient in storage space
* Encoding and compression techniques specific to the type of data in each column can be used
* Queries that fetch specific column values need not read the entire row, thus improving performance
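The benefits listed above can be illustrated with a minimal sketch in plain Python. It is not Parquet's on-disk layout; it only shows the idea of pivoting rows into per-column lists so that a query can read one column without touching the others. The sample rows, the field names, and the helper `to_columns` are hypothetical and chosen for illustration.

```python
# Hypothetical sample rows; the field names are illustrative only.
rows = [
    {"user_id": 1, "country": "DE", "clicks": 10},
    {"user_id": 2, "country": "DE", "clicks": 7},
    {"user_id": 3, "country": "US", "clicks": 12},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of per-column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query that needs only "clicks" reads one list of same-typed values
# instead of walking every field of every row.
total_clicks = sum(columns["clicks"])
print(total_clicks)
```

Because each list holds values of a single type, a type-specific encoding or compression scheme could then be applied to each column independently.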
Apache Parquet is implemented using the
Apache Thrift
framework, which increases its flexibility; it can work with a number of programming languages, such as
C++,
Java
,
Python,
PHP
, etc.
As of August 2015, Parquet supports big-data processing frameworks including
Apache Hive
,
Apache Drill
,
Apache Impala
, Apache Crunch, Apache Pig,
Cascading,
Presto
and
Apache Spark
.
Compression and encoding
In Parquet, compression is performed column by column, which enables different encoding schemes to be used for text and integer data. This strategy also keeps the door open for newer and better encoding schemes to be implemented as they are invented.
Dictionary encoding
Parquet enables dictionary encoding automatically and dynamically for data with a "small" number of unique values (i.e., below 10^5), which yields significant compression and boosts processing speed.
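A minimal sketch of the general idea behind dictionary encoding follows, written in plain Python. It does not reproduce Parquet's actual page format; the function names `dictionary_encode` and `dictionary_decode` and the sample column are assumptions made for illustration.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) for a sequence of hashable values."""
    dictionary = []   # distinct values, in first-seen order
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original values from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


column = ["DE", "US", "DE", "DE", "FR", "US"]
dictionary, indices = dictionary_encode(column)
print(dictionary, indices)
```

When the number of distinct values is small, the indices are small integers, which is why this encoding tends to combine well with the integer techniques described in the following subsections.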
Bit packing
Storage of integers is usually done with dedicated 32 or 64 bits per integer. For small integers, packing multiple integers into the same space makes storage more efficient.
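The following sketch shows the principle of bit packing in plain Python, assuming non-negative integers. It packs values into a single Python integer for simplicity; Parquet's real byte layout differs, and the helper names `bit_width`, `pack`, and `unpack` are hypothetical.

```python
def bit_width(values):
    """Smallest number of bits that can hold every value (at least 1)."""
    return max(max(values).bit_length(), 1)


def pack(values, width):
    """Pack non-negative integers into one int, `width` bits per value."""
    packed = 0
    for position, value in enumerate(values):
        if value < 0 or value >= (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        packed |= value << (position * width)
    return packed


def unpack(packed, width, count):
    """Recover `count` values of `width` bits each from a packed int."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


values = [3, 7, 0, 5, 1, 6, 2, 4]
width = bit_width(values)
packed = pack(values, width)

# Bytes needed for the packed form, rounded up to a whole byte.
n_bytes = (len(values) * width + 7) // 8
print(width, n_bytes)
```

In this example eight values fit in 3 bytes, compared with 32 bytes if each were stored as a dedicated 32-bit integer.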
Run-length encoding (RLE)
To optimize storage of multiple occurrences of the same value, a single value is stored once along with the number of occurrences.
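As a minimal illustration, run-length encoding can be sketched in plain Python with `itertools.groupby`. The (value, count) pair representation and the function names are assumptions for this example and do not reflect Parquet's binary format.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


column = [0, 0, 0, 0, 1, 1, 2, 2, 2]
encoded = rle_encode(column)
print(encoded)
```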
Parquet implements a hybrid of bit packing and RLE, in which the encoding switches based on which produces the best compression results. This strategy works well for certain types of integer data and combines well with dictionary encoding.
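The choice between the two encodings can be sketched as a simple size comparison. This is a simplified illustration, not Parquet's actual hybrid format: the cost model (a fixed 32-bit count per run, passed as the hypothetical parameter `count_bits`) and the decision being made once for a whole column are both assumptions.

```python
from itertools import groupby


def bit_width(values):
    """Smallest number of bits that can hold every value (at least 1)."""
    return max(max(values).bit_length(), 1)


def bit_packed_bits(values):
    """Estimated size when every value is stored in `bit_width` bits."""
    return len(values) * bit_width(values)


def rle_bits(values, count_bits=32):
    """Estimated size when each run stores one value plus a fixed-size count."""
    runs = sum(1 for _ in groupby(values))
    return runs * (bit_width(values) + count_bits)


def choose_encoding(values):
    """Pick whichever estimate is smaller for these values."""
    return "rle" if rle_bits(values) < bit_packed_bits(values) else "bit_packed"


long_runs = [5] * 1000 + [2] * 1000   # two long runs
no_runs = list(range(8)) * 250        # neighbours always differ

print(choose_encoding(long_runs), choose_encoding(no_runs))
```

Data with long runs favours RLE under this model, while data whose neighbouring values always differ favours bit packing.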
Comparison
Apache Parquet is comparable to the RCFile and Optimized Row Columnar (ORC) file formats; all three fall under the category of columnar data storage within the Hadoop ecosystem. All three offer better compression and encoding with improved read performance, at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution, i.e., the schema can be modified according to changes in the data. It also provides the ability to add new columns and merge schemas that do not conflict.
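The idea of merging schemas that do not conflict can be sketched in plain Python. Schemas are modelled here as dictionaries mapping column names to type names; this representation, the function `merge_schemas`, and the sample schemas are assumptions for illustration and are not a Parquet API.

```python
def merge_schemas(left, right):
    """Merge two {column: type} schemas; fail if a shared column disagrees."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            raise ValueError(
                f"conflicting types for column {name!r}: "
                f"{merged[name]} vs {dtype}"
            )
        merged[name] = dtype  # new columns are appended
    return merged


# Hypothetical schemas of two files written at different times.
schema_v1 = {"user_id": "int64", "country": "string"}
schema_v2 = {"user_id": "int64", "clicks": "int32"}

merged = merge_schemas(schema_v1, schema_v2)
print(merged)
```

A column that appears in both schemas with different types raises an error in this sketch, which corresponds to the conflicting case that cannot be merged.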
See also
* Apache Pig
* Apache Hive
* Apache Impala
* Apache Drill
* Apache Kudu
* Apache Spark
* Apache Thrift
* Trino (SQL query engine)
* Presto (SQL query engine)
* SQLite embedded database system
External links
* Dremel paper
* How to Be a Hero with Powerful Apache Parquet, Google and Amazon