Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
History
The open-source project to build Apache Parquet began as a joint effort between Twitter and Cloudera. Parquet was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop. The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project.
Features
Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store data.
The values in each column are stored in contiguous memory locations, providing the following benefits:
* Column-wise compression is efficient in storage space
* Encoding and compression techniques specific to the type of data in each column can be used
* Queries that fetch specific column values need not read the entire row, thus improving performance
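The last point is visible directly in reader APIs. As a minimal sketch using the pyarrow library (one of several Parquet readers; the file and column names here are hypothetical), a query can request only the columns it needs, so the other column chunks are never read from disk:

```python
import pyarrow.parquet as pq

# Read only two columns from a (hypothetical) Parquet file;
# the remaining columns are skipped entirely.
table = pq.read_table("events.parquet", columns=["user_id", "timestamp"])
print(table.num_columns)  # 2, regardless of how many columns the file holds
```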
Apache Parquet is implemented using the Apache Thrift framework, which increases its flexibility; it can work with a number of programming languages like C++, Java, Python, PHP, etc.
As of August 2015, Parquet supports the big-data-processing frameworks including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cascading, Presto and Apache Spark. It is one of the external data formats used by the pandas Python data manipulation and analysis library.
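A minimal sketch of that pandas integration (column values are illustrative; pandas delegates the file handling to a Parquet engine such as pyarrow or fastparquet):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Berlin", "Paris"], "population": [3_700_000, 2_100_000]})

# Write the frame to a Parquet file and read it back.
df.to_parquet("cities.parquet")
restored = pd.read_parquet("cities.parquet")
assert restored.equals(df)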
Compression and encoding
In Parquet, compression is performed column by column, which enables different encoding schemes to be used for text and integer data. This strategy also keeps the door open for newer and better encoding schemes to be implemented as they are invented.
Parquet supports various compression formats: snappy, gzip, LZO, brotli, zstd, and LZ4.
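The codec is chosen at write time. As a sketch with pyarrow (column names are hypothetical), it can be set for the whole file or, because compression is column-wise, per column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["alpha", "beta"], "value": [1, 2]})

# Whole-file codec...
pq.write_table(table, "data_zstd.parquet", compression="zstd")

# ...or a per-column mapping, e.g. gzip for the text column and snappy for the integers.
pq.write_table(
    table,
    "data_mixed.parquet",
    compression={"text": "gzip", "value": "snappy"},
)
```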
Dictionary encoding
Parquet has automatic dictionary encoding, enabled dynamically for data with a small number of unique values (i.e. below 10^5), which enables significant compression and boosts processing speed.
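Writers typically expose this as an option. In pyarrow, for instance, dictionary encoding can be enabled for all columns or restricted to selected ones (column names here are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality column benefits most from dictionary encoding.
table = pa.table({"country": ["DE", "DE", "FR", "DE"], "amount": [10, 20, 30, 40]})

# Enable dictionary encoding only for the "country" column.
pq.write_table(table, "orders.parquet", use_dictionary=["country"])
```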
Bit packing
Storage of integers is usually done with dedicated 32 or 64 bits per integer. For small integers, packing multiple integers into the same space makes storage more efficient.
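A conceptual illustration in Python (not Parquet's exact on-disk layout) of storing 3-bit values instead of spending 32 bits on each:

```python
def bit_pack(values, bit_width):
    """Pack non-negative integers, each occupying only bit_width bits."""
    packed = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bit_width)
        packed |= v << (i * bit_width)
    return packed

def bit_unpack(packed, bit_width, count):
    mask = (1 << bit_width) - 1
    return [(packed >> (i * bit_width)) & mask for i in range(count)]

values = [3, 7, 1, 0, 5]                  # each value fits in 3 bits
packed = bit_pack(values, 3)              # 15 bits instead of 5 * 32
assert bit_unpack(packed, 3, len(values)) == values
```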
Run-length encoding (RLE)
To optimize storage of multiple occurrences of the same value, run-length encoding is used, in which a single value is stored once along with the number of its occurrences.
Parquet implements a hybrid of bit packing and RLE, in which the encoding switches based on which produces the best compression results. This strategy works well for certain types of integer data and combines well with dictionary encoding.
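A simplified sketch of the run-length idea (Parquet's actual hybrid encoding interleaves RLE runs with bit-packed groups and is more compact than this):

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]

def rle_decode(runs):
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out

data = [7, 7, 7, 7, 0, 0, 3, 3, 3]
print(rle_encode(data))                   # [(7, 4), (0, 2), (3, 3)]
assert rle_decode(rle_encode(data)) == data
```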
Cloud Storage and Data Lakes
Parquet is widely used as the underlying file format in modern cloud-based data lake architectures. Cloud storage systems such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage commonly store data in Parquet format due to its efficient columnar representation and retrieval capabilities. Data lakehouse frameworks, including Apache Iceberg, Delta Lake, and Apache Hudi, build an additional metadata layer on top of Parquet files to support features such as schema evolution, time-travel queries, and ACID-compliant transactions. In these architectures, Parquet files serve as the immutable storage layer while the table formats manage data versioning and transactional integrity.
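Because object stores are reachable over standard filesystem abstractions, analytics tools read Parquet from them directly. A minimal sketch with pandas, assuming the optional s3fs dependency is installed and using a hypothetical bucket and path:

```python
import pandas as pd

# Read a Parquet file straight from object storage (hypothetical location);
# requires an fsspec-compatible filesystem backend such as s3fs.
df = pd.read_parquet("s3://example-bucket/events/2024/01/events.parquet")
```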
Comparison
Apache Parquet is comparable to the RCFile and Optimized Row Columnar (ORC) file formats; all three fall under the category of columnar data storage within the Hadoop ecosystem. They all have better compression and encoding with improved read performance at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution,["All About Parquet Part 04 — Schema Evolution in Parquet". Medium. https://medium.com/data-engineering-with-dremio/all-about-parquet-part-04-schema-evolution-in-parquet-c2c2b1aa6141] i.e., the schema can be modified according to the changes in the data. It also provides the ability to add new columns and merge schemas that do not conflict.
["All About Parquet Part 04 — Schema Evolution in Parquet". Medium. https://medium.com/data-engineering-with-dremio/all-about-parquet-part-04-schema-evolution-in-parquet-c2c2b1aa6141]
Apache Arrow is designed as an in-memory complement to on-disk columnar formats like Parquet and ORC. The Arrow and Parquet projects include libraries that allow for reading and writing between the two formats.
["Reading and Writing the Apache Parquet Format". Apache Arrow Documentation. https://arrow.apache.org/docs/python/parquet.html]
Implementations
Known implementations of Parquet include:
* Apache Parquet (Java)
* Apache Arrow Parquet (C++)
* Apache Arrow Parquet (Rust)
* Apache Arrow Parquet (Go)
* jorgecarleitao/parquet2 (Rust)
* cuDF Parquet (C++)
* fastparquet (Python)
* Apache Impala Parquet (C++)
* DuckDB Parquet (C++)
* Polars Parquet (Rust)
* Velox Parquet (C++)
* parquet-go (ex-segmentio) (Go)
* parquet-go (xitongsys) (Go)
* hyparquet (JS)
See also
* Apache Arrow
* Apache Pig
* Apache Hive
* Apache Impala
* Apache Drill
* Apache Kudu
* Apache Spark
* Apache Thrift
* Trino (SQL query engine)
* Presto (SQL query engine)
* SQLite embedded database system
* DuckDB embedded OLAP database with Parquet support
References
External links
* Dremel paper