Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.


History

The open-source project to build Apache Parquet began as a joint effort between Twitter and Cloudera. Parquet was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop. The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project.


Features

Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store data. The values in each column are stored in contiguous memory locations, providing the following benefits:

* Column-wise compression is efficient in storage space
* Encoding and compression techniques specific to the type of data in each column can be used
* Queries that fetch specific column values need not read the entire row, thus improving performance

The format is implemented using the Apache Thrift framework, which increases its flexibility; it can work with a number of programming languages like C++, Java, Python, PHP, etc. As of August 2015, Parquet supports big-data-processing frameworks including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cascading, Presto, and Apache Spark. It is also one of the external data formats used by the pandas Python data manipulation and analysis library, as illustrated in the sketch below.
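As a minimal sketch, assuming pandas and the pyarrow engine are installed (file and column names below are hypothetical), the columnar layout lets a reader fetch only the columns it needs instead of scanning whole rows:

    import pandas as pd

    # Write a small DataFrame to Parquet (hypothetical file name).
    df = pd.DataFrame({"user": ["a", "b", "c"],
                       "score": [1.5, 2.0, 0.5],
                       "ts": [1, 2, 3]})
    df.to_parquet("scores.parquet", engine="pyarrow")

    # Columnar layout: only the requested columns are read from disk;
    # the "ts" column is never touched.
    subset = pd.read_parquet("scores.parquet", columns=["user", "score"])
    print(subset)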


Compression and encoding

In Parquet, compression is performed column by column, which enables different encoding schemes to be used for text and integer data. This strategy also keeps the door open for newer and better encoding schemes to be implemented as they are invented. Parquet supports various compression formats: snappy, gzip, LZO, brotli, zstd, and LZ4.
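As a sketch of how a codec is selected per file, the pyarrow library exposes it through a compression argument; the file names here are hypothetical, and LZO is omitted because it is not available in all writer implementations:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"text": ["lorem ipsum"] * 10_000,
                      "n": list(range(10_000))})

    # Write the same table once per codec; comparing the resulting
    # file sizes shows each codec's trade-off for this data.
    for codec in ["snappy", "gzip", "brotli", "zstd", "lz4"]:
        pq.write_table(table, f"data_{codec}.parquet", compression=codec)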


Dictionary encoding

Parquet has automatic dictionary encoding, enabled dynamically for data with a small number of unique values (i.e., below 10^5), which enables significant compression and boosts processing speed.
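A minimal sketch with pyarrow, whose writer enables dictionary encoding by default; the column contents and file names are illustrative:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A low-cardinality column: two distinct values repeated many times.
    table = pa.table({"country": ["US", "DE"] * 50_000})

    # Dictionary encoding is on by default; writing a second copy with it
    # disabled makes the size difference visible on disk.
    pq.write_table(table, "dict_on.parquet", use_dictionary=True)
    pq.write_table(table, "dict_off.parquet", use_dictionary=False)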


Bit packing

Integers are usually stored using a dedicated 32 or 64 bits each. For small integers, packing multiple integers into the same space makes storage more efficient.
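A toy illustration of the idea in plain Python (not Parquet's actual encoder): each value below fits in 3 bits, so five values occupy 2 bytes instead of the 20 bytes needed for five 32-bit integers:

    def pack_ints(values, bit_width):
        # Pack each value into `bit_width` bits of one little-endian buffer.
        buf = 0
        for i, v in enumerate(values):
            buf |= (v & ((1 << bit_width) - 1)) << (i * bit_width)
        n_bytes = (len(values) * bit_width + 7) // 8
        return buf.to_bytes(n_bytes, "little")

    packed = pack_ints([1, 2, 3, 4, 5], bit_width=3)
    print(len(packed))  # 2 bytes, versus 20 bytes as 32-bit integers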


Run-length encoding (RLE)

To optimize storage of multiple occurrences of the same value, run-length encoding is used, whereby a single value is stored once along with the number of its consecutive occurrences. Parquet implements a hybrid of bit packing and RLE, in which the encoding switches based on which produces the best compression results. This strategy works well for certain types of integer data and combines well with dictionary encoding.
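The following toy encoder (not Parquet's on-disk format) shows the run-length idea: each run of repeated values collapses to a (value, count) pair:

    def rle_encode(values):
        # Collapse consecutive repeats into (value, count) pairs.
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1
            else:
                runs.append([v, 1])
        return [tuple(r) for r in runs]

    print(rle_encode([7, 7, 7, 7, 0, 0, 7]))  # [(7, 4), (0, 2), (7, 1)]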


Cloud storage and data lakes

Parquet is widely used as the underlying file format in modern cloud-based data lake architectures. Cloud storage systems such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage commonly store data in Parquet format due to its efficient columnar representation and retrieval capabilities. Data lakehouse frameworks, including Apache Iceberg, Delta Lake, and Apache Hudi, build an additional metadata layer on top of Parquet files to support features such as schema evolution, time-travel queries, and ACID-compliant transactions. In these architectures, Parquet files serve as the immutable storage layer while the table formats manage data versioning and transactional integrity.
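As a sketch, pyarrow can read Parquet directly from object storage through its filesystem layer; the bucket, key, and region below are hypothetical, and credentials are assumed to come from the environment:

    import pyarrow.parquet as pq
    from pyarrow import fs

    # Hypothetical bucket and key; credentials resolved from the environment.
    s3 = fs.S3FileSystem(region="us-east-1")
    table = pq.read_table("example-bucket/events/2024/01/data.parquet",
                          filesystem=s3)
    print(table.num_rows)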


Comparison

Apache Parquet is comparable to the RCFile and Optimized Row Columnar (ORC) file formats; all three fall under the category of columnar data storage within the Hadoop ecosystem. All of them offer better compression and encoding with improved read performance, at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution ("All About Parquet Part 04 — Schema Evolution in Parquet", Medium), i.e., the schema can be modified according to changes in the data. It also provides the ability to add new columns and to merge schemas that do not conflict. Apache Arrow is designed as an in-memory complement to on-disk columnar formats like Parquet and ORC. The Arrow and Parquet projects include libraries that allow reading and writing between the two formats ("Reading and Writing the Apache Parquet Format", Apache Arrow Documentation).
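The interchange described above can be sketched with the pyarrow library (file name illustrative): an Arrow table held in memory is persisted as Parquet and read back without loss.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Arrow table in memory -> Parquet file on disk -> Arrow table again.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    pq.write_table(table, "example.parquet")
    roundtrip = pq.read_table("example.parquet")
    assert roundtrip.equals(table)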


Implementations

Known implementations of Parquet include:

* Apache Parquet (Java)
* Apache Arrow Parquet (C++)
* Apache Arrow Parquet (Rust)
* Apache Arrow Parquet (Go)
* jorgecarleitao/parquet2 (Rust)
* cuDF Parquet (C++)
* fastparquet (Python)
* Apache Impala Parquet (C++)
* DuckDB Parquet (C++)
* Polars Parquet (Rust)
* Velox Parquet (C++)
* parquet-go (ex-segmentio) (Go)
* parquet-go (xitongsys) (Go)
* hyparquet (JS)


See also

* Apache Arrow
* Apache Pig
* Apache Hive
* Apache Impala
* Apache Drill
* Apache Kudu
* Apache Spark
* Apache Thrift
* Trino (SQL query engine)
* Presto (SQL query engine)
* SQLite (embedded database system)
* DuckDB (embedded OLAP database with Parquet support)


References

* "All About Parquet Part 04 — Schema Evolution in Parquet". Medium. https://medium.com/data-engineering-with-dremio/all-about-parquet-part-04-schema-evolution-in-parquet-c2c2b1aa6141
* "Reading and Writing the Apache Parquet Format". Apache Arrow Documentation. https://arrow.apache.org/docs/python/parquet.html


External links

* Dremel paper