Apache Parquet
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance for handling complex data in bulk.

History

The open-source project to build Apache Parquet began as a joint effort between Twitter and Cloudera. Parquet was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop. The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project.

Features

Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store data. The values in each column ...
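As a sketch of how the columnar layout is typically used from Python, the following assumes the pandas library with a Parquet engine such as pyarrow installed; the file name and data are illustrative.

```python
import pandas as pd

# Build a small table and write it in the columnar Parquet format.
# Requires a Parquet engine such as pyarrow or fastparquet.
df = pd.DataFrame({
    "user": ["alice", "bob", "carol"],
    "clicks": [10, 3, 7],
})
df.to_parquet("example.parquet", compression="snappy")

# Reading back only one column exploits the column-oriented layout:
# a columnar reader can skip the data pages of the other columns.
clicks = pd.read_parquet("example.parquet", columns=["clicks"])
print(clicks)
```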
Cross-platform
In computing, cross-platform software (also called multi-platform software, platform-agnostic software, or platform-independent software) is computer software that is designed to work on several computing platforms. Some cross-platform software requires a separate build for each platform, but some can be run directly on any platform without special preparation, being written in an interpreted language or compiled to portable bytecode for which the interpreters or run-time packages are common or standard components of all supported platforms. For example, a cross-platform application may run on Linux, macOS and Microsoft Windows. Cross-platform software may run on many platforms, or on as few as two. Some frameworks for cross-platform development are Codename One, ArkUI-X, Kivy, Qt, GTK, Flutter, NativeScript, Xamarin, Apache Cordova, Ionic ...
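As a minimal illustration of the interpreted-language approach described above, the following Python script runs unmodified on Linux, macOS and Windows because the interpreter abstracts the platform; only standard-library modules are used.

```python
import platform
import pathlib

# The same source runs anywhere a Python interpreter is installed;
# platform differences are reported at run time, not compiled in.
print(f"Running on {platform.system()} ({platform.machine()})")

# pathlib hides platform-specific path separators ('/' vs '\\').
config = pathlib.Path.home() / "app" / "settings.ini"
print(f"Config would live at: {config}")
```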
Apache Thrift
Thrift is an interface definition language (IDL) and binary communication protocol used for defining and creating services for programming languages. It was developed by Facebook. Since 2020, it has been an open-source project in the Apache Software Foundation. It uses a remote procedure call (RPC) framework and combines a software stack with a code generation engine to build cross-platform services. Thrift can connect applications written in a variety of languages and frameworks, including ActionScript, C, C++, C#, Cocoa, Delphi, Erlang, Go, Haskell, Java, JavaScript, Objective-C, OCaml, Perl, PHP, Python, Ruby ...
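To make the IDL concrete, here is a minimal, hypothetical Thrift service definition; the service, struct and field names are all illustrative. The thrift compiler generates client and server stubs for each target language from a single file like this.

```thrift
// weather.thrift -- a hypothetical service definition.
// Running `thrift --gen py weather.thrift` (or --gen java, etc.)
// produces language-specific stubs from this one source file.

struct Observation {
  1: string station,
  2: double temperatureC,
}

service WeatherService {
  // Clients in any supported language can call this RPC.
  Observation latest(1: string station),
}
```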
Brotli
Brotli is a lossless data compression algorithm developed by Jyrki Alakuijala and Zoltán Szabadka. It uses a combination of the general-purpose LZ77 lossless compression algorithm, Huffman coding and 2nd-order context modelling. Brotli is primarily used by web servers and content delivery networks to compress HTTP content, making internet websites load faster. A successor to gzip, it is supported by all major web browsers and has become increasingly popular, as it provides better compression than gzip.

History

Google employees Jyrki Alakuijala and Zoltán Szabadka initially developed Brotli in 2013 to decrease the size of transmissions of WOFF web fonts. Alakuijala and Szabadka completed the Brotli specification during 2013–2016. The specification was accompanied by a reference implementation developed by two additional authors, Evgenii Kliuchnikov and Lode Vandevenne, who had previously developed Google's Zopfli implementation of deflate and gzip compatible compression ...
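A minimal sketch of using Brotli from Python, assuming the third-party brotli package (bindings to the reference implementation) is installed; the payload is illustrative.

```python
import brotli  # third-party bindings to the reference implementation

# Compress a block of repetitive, HTTP-like content.
text = b"<html><body>" + b"<p>hello</p>" * 1000 + b"</body></html>"
compressed = brotli.compress(text, quality=11)  # 11 is the highest quality level

# Decompression restores the input exactly (lossless).
assert brotli.decompress(compressed) == text
print(f"{len(text)} bytes -> {len(compressed)} bytes")
```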
Lempel–Ziv–Oberhumer
Lempel–Ziv–Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed.

Design

The original "lzop" implementation, released in 1996, was developed by Markus Franz Xaver Johannes Oberhumer, based on earlier algorithms by Abraham Lempel and Jacob Ziv. The LZO library implements a number of algorithms with the following characteristics:

* Higher compression speed compared to DEFLATE compression
* Very fast decompression
* Requires an additional buffer during compression (of size 8 kB or 64 kB, depending on compression level)
* Requires no additional memory for decompression other than the source and destination buffers
* Allows the user to adjust the balance between compression ratio and compression speed, without affecting the speed of decompression

LZO supports overlapping compression and in-place decompression. As a block compression algorithm, it compresses and decompresses blocks of data. Block size must be the same for compression and decompression.
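A minimal sketch of compressing and decompressing with LZO from Python, assuming the third-party python-lzo binding is installed; its module-level compress/decompress calls are the assumption here.

```python
import lzo  # third-party python-lzo binding; an assumption, not stdlib

data = b"the quick brown fox jumps over the lazy dog " * 100

# python-lzo prepends a small header recording the uncompressed length,
# so the caller does not manage work buffers at this level.
compressed = lzo.compress(data)
assert lzo.decompress(compressed) == data
print(f"{len(data)} bytes -> {len(compressed)} bytes")
```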
Gzip
gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and was intended for use by GNU (from which the "g" of gzip is derived). Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993. Decompression of the gzip format can be implemented as a streaming algorithm, an important feature for Web protocols, data interchange and ETL (in standard pipes) applications.

File format

gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was intended as a replacement for LZW and other patent-encumbered data compression algorithms which, at the time, limited the usability of the compress utility and other popular archivers. "gzip" also refers to the gzip file format (described in the table below). In short ...
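To illustrate the streaming property mentioned above, here is a small Python sketch using only the standard library's gzip module: it decompresses from a pipe chunk by chunk, never holding the whole input in memory. The script name in the comment is illustrative.

```python
import gzip
import sys

# Stream-decompress gzip data from stdin to stdout, e.g.:
#   cat access.log.gz | python gunzip_stream.py > access.log
# Only one 64 KiB chunk is resident at a time, so this works in
# standard pipes regardless of the total input size.
with gzip.open(sys.stdin.buffer, "rb") as gz:
    while chunk := gz.read(64 * 1024):
        sys.stdout.buffer.write(chunk)
```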
Snappy (compression)
Snappy (previously known as Zippy) is a fast data compression and decompression library written in C++ by Google, based on ideas from LZ77, and open-sourced in 2011. It does not aim for maximum compression or for compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. Compression speed is 250 MB/s and decompression speed is 500 MB/s using a single core of a circa-2011 "Westmere" 2.26 GHz Core i7 processor running in 64-bit mode. The compression ratio is 20–100% lower than gzip. Snappy is widely used in Google projects like Bigtable and MapReduce and in compressing data for Google's internal RPC systems. It can be used in open-source projects like MariaDB ColumnStore, Cassandra, Couchbase, Hadoop, LevelDB, MongoDB, RocksDB, Lucene ...
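A minimal sketch from Python, assuming the third-party python-snappy binding is installed; the payload is illustrative.

```python
import snappy  # third-party python-snappy binding; an assumption, not stdlib

data = b"row-oriented redundant payload " * 1000

# Snappy trades compression ratio for speed: both calls are cheap
# compared to gzip, at the cost of a larger output.
compressed = snappy.compress(data)
assert snappy.uncompress(compressed) == data
print(f"{len(data)} bytes -> {len(compressed)} bytes")
```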
Pandas (software)
Pandas (styled as pandas) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, and is also a play on the phrase "Python data analysis". Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010. The development of pandas brought into Python many features for working with DataFrames comparable to those established in the R programming language. The library is built upon another library, NumPy.

History

Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible ...
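A small sketch of the tabular and time-series structures described above; the column names and values are illustrative.

```python
import pandas as pd

# A DataFrame is a 2-D labeled table built on NumPy arrays.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
df = pd.DataFrame({"price": [10.0, 10.5, 10.2, 10.8, 11.1]}, index=idx)

# Time-series operations come built in: shifting, resampling, rolling stats.
df["daily_return"] = df["price"].pct_change()
print(df["price"].rolling(window=2).mean())
```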
Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab starting in 2009, the Spark codebase was donated in 2013 to the Apache Software Foundation, which has maintained it since.

Overview

Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API. Spark and its RDDs were developed in 2012 in response ...
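A minimal PySpark sketch showing the RDD and DataFrame APIs side by side, assuming a local Spark installation; the dataset contents are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD API: a low-level, fault-tolerant distributed collection.
rdd = spark.sparkContext.parallelize([("alice", 3), ("bob", 5), ("alice", 2)])
totals = rdd.reduceByKey(lambda a, b: a + b)  # lazily builds a lineage graph
print(totals.collect())

# DataFrame API: the higher-level abstraction layered on top of RDDs.
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["user", "score"])
df.groupBy("user").sum("score").show()

spark.stop()
```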
Presto (SQL query engine)
Presto (including PrestoDB, and PrestoSQL, which was rebranded as Trino) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata, and allows the use of multiple data sources within a single query. Presto is community-driven open-source software released under the Apache License.

History

Presto was originally designed and developed at Facebook, Inc. (later renamed Meta) for its data analysts to run interactive queries on its large data warehouse in Apache Hadoop. The first four developers were Martin Traverso, Dain Sundstrom, David Phillips, and Eric Hwang. Before Presto, the data analysts at Facebook relied on Apache Hive for running SQL analytics on their multi-petabyte data warehouse. Hive was deemed too slow for Facebook's scale, and Presto was invented to fill the gap and run fast queries. Original development started in 2012 and deployed ...
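A sketch of issuing a federated query from Python, assuming the third-party presto-python-client package (prestodb) and a running coordinator; the host, catalogs and table names are all hypothetical.

```python
import prestodb  # third-party presto-python-client; an assumption

conn = prestodb.dbapi.connect(
    host="presto.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# A single query can span data sources, e.g. join Hive data with MySQL data.
cur.execute("""
    SELECT o.user_id, count(*) AS orders
    FROM hive.sales.orders AS o
    JOIN mysql.crm.users AS u ON o.user_id = u.id
    GROUP BY o.user_id
""")
print(cur.fetchall())
```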
Cascading (software)
Cascading is a software abstraction layer for Apache Hadoop and Apache Flink. Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. It is open source and available under the Apache License. Commercial support is available from Driven, Inc. Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc., which has been rebranded as Driven. Cascading is actively developed by the community, and a number of add-on modules are available.

Architecture

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, an integration API, a process planner and a process scheduler. Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map ...
Apache Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high-level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language. A sketch of the notation follows below.

History

Apache Pig was originally developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation. Naming ...
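As a sketch of the Pig Latin notation described above, here is a classic word count; the input and output paths are illustrative.

```pig
-- Hypothetical input path; each line of input.txt is one record.
lines  = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN unnests it.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```

Pig compiles this script into a sequence of MapReduce (or Tez/Spark) jobs, which is the abstraction the paragraph above describes.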
Apache Impala
Apache Impala is an open-source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Description

Apache Impala is a query engine that runs on Apache Hadoop. The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software. Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools.
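A sketch of issuing a low-latency query from Python, assuming the third-party impyla package and a reachable impalad daemon; the host and table names are hypothetical.

```python
from impala.dbapi import connect  # third-party impyla package; an assumption

# 21050 is Impala's default HiveServer2-compatible port.
conn = connect(host="impala.example.com", port=21050)
cur = conn.cursor()

# The query runs directly against data in HDFS/HBase -- no ETL step.
cur.execute("SELECT page, count(*) AS hits FROM web_logs GROUP BY page LIMIT 10")
for row in cur.fetchall():
    print(row)
```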