Polars (software)
   HOME

TheInfoList



OR:

Polars is an open-source software library for data manipulation. Polars is built with an OLAP query engine implemented in
Rust Rust is an iron oxide, a usually reddish-brown oxide formed by the reaction of iron and oxygen in the catalytic presence of water or air moisture. Rust consists of hydrous iron(III) oxides (Fe2O3·nH2O) and iron(III) oxide-hydroxide (FeO(OH) ...
using Apache Arrow Columnar Format as the memory model. Although built using Rust, there are Python, Node.js, R, and
SQL Structured Query Language (SQL) (pronounced ''S-Q-L''; or alternatively as "sequel") is a domain-specific language used to manage data, especially in a relational database management system (RDBMS). It is particularly useful in handling s ...
API interfaces to use Polars.


History

The first code to be committed was made on June 23, 2020. Polars started as a "pet project" by Ritchie Vink, who was motivated to fill the gap of a data processing library in the Rust programming language. Ritchie Vink and Chiel Peters co-founded a company to develop Polars, after working together at the company Xomnia for five years. In 2023, Vink and Peters successfully closed a seed round of approximately $4 million, which was led by Bain Capital Ventures.


Features

The core object in Polars is the DataFrame, similar to other data processing software libraries. Contexts and expressions are important concepts to Polars' syntax. A context is the specific environment in which an expression is evaluated. Meanwhile, an expression refers to computations or transformations that are performed on data columns. Polars has three main contexts: * selection: choosing columns from a DataFrame * filtering: subset a DataFrame by keeping rows that meet specified conditions * group by/aggregation: calculating summary statistics within subgroups of the data Polars was also designed to be "intuitive and
ave is a Latin word, used by the Roman Empire, Romans as a salutation (greeting), salutation and greeting, meaning 'wikt:hail, hail'. It is the singular imperative mood, imperative form of the verb , which meant 'Well-being, to be well'; thus on ...
concise syntax for data processing tasks".


Compared with other data processing software


Compared to pandas


Feature differences

Given that Polars was designed to work on a single machine, this prompts many comparisons with the similar data manipulation software,
pandas Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) is a controversial hypothetical diagnosis for a subset of children with rapid onset of obsessive-compulsive disorder (OCD) or tic disorders. Sy ...
. One big advantage that Polars has over pandas is performance, where Polars is 5 to 10 times faster than pandas on similar tasks. Additionally, pandas requires around 5 to 10 times as much RAM as the size of the dataset, which compares to the 2 to 4 times needed for Polars. These performance increases may be due to Polars being written in Rust and supporting parallel operations. Polars is also designed to use
lazy evaluation In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an Expression (computer science), expression until its value is needed (non-strict evaluation) and which avoids repeated eva ...
(where a query optimizer will use the most efficient evaluation after looking at all steps) compared with pandas using
eager evaluation In a programming language, an evaluation strategy is a set of rules for evaluating expressions. The term is often used to refer to the more specific notion of a ''parameter-passing strategy'' that defines the kind of value that is passed to the ...
(where steps are performed immediately). Some research on comparing pandas and Polars completing data analysis tasks show that Polars is more memory-efficient than pandas, where "Polars consumes 63% of the energy needed by pandas on the TPC-H benchmark and uses eight times less energy than pandas on synthetic data". Polars does not have an index for the DataFrame object, which contrasts pandas' use of an index.


Syntax differences

Polars and pandas have similar syntax for reading in data using a read_csv() method, but have different syntax for calculating a rolling mean. Code using pandas: import pandas as pd # Read in data df_temp = pd.read_csv( "temp_record.csv", index_col="date", parse_dates=True, dtype= ) # Explore data print(df_temp.dtypes) print(df_temp.head()) # Calculate rolling mean df_temp.rolling(2).mean() Code using Polars: import polars as pl # Read in data df_temp = pl.read_csv( "temp_record.csv", try_parse_dates=True, dtypes= ).set_sorted("date") # Explore data print(df_temp.dtypes) print(df_temp.head()) # Calculate rolling average df_temp.rolling("date", period="2d").agg(pl.mean("temp"))


Compared to Dask

Dask is a Python package for applying parallel computation using NumPy, pandas, and scikit-learn, and is used for datasets that are larger than what can fit in memory. Polars is for single-machine use, while Dask is more for distributed computing.


Compared to DuckDB

DuckDB is an in-process SQL
OLAP In computing, online analytical processing (OLAP) (), is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term ''OLAP'' was created as a slight modification of the traditional database term online transaction processi ...
database system for efficient analytical queries on structured data. Both DuckDB and Polars offer excellent analytical performance, but DuckDB is more SQL-centric for running queries, while Polars is Python-centric.


Compared to Spark

Apache Spark has a Python API, PySpark, for distributed big data processing. Similar to Dask, Spark is focused on distributed computing, while Polars is for single-machine use. So Polars has an advantage when processing data on a single machine, while Spark may be preferred for larger datasets that don't fit on a single machine.


See also

* Dask *
SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier ...
*
pandas Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) is a controversial hypothetical diagnosis for a subset of children with rapid onset of obsessive-compulsive disorder (OCD) or tic disorders. Sy ...
*
R (programming language) R is a programming language for statistical computing and Data and information visualization, data visualization. It has been widely adopted in the fields of data mining, bioinformatics, data analysis, and data science. The core R language is ...
*
scikit-learn scikit-learn (formerly scikits.learn and also known as sklearn) is a free and open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support ...
*
Julia (programming language) Julia is a high-level programming language, high-level, general-purpose programming language, general-purpose dynamic programming language, dynamic programming language, designed to be fast and productive, for e.g. data science, artificial intel ...
*
List of numerical analysis software Listed here are notable end-user computer applications intended for use with numerical or data analysis: Numerical-software packages * Analytica is a widely used proprietary software tool for building and analyzing numerical models. It is a de ...


References


Further reading

* *


External links

*
User Guide - Polars
* {{GitHub, pola-rs/polars * https://github.com/ddotta/awesome-polars Free software programmed in Rust Free statistical software Software using the MIT license Data analysis software