
Pandas (styled as pandas) is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, as well as a play on the phrase "Python data analysis". Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010. The development of Pandas brought to Python many of the DataFrame features that were established in the R programming language. The library is built upon another library, NumPy.


History

Developer Wes McKinney started working on Pandas in 2008 while at AQR Capital Management out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR he was able to convince management to allow him to open source the library. Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library. In 2015, Pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.


Data Model

Pandas is built around data structures called ''Series'' and ''DataFrames''. Data for these collections can be imported from various file formats and data sources such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.

A ''Series'' is a 1-dimensional data structure built on top of NumPy's array. Unlike in NumPy, each data point has an associated label. The collection of these labels is called an index. Series can be used arithmetically, as in the statement series_3 = series_1 + series_2: this will align data points with corresponding index values in series_1 and series_2, then add them together to produce new values in series_3.

A ''DataFrame'' is a 2-dimensional data structure of rows and columns, similar to a spreadsheet, and analogous to a Python dictionary mapping column names (keys) to Series (values), with each Series sharing an index. DataFrames can be concatenated together or "merged" on columns or indices in a manner similar to joins in SQL. Pandas implements a subset of relational algebra, and supports one-to-one, many-to-one, and many-to-many joins. Older versions of Pandas also provided the less common ''Panel'' and ''Panel4D'', which are 3-dimensional and 4-dimensional data structures respectively; both have since been removed from the library.

Users can transform or summarize data by applying arbitrary functions. Since Pandas is built on top of NumPy, most NumPy functions work on Series and DataFrames as well. Pandas also includes built-in operations for arithmetic, string manipulation, and summary statistics such as mean, median, and standard deviation. These built-in functions are designed to handle missing data, usually represented by the floating-point value NaN.

Subsets of data can be selected by column name, index, or Boolean expressions. For example, df[df['col1'] > 5] will return all rows in the DataFrame df for which the value of the column col1 exceeds 5. Data can be grouped together by a column value, as in df['col1'].groupby(df['col2']), or by a function which is applied to the index. For example, df.groupby(lambda i: i % 2) groups data by whether the index is even.

Pandas includes support for time series, such as the ability to interpolate values and filter using a range of timestamps (e.g. data['1/1/2023':'2/2/2023'] will return all entries dated between January 1st and February 2nd). Pandas represents missing time series data using a special ''NaT'' (Not a Timestamp) object, instead of the NaN value it uses elsewhere.
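
The index alignment and missing-data handling described for Series can be illustrated with a minimal sketch. The labels and values below are made up for illustration; only the names series_1, series_2 and series_3 come from the text.

```python
import numpy as np
import pandas as pd

# Two Series whose indices overlap only partly (illustrative data).
series_1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
series_2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition pairs values by index label, not by position.
# Labels present in only one operand ("a" and "d") produce NaN.
series_3 = series_1 + series_2

# Built-in summary statistics skip NaN values by default.
total = series_3.sum()
average = series_3.mean()

print(series_3)
print(total, average)
```

Only the labels "b" and "c" exist in both inputs, so only those two positions hold numbers in series_3; the sum and mean are computed over those two values.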

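The dictionary analogy for DataFrames, and the SQL-like merge, might look as follows in practice. This is a sketch with hypothetical tables (people, depts) and column names chosen purely for illustration.

```python
import pandas as pd

# A DataFrame built from a dict mapping column names to Series.
# The Series share a default integer index (0, 1, 2).
people = pd.DataFrame({
    "name": pd.Series(["Ann", "Bob", "Cy"]),
    "dept_id": pd.Series([1, 2, 1]),
})
depts = pd.DataFrame({"dept_id": [1, 2], "dept": ["Research", "Sales"]})

# Many-to-one join: several people rows match a single depts row.
merged = people.merge(depts, on="dept_id", how="inner")

# Concatenation stacks the rows of two DataFrames with the same columns.
stacked = pd.concat([people, people], ignore_index=True)

print(merged)
print(len(stacked))
```

The how argument selects the join type (for example "inner", "left" or "outer"), much like the join keywords in SQL.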

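The selection, grouping and time series operations mentioned above can be tried on small made-up data. The column names col1 and col2 follow the text; the values, dates and the variable names are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"col1": [3, 8, 6, 1], "col2": ["x", "x", "y", "y"]})

# Boolean selection: keep rows where col1 exceeds 5.
large = df[df["col1"] > 5]

# Group col1 by the values of col2, then sum each group.
by_col2 = df["col1"].groupby(df["col2"]).sum()

# Group by a function of the index: even labels map to 0, odd labels to 1.
by_parity = df.groupby(lambda i: i % 2)["col1"].sum()

# A daily time series with one missing observation.
ts = pd.Series(
    [1.0, None, 3.0, 4.0],
    index=pd.date_range("2023-01-01", periods=4, freq="D"),
)

# Linear interpolation fills the gap between the neighbouring values.
filled = ts.interpolate()

# Slicing by timestamp strings includes both endpoints.
window = ts["2023-01-02":"2023-01-03"]

# A missing timestamp is stored as NaT rather than NaN.
stamps = pd.to_datetime(pd.Series(["2023-01-01", None]))

print(large, by_col2, by_parity, filled, window, stamps, sep="\n\n")
```
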
Indices

By default, a Pandas index is a series of integers ascending from 0, similar to the indices of Python lists. However, indices can use any NumPy data type, including floating point, timestamps, or strings. Pandas' syntax for mapping index values to relevant data is the same syntax Python uses to map dictionary keys to values. For example, if s is a Series, s['a'] will return the data point at index a. Unlike dictionary keys, index values are not guaranteed to be unique. If a Series uses the index value a for multiple data points, then s['a'] will instead return a new Series containing all matching values.

A DataFrame's column names are stored and implemented identically to an index. As such, a DataFrame can be thought of as having two indices: one column-based and one row-based. Because column names are stored as an index, they are also not required to be unique. If data is a Series, then data['a'] returns all values with the index value of a. However, if data is a DataFrame, then data['a'] returns all values in the column(s) named a. To avoid this ambiguity, Pandas supports the syntax data.loc['a'] as an alternative way to filter using the index. Pandas also supports the syntax data.iloc[n], which always takes an integer ''n'' and returns the ''nth'' value, counting from 0. This allows a user to act as though the index is an array-like sequence of integers, regardless of how it's actually defined.

Pandas supports hierarchical indices with multiple values per data point. An index with this structure, called a "MultiIndex", allows a single DataFrame to represent multiple dimensions, similar to a pivot table in Microsoft Excel. Each level of a MultiIndex can be given a unique name. In practice, data with more than 2 dimensions is often represented using DataFrames with hierarchical indices, instead of the higher-dimension ''Panel'' and ''Panel4D'' data structures.
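
The difference between bracket, loc and iloc lookups can be seen in a small sketch. The data is invented, and the row labels deliberately reuse the column names "a" and "b" to show the ambiguity the text describes.

```python
import pandas as pd

# A Series with a repeated index label (illustrative data).
s = pd.Series([10, 20, 30], index=["a", "b", "a"])
dupes = s["a"]    # label "a" occurs twice, so this is a Series of two values
single = s["b"]   # label "b" is unique, so this is a single value

# A DataFrame whose row labels happen to match its column names.
data = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["b", "a"])

column_a = data["a"]      # brackets on a DataFrame select the column named "a"
row_a = data.loc["a"]     # loc selects the row labelled "a"
first_row = data.iloc[0]  # iloc selects by position, here the row labelled "b"

print(dupes, single, column_a, row_a, first_row, sep="\n\n")
```
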

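A hierarchical index can be sketched as follows. The level names (region, year), the units column and the figures are hypothetical; the example only shows one way a MultiIndex lets a 2-dimensional DataFrame carry an extra dimension.

```python
import pandas as pd

# Two named index levels: region and year (illustrative data).
index = pd.MultiIndex.from_tuples(
    [("north", 2022), ("north", 2023), ("south", 2022), ("south", 2023)],
    names=["region", "year"],
)
sales = pd.DataFrame({"units": [5, 7, 3, 4]}, index=index)

# Selecting on the outer level returns the rows for that region,
# indexed by the remaining level (year).
north = sales.loc["north"]

# A tuple addresses one combination of both levels.
south_2023 = sales.loc[("south", 2023), "units"]

# Moving the year level into the columns gives a pivot-table-like layout.
wide = sales["units"].unstack("year")

print(north, south_2023, wide, sep="\n\n")
```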

Criticisms

Pandas has been criticized for its inefficiency. Pandas can require 5 to 10 times as much memory as the size of the underlying data, and the entire dataset must be loaded in RAM. The library does not optimize query plans or support parallel computing across multiple cores. Wes McKinney, the creator of Pandas, has recommended Apache Arrow as an alternative to address these performance concerns and other limitations.


See also

* matplotlib
* NumPy
* Dask
* SciPy
* Polars
* R (programming language)
* scikit-learn
* List of numerical analysis software

