pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. The name is also a play on the phrase "Python data analysis".
Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.
Library features
* Many built-in methods for fast data manipulation, made possible by vectorisation.
* DataFrame object for multivariate data manipulation with integrated indexing.
* Series object for univariate data manipulation with integrated indexing.
* Tools for reading and writing data between in-memory data structures and different file formats.
* Data alignment and integrated handling of missing data.
* Reshaping and pivoting of data sets.
* Label-based slicing, fancy indexing, and subsetting of large data sets.
* Data structure column insertion and deletion.
* Group by engine allowing split-apply-combine operations on data sets.
* Data set merging and joining.
* Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
* Time series functionality: date range generation and frequency conversions, moving window statistics, moving window linear regressions, date shifting and lagging.
* Data filtration.
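Several of the features listed above can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the data, column names, and variable names are all made up, and it touches only label alignment with missing data, group by, merging, pivoting, and basic time series operations.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on index labels; "x" has no partner in b, so it becomes NaN.
total = a + b

# A small hypothetical table of sales records.
sales = pd.DataFrame(
    {
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "region": ["north", "south", "north", "south"],
        "units": [5, 3, 7, 1],
    }
)

# Split-apply-combine: group rows by a key column and aggregate each group.
units_by_region = sales.groupby("region")["units"].sum()

# Merging: join two tables on a shared key column.
managers = pd.DataFrame({"region": ["north", "south"], "manager": ["Ada", "Bo"]})
merged = sales.merge(managers, on="region", how="left")

# Reshaping: pivot the long table into one row per quarter, one column per region.
wide = sales.pivot(index="quarter", columns="region", values="units")

# Time series: generate a daily date range, then lag and smooth the values.
days = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series([1.0, 2.0, 4.0, 8.0], index=days)
lagged = ts.shift(1)                        # each value moves one day later
rolling_mean = ts.rolling(window=2).mean()  # mean over a two-day moving window
```

Here `total` holds a missing value for the label `"x"` because only one of the two Series contains it, which is the alignment behaviour the list refers to.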
The library is highly optimized for performance, with critical code paths written in Cython or C.
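As a rough illustration of what vectorisation means in practice, the sketch below computes the same result once with a single whole-column expression and once with an explicit Python loop. The data and variable names are made up for illustration, and no timing is measured here.

```python
import pandas as pd

# Illustrative data: unit prices and quantities sold.
prices = pd.Series([10.0, 12.5, 8.0])
quantities = pd.Series([3, 2, 5])

# Vectorised: one expression operates on the whole columns at once.
revenue_vectorised = prices * quantities

# Equivalent explicit Python loop, one element at a time.
revenue_loop = pd.Series([p * q for p, q in zip(prices, quantities)])

# Both approaches produce the same values.
same = revenue_vectorised.equals(revenue_loop)
```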
DataFrames
Pandas is mainly used for data analysis and associated manipulation of tabular data in DataFrames. It can import data from various file formats, such as comma-separated values (CSV), JSON, Parquet, SQL database tables or queries, and Microsoft Excel. It supports data manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling. The development of pandas introduced into Python many features for working with DataFrames that were already established in the R programming language. The pandas library is built upon another library, NumPy, which is oriented to working efficiently with arrays rather than with DataFrames.
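The workflow described above (import, clean, select, hand off to NumPy) might look like the following minimal sketch. The CSV content, column names, and cleaning choices are hypothetical, and the data is held in memory so that the example does not depend on any file.

```python
import io

import pandas as pd

# Hypothetical CSV content, including a duplicate row and two missing values.
csv_text = """name,age,city
Alice,30,Oslo
Bob,,Lima
Alice,30,Oslo
Cara,25,
"""

# Importing: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Data cleaning: drop exact duplicate rows, then fill the missing values.
clean = df.drop_duplicates().reset_index(drop=True)
clean["age"] = clean["age"].fillna(clean["age"].mean())
clean["city"] = clean["city"].fillna("unknown")

# Selecting: a boolean mask for rows plus a label-based choice of columns.
over_26 = clean.loc[clean["age"] > 26, ["name", "age"]]

# The numeric column can be passed to NumPy as a plain array.
ages = clean["age"].to_numpy()
```

Filling a missing age with the column mean is only one possible cleaning choice; dropping the row with `dropna` would be an equally reasonable alternative depending on the analysis.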
History
Developer
Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he convinced management to allow him to open source the library.
Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library.
In 2015, pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.
Timeline:
* 2008: Development of pandas started
* 2009: pandas becomes open source
* 2012: First edition of Python for Data Analysis is published
* 2015: pandas becomes a NumFOCUS sponsored project
* 2018: First in-person core developer sprint
See also
* matplotlib
* NumPy
* Dask
* SciPy
* R (programming language)
* scikit-learn
* statsmodels
* List of numerical analysis software