Dplyr
   HOME

TheInfoList



OR:

dplyr is an
R package R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such a ...
whose set of functions are designed to enable dataframe (a spreadsheet-like
data structure In computer science, a data structure is a data organization and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the relationships amo ...
) manipulation in an intuitive, user-friendly way. It is one of the core packages of the popular
tidyverse The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that "share an underlying design philosophy, grammar, and data structures" of tidy data. Characteristic features of t ...
set of packages in the
R programming language R is a programming language for statistical computing and data visualization. It has been widely adopted in the fields of data mining, bioinformatics, data analysis, and data science. The core R language is extended by a large number of so ...
. Data analysts typically use dplyr in order to transform existing datasets into a format better suited for some particular type of analysis, or data visualization. For instance, someone seeking to analyze a large dataset may wish to only view a smaller subset of the data. Alternatively, a user may wish to rearrange the data in order to see the rows ranked by some numerical value, or even based on a combination of values from the original dataset. Functions within the dplyr package will allow a user to perform such tasks. dplyr was launched in 2014. On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."


The five core verbs

While dplyr actually includes several dozen functions that enable various forms of data manipulation, the package features five primary verbs or actions: * filter(), which is used to extract rows from a dataframe, based on conditions specified by a user; * select(), which is used to subset a dataframe by its columns; * arrange(), which is used to sort rows in a dataframe based on attributes held by particular columns; * mutate(), which is used to create new variables, by altering and/or combining values from existing columns; and * summarize(), also spelled summarise(), which is used to collapse values from a dataframe into a single summary.


Additional functions

In addition to its five main verbs, dplyr also includes several other functions that enable exploration and manipulation of dataframes. Included among these are: * count(), which is used to sum the number of unique observations that contain some particular value or categorical attribute; * rename(), which enables a user to alter the column names for variables, often to improve ease of use and intuitive understanding of a dataset; * slice_max(), which returns a data subset that contains the rows with the highest number of values for some particular variable; * slice_min(), which returns a data subset that contains the rows with the lowest number of values for some particular variable.


Built-in datasets

The dplyr package comes with five datasets. These are: band_instruments, band_instruments2, band_members, starwars, storms.


Copyright & license

The copyright to dplyr is held by
Posit PBC Posit PBC (or Posit) is an open-source data science software company. It is a public-benefit corporation founded by J. J. Allaire, creator of the programming language ColdFusion. Posit has no formal connection to the R Foundation, a not-for-pr ...
, formerly RStudio PBC. dplyr was originally released under a
GPL The GNU General Public Licenses (GNU GPL or simply GPL) are a series of widely used free software licenses, or ''copyleft'' licenses, that guarantee end users the freedom to run, study, share, or modify the software. The GPL was the first c ...
license, but in 2022, Posit changed the license terms for the package to the "more permissive"
MIT License The MIT License is a permissive software license originating at the Massachusetts Institute of Technology (MIT) in the late 1980s. As a permissive license, it puts very few restrictions on reuse and therefore has high license compatibility. Unl ...
. The main difference between the two types of license is that the MIT license allows subsequent re-use of code within
proprietary software Proprietary software is computer software, software that grants its creator, publisher, or other rightsholder or rightsholder partner a legal monopoly by modern copyright and intellectual property law to exclude the recipient from freely sharing t ...
, whereas a GPL license does not.


References

{{reflist Data analysis software Statistical software Free R (programming language) software