R is a programming language for statistical computing and data visualization. It has been adopted in the fields of data mining, bioinformatics, and data analysis. The core ''R'' language is augmented by a large number of extension packages, containing reusable code, documentation, and sample data.

''R'' software is open-source and free software. It is part of the GNU Project and available under the GNU General Public License. It is written primarily in C, Fortran, and R itself. Precompiled executables are provided for various operating systems.

As an interpreted language, ''R'' has a native command line interface. Moreover, multiple third-party graphical user interfaces are available, such as RStudio, an integrated development environment, and Jupyter, a notebook interface.

History

''R'' was started by professors Ross Ihaka and Robert Gentleman as a programming language to teach introductory statistics at the University of Auckland. The language was inspired by the S programming language, with most S programs able to run unaltered in ''R''. The language was also inspired by Scheme's lexical scoping, allowing for local variables.

The name of the language, ''R'', comes from being both a successor to the S language and the shared first letter of the authors' names, Ross and Robert. In August 1993, Ihaka and Gentleman posted a binary of ''R'' on StatLib, a data archive website. At the same time, they announced the posting on the ''s-news'' mailing list. On December 5, 1997, ''R'' became a GNU project when version 0.60 was released. On February 29, 2000, the first official 1.0 version was released.

Examples

Mean -- a measure of center

A numeric data set may have a central tendency, where some of the most typical data points reside. The arithmetic mean (average) is the most commonly used measure of central tendency. The ''mean'' of a numeric data set is the sum of the data points divided by the number of data points.

:Let $\bar{x}$ = the mean of a data set.
:Let $x$ = a list of data points.
:Let $n$ = the number of data points.
:$\bar{x} = \frac{\sum x}{n}$

Suppose a sample of four observations of Celsius temperature measurements were taken 12 hours apart.

:Let $x$ = a list of degrees Celsius data points of 30, 27, 31, 28.

This ''R'' computer program will output the mean of $x$:

# The c() function "combines" a list into a single object.
x <- c( 30, 27, 31, 28 )
sum <- sum( x )
length <- length( x )
mean <- sum / length
message( "Mean:" )
print( mean )

Note: ''R'' can have the same identifier represent both a function name and its result, because a call such as sum( x ) looks up only function objects. For more information, see scope.

Output:

Mean:
[1] 29

This ''R'' program will execute the native mean() function to output the mean of $x$:

x <- c( 30, 27, 31, 28 )
message( "Mean:" )
print( mean( x ) )

Output:

Mean:
[1] 29
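The note about one identifier naming both a function and its result can be checked directly. The following is a minimal sketch, not part of the original programs; the names maskedIsFunction and viaCall are introduced here purely for illustration.

```r
x <- c(30, 27, 31, 28)

# Bind the name 'mean' to a number; the base function is masked, not removed.
mean <- sum(x) / length(x)

# Used as a plain value, 'mean' is now the numeric result.
maskedIsFunction <- is.function(mean)
print(maskedIsFunction)

# Used in a call, R skips the numeric object and finds the function.
viaCall <- mean(x)
print(viaCall)

# Remove the numeric object so later code sees only the function again.
rm(mean)
```

Reusing function names this way works, but distinct names such as meanValue are usually easier to read.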

Standard Deviation -- a measure of dispersion

The standard deviation of a numeric data set indicates how far, on average, the data points lie from the mean. In a data set with little variation, each data point is close to the mean, so the ''standard deviation'' is small.

:Let $s$ = the ''standard deviation'' of a data set.
:Let $x$ = a list of data points.
:Let $n$ = the number of data points.
:$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Suppose a sample of four observations of Celsius temperature measurements were taken 12 hours apart.

:Let $x$ = a list of degrees Celsius data points of 30, 27, 31, 28.

This ''R'' program will output the ''standard deviation'' of $x$:

x <- c( 30, 27, 31, 28 )
distanceFromMean <- x - mean( x )
distanceFromMeanSquared <- distanceFromMean ^ 2
distanceFromMeanSquaredSum <- sum( distanceFromMeanSquared )
variance <- distanceFromMeanSquaredSum / ( length( x ) - 1 )
standardDeviation <- sqrt( variance )
message( "Standard deviation:" )
print( standardDeviation )

Output:

Standard deviation:
[1] 1.825742

This ''R'' program will execute the native sd() function to output the ''standard deviation'' of $x$:

x <- c( 30, 27, 31, 28 )
message( "Standard deviation:" )
print( sd( x ) )

Output:

Standard deviation:
[1] 1.825742
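As a cross-check, the manual calculation and the built-in functions can be compared in one short program. This is a sketch only; the names sampleSd, matchesBuiltin and matchesVariance are illustrative and do not appear in the programs above.

```r
x <- c(30, 27, 31, 28)
n <- length(x)

# Sample standard deviation written out with the n - 1 divisor.
sampleSd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Compare with sd(), and confirm that sd() squared equals var().
matchesBuiltin <- isTRUE(all.equal(sampleSd, sd(x)))
matchesVariance <- isTRUE(all.equal(sd(x)^2, var(x)))

print(matchesBuiltin)
print(matchesVariance)
```

The comparison uses all.equal() rather than ==, because floating point results that are mathematically equal may differ in the last few bits.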

Linear regression -- a measure of relation

A phenomenon may be the result of one or more observable events. For example, the phenomenon of skiing accidents may be the result of having snow in the mountains. A method to measure whether or not a numeric data set is related to another data set is linear regression.

:Let $x$ = a data set of independent data points, in which each point occurred at a specific time.
:Let $y$ = a data set of dependent data points, in which each point occurred at the same time as an independent data point.

If a linear relationship exists, then a scatter plot of the two data sets will show a pattern that resembles a straight line. If a straight line is fitted to the scatter plot such that the sum of the squared vertical distances from the points to the line is minimal, then the line is called a regression line. The equation of the ''regression line'' is called the regression equation.

The ''regression equation'' is a linear equation; therefore, it has a slope and y-intercept. The format of the ''regression equation'' is $\hat{y} = b_0 + b_1 x$.

:Let $b_1$ = the slope of the ''regression equation''.
:$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$
:Let $b_0$ = the y-intercept of the ''regression equation''.
:$b_0 = \bar{y} - b_1 \bar{x}$

Suppose a sample of four observations of Celsius temperature measurements were taken 12 hours apart. At the same time, the thermometer was switched to Fahrenheit temperature and another measurement was taken.

:Let $x$ = a list of degrees Celsius data points of 30, 27, 31, 28.
:Let $y$ = a list of degrees Fahrenheit data points of 86.0, 80.6, 87.8, 82.4.

This ''R'' program will output the ''slope'' and ''y-intercept'' of a linear relationship in which $y$ depends upon $x$:

x <- c( 30, 27, 31, 28 )
y <- c( 86.0, 80.6, 87.8, 82.4 )
# Build the numerator
independentDistanceFromMean <- x - mean( x )
sampledDependentDistanceFromMean <- y - mean( y )
independentDistanceTimesSampledDistance <- independentDistanceFromMean * sampledDependentDistanceFromMean
independentDistanceTimesSampledDistanceSum <- sum( independentDistanceTimesSampledDistance )
# Build the denominator
independentDistanceFromMeanSquared <- independentDistanceFromMean ^ 2
independentDistanceFromMeanSquaredSum <- sum( independentDistanceFromMeanSquared )
# Slope is rise over run
slope <- independentDistanceTimesSampledDistanceSum / independentDistanceFromMeanSquaredSum
yIntercept <- mean( y ) - slope * ( mean( x ) )
message( "Slope:" )
print( slope )
message( "Y-intercept:" )
print( yIntercept )

Output:

Slope:
[1] 1.8
Y-intercept:
[1] 32

This ''R'' program will execute the native functions to output the ''slope'' and ''y-intercept'':

x <- c( 30, 27, 31, 28 )
y <- c( 86.0, 80.6, 87.8, 82.4 )
# Execute lm() with Fahrenheit depends upon Celsius
linearModel <- lm( y ~ x )
# coefficients() returns a structure containing the slope and y intercept
coefficients <- coefficients( linearModel )
# Extract the slope from the structure
slope <- coefficients[["x"]]
# Extract the y intercept from the structure
yIntercept <- coefficients[["(Intercept)"]]
message( "Slope:" )
print( slope )
message( "Y-intercept:" )
print( yIntercept )

Output:

Slope:
[1] 1.8
Y-intercept:
[1] 32
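Once a model has been fitted, the regression equation can be applied to new independent values. The sketch below uses the standard predict() function on the same temperature data; the name newCelsius and the three chosen temperatures are assumptions made for illustration.

```r
x <- c(30, 27, 31, 28)
y <- c(86.0, 80.6, 87.8, 82.4)
linearModel <- lm(y ~ x)

# New Celsius readings; the column name must match the predictor name 'x'.
newCelsius <- data.frame(x = c(0, 25, 100))

# Apply the fitted regression equation to the new readings.
predictedFahrenheit <- predict(linearModel, newdata = newCelsius)
print(unname(predictedFahrenheit))
```

Because the sample follows the Celsius to Fahrenheit conversion, the predictions should come out very close to 32, 77 and 212. With real measurements, predicting far outside the sampled range (here 27 to 31 degrees) would be much less reliable.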

Coefficient of determination -- a percentage of variation

The coefficient of determination gives the proportion of the variation in the dependent variable that is explained by the independent variable. It always lies between 0 and 1. A value of 0 indicates no linear relationship between the two data sets, and a value near 1 indicates the ''regression equation'' is extremely useful for making predictions.

:Let $\hat{y}$ = the data set of predicted response data points when the independent data points are passed through the ''regression equation''.
:Let $r^2$ = the ''coefficient of determination'' in a relationship between an independent variable and a dependent variable.
:$r^2 = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}$

This ''R'' program will output the ''coefficient of determination'' of the linear relationship between $x$ and $y$:

x <- c( 30, 27, 31, 28 )
y <- c( 86.0, 80.6, 87.8, 82.4 )
# Build the numerator
linearModel <- lm( y ~ x )
coefficients <- coefficients( linearModel )
slope <- coefficients[["x"]]
yIntercept <- coefficients[["(Intercept)"]]
predictedResponse <- yIntercept + ( slope * x )
predictedResponseDistanceFromMean <- predictedResponse - mean( y )
predictedResponseDistanceFromMeanSquared <- predictedResponseDistanceFromMean ^ 2
predictedResponseDistanceFromMeanSquaredSum <- sum( predictedResponseDistanceFromMeanSquared )
# Build the denominator
sampledResponseDistanceFromMean <- y - mean( y )
sampledResponseDistanceFromMeanSquared <- sampledResponseDistanceFromMean ^ 2
sampledResponseDistanceFromMeanSquaredSum <- sum( sampledResponseDistanceFromMeanSquared )
coefficientOfDetermination <- predictedResponseDistanceFromMeanSquaredSum / sampledResponseDistanceFromMeanSquaredSum
message( "Coefficient of determination:" )
print( coefficientOfDetermination )

Output:

Coefficient of determination:
[1] 1

This ''R'' program will execute the native functions to output the ''coefficient of determination'':

x <- c( 30, 27, 31, 28 )
y <- c( 86.0, 80.6, 87.8, 82.4 )
linearModel <- lm( y ~ x )
summary <- summary( linearModel )
coefficientOfDetermination <- summary[["r.squared"]]
message( "Coefficient of determination:" )
print( coefficientOfDetermination )

Output:

Coefficient of determination:
[1] 1
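For a regression with a single independent variable, the coefficient of determination equals the square of the correlation coefficient. The sketch below checks this on a data set that is deliberately not perfectly linear (the integers 1 to 6 and their squares), so the value is below 1; the variable names are illustrative.

```r
# A quadratic relationship, so a straight line cannot fit perfectly.
x <- 1:6
y <- x^2

# Coefficient of determination as reported by the fitted model.
rSquaredFromModel <- summary(lm(y ~ x))[["r.squared"]]

# Square of the Pearson correlation between x and y.
rSquaredFromCorrelation <- cor(x, y)^2

print(round(rSquaredFromModel, 4))
print(isTRUE(all.equal(rSquaredFromModel, rSquaredFromCorrelation)))
```

The shortcut through cor() holds only for a single predictor; models with several independent variables need the summary() value.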

Single plot

This ''R'' program will display a scatter plot with an embedded ''regression line'' and ''regression equation'' illustrating the relationship between $x$ and $y$:

x <- c( 30, 27, 31, 28 )
y <- c( 86.0, 80.6, 87.8, 82.4 )
linearModel <- lm( y ~ x )
coefficients <- coefficients( linearModel )
slope <- coefficients[["x"]]
intercept <- coefficients[["(Intercept)"]]
# Execute paste() to build the regression equation string
regressionEquation <- paste( "y =", intercept, "+", slope, "x" )
# Display a scatter plot with the regression equation as its subtitle
plot( x, y, main = "Fahrenheit Depends Upon Celsius", sub = regressionEquation, xlab = "Degrees Celsius", ylab = "Degrees Fahrenheit" )
# Draw the regression line over the scatter plot
abline( linearModel )

Output: a scatter plot of the four points with the regression line drawn through them.

Multi plot

This ''R'' program will generate a multi-plot and a table of residuals.

# The independent variable is a list of numbers 1 to 6.
x <- 1:6
# The dependent variable is a list of each independent variable squared.
y <- x^2
# Executing the linear model on a quadratic equation will produce residuals.
linearModel <- lm(y ~ x)
# Display the residuals.
summary( linearModel )
# Create a 2 by 2 multi-plot.
par(mfrow = c(2, 2))
# Output the multi-plot.
plot( linearModel )

Output:

Residuals:
      1       2       3       4       5       6
 3.3333 -0.6667 -2.6667 -2.6667 -0.6667  3.3333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -9.3333     2.8441  -3.282 0.030453 *
x             7.0000     0.7303   9.585 0.000662 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.055 on 4 degrees of freedom
Multiple R-squared:  0.9583, Adjusted R-squared:  0.9478
F-statistic: 91.88 on 1 and 4 DF,  p-value: 0.000662

Plots from lm example
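The residuals shown in the summary can also be extracted as an ordinary numeric vector. The following sketch uses the standard residuals() function; the name modelResiduals is illustrative.

```r
x <- 1:6
y <- x^2
linearModel <- lm(y ~ x)

# Each residual is the sampled y minus the value predicted by the line.
modelResiduals <- residuals(linearModel)
print(round(unname(modelResiduals), 4))

# For a least-squares line with an intercept, the residuals sum to zero
# apart from floating point rounding.
print(abs(sum(modelResiduals)) < 1e-9)
```

The symmetric pattern, positive at both ends and negative in the middle, is typical when a straight line is fitted to curved data.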

Mandelbrot graphic

This Mandelbrot set example highlights the use of complex numbers. It models the first 20 iterations of the equation z = z² + c, where c represents different complex constants.

Install the package that provides the write.gif() function beforehand:

install.packages("caTools")

''R'' program:

library(caTools)

jet.colors <- colorRampPalette(
    c("green", "pink", "#007FFF", "cyan", "#7FFF7F",
      "white", "#FF7F00", "red", "#7F0000"))

dx <- 1500 # define width
dy <- 1400 # define height

C <- complex(
    real = rep(seq(-2.2, 1.0, length.out = dx), each = dy),
    imag = rep(seq(-1.2, 1.2, length.out = dy), dx))

# reshape as matrix of complex numbers
C <- matrix(C, dy, dx)

# initialize output 3D array
X <- array(0, c(dy, dx, 20))

Z <- 0

# loop with 20 iterations
for (k in 1:20) {
    # apply z = z^2 + c to every point of the matrix at once
    Z <- Z^2 + C
    # capture the results of this iteration as one animation frame
    X[, , k] <- exp(-abs(Z))
}

write.gif(X, "Mandelbrot.gif", col = jet.colors, delay = 100)

Output: Mandelbrot Creation Animation

Programming

''R'' is an interpreted language, so programmers typically access it through a command-line interpreter. If a programmer types 1+1 at the ''R'' command prompt and presses enter, the computer replies with 2. Programmers can also save ''R'' programs to a file and then run them with the batch interpreter Rscript.
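A saved program can read values passed on the command line through the base function commandArgs(). The sketch below assumes a hypothetical file name, mean_args.R, which does not come from the original text.

```r
# Contents of a hypothetical file named mean_args.R.
# Keep only the arguments that follow the script name.
args <- commandArgs(trailingOnly = TRUE)

# Command-line arguments arrive as character strings.
values <- as.numeric(args)

if (length(values) > 0) {
  message("Mean:")
  print(mean(values))
} else {
  message("Usage: Rscript mean_args.R 30 27 31 28")
}
```

Running Rscript mean_args.R 30 27 31 28 from a shell would pass the four numbers to the program; with no arguments it prints the usage line instead.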

Object

''R'' stores data inside an object. An object is assigned a name which the program uses to set and retrieve the data. An object is created by placing its name to the left of the symbol pair <-. To create an object named x and assign it the integer 82:

x <- 82L
print( x )

Output:

[1] 82

The [1] displayed before the number is a subscript. It shows the container for this integer is index one of an array.
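The role of the bracketed subscript becomes clearer with an object that holds many elements. In this sketch (the name longVector is illustrative), the printed output wraps over several lines, and each line starts with the index of its first element; the exact wrapping depends on the console width, so no output is shown.

```r
# Thirty consecutive integers, enough to wrap on a typical console.
longVector <- 101:130
print(longVector)

# Individual elements are retrieved with the same subscript notation.
print(longVector[1])
print(longVector[30])
```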

  Vector 

The most primitive ''R object'' is the vector. A ''vector'' is a one dimensional array of data. To assign multiple elements to the array, use the c() function to "combine" the elements. The elements must be the same data type. ''R'' lacks scalar data types, which are placeholders for a single word, usually an integer. Instead, a single integer is stored into the first element of an array. The single integer is retrieved using the index subscript of [1].

''R'' program to store and retrieve a single integer:

store <- 82L
retrieve <- store[1]
print( retrieve[1] )

Output:

[1] 82
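One way to see the same-type rule at work is to pass mixed types to c(). Rather than raising an error, ''R'' converts every element to a common type. This is a small sketch with illustrative variable names.

```r
sameType <- c(1L, 2L, 3L)
mixedNumeric <- c(1L, 2.5)
mixedText <- c(1L, "two", TRUE)

print(typeof(sameType))      # "integer"
print(typeof(mixedNumeric))  # "double": the integer is widened
print(typeof(mixedText))     # "character": everything becomes text
print(mixedText)
```

The silent conversion is convenient but can hide mistakes, such as a stray text value turning a whole numeric column into character strings.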


  Element-wise operation 

When an operation is applied to a vector, ''R'' will apply the operation to each element in the array. This is called an ''element-wise operation''.

This example creates the object named x and assigns it integers 1 through 3. The object is displayed and then again with one added to each element:

x <- 1:3
print( x )
print( x + 1 )

Output:

[1] 1 2 3
[1] 2 3 4

To achieve the many additions, ''R'' implements ''vector recycling''. The numeral one following the plus sign (+) is converted into an internal array of three ones. The + operation simultaneously loops through both arrays and performs the addition on each element pair. The results are stored into another internal array of three elements which is returned to the print() function.
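Recycling is not limited to a single number. When the two vectors differ in length, the shorter one is repeated until it matches the longer one, as this sketch with illustrative names shows.

```r
longer <- c(1, 2, 3, 4)
shorter <- c(10, 20)

# 'shorter' is reused as 10, 20, 10, 20 to line up with 'longer'.
recycledSum <- longer + shorter
print(recycledSum)  # 11 22 13 24
```

If the longer length is not a multiple of the shorter length, ''R'' still recycles but issues a warning.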

  Numeric vector 

A ''numeric'' vector is used to store integers and floating point numbers. The primary characteristic of a ''numeric'' vector is the ability to perform arithmetic on the elements.

  Integer vector
By default, integers (numbers without a decimal point) are stored as floating point. To force integer memory allocation, append an L to the number. As an exception, the sequence operator : will, by default, allocate integer memory.
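The distinction matters in arithmetic, because mixing the two kinds of number quietly produces floating point results. The following sketch uses only base functions.

```r
x <- 1:3

print(typeof(x))        # "integer": the sequence operator allocates integers
print(typeof(x + 1L))   # "integer": both operands are integers
print(typeof(x + 1))    # "double": the plain 1 is floating point

# The values compare as equal, but the objects are not identical in type.
print(82L == 82)            # TRUE
print(identical(82L, 82))   # FALSE
```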

''R'' program:

x <- 82L
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] 82
Data type:
[1] "integer"


''R'' program:

x <- c( 1L, 2L, 3L )
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] 1 2 3
Data type:
[1] "integer"


''R'' program:

x <- 1:3
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] 1 2 3
Data type:
[1] "integer"


  Double vector
A ''double vector'' stores real numbers, which are also known as floating point numbers. The memory allocation for a floating point number is double precision. Double precision is the default memory allocation for numbers with or without a decimal point.
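Double precision represents most decimal fractions only approximately, which can surprise programmers who test for exact equality. This sketch shows the effect and one common remedy, the base function all.equal(), which compares within a small tolerance.

```r
total <- 0.1 + 0.2

print(typeof(total))                  # "double"
print(total == 0.3)                   # FALSE: the stored values differ slightly
print(isTRUE(all.equal(total, 0.3)))  # TRUE: equal within tolerance
```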

''R'' program:

x <- 82
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] 82
Data type:
[1] "double"


''R'' program:

x <- c( 1, 2, 3 )
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] 1 2 3
Data type:
[1] "double"


  Logical vector 

A ''logical vector'' stores binary data: either TRUE or FALSE. The purpose of this vector is to store the result of a comparison. A logical datum is expressed as either TRUE, T, FALSE, or F. The capital letters are required, and no quotes surround the constants.

''R'' program:

x <- 3 < 4
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] TRUE
Data type:
[1] "logical"


Two vectors may be compared using the following ''logical operators'':
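The comparison operators in base ''R'' are ==, !=, <, <=, > and >=. The brief sketch below applies each of them element by element to two numeric vectors; the names a and b are illustrative only.

```r
a <- c( 1, 5, 3 )
b <- c( 2, 5, 1 )

# Each comparison returns a logical vector of the same length.
equalTo <- a == b
notEqualTo <- a != b
lessThan <- a < b
lessOrEqual <- a <= b
greaterThan <- a > b
greaterOrEqual <- a >= b

print( lessThan )
print( equalTo )
```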




  Character vector 

A ''character vector'' stores character strings. Strings are created by surrounding text in double (or single) quotation marks.

''R'' program:

x <- "hello world"
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] "hello world"
Data type:
[1] "character"


''R'' program:

x <- c( "hello", "world" )
print( x )
message( "Data type:" )
typeof( x )


Output:

[1] "hello" "world"
Data type:
[1] "character"


  Factor 

A ''factor'' is a vector that stores a categorical variable. The factor function converts a vector of text strings into an enumerated type, which is stored as an integer.
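The integer storage can be inspected directly. This is a small sketch using base ''R'''s levels() and as.integer(); the variable names are illustrative.

```r
sizes <-
    factor(
        c( "light", "none", "light", "medium" ),
        levels = c( "none", "light", "medium" ) )

# The level names, in the order given to factor().
levelNames <- levels( sizes )

# The underlying integer codes: each is an index into the levels.
codes <- as.integer( sizes )

print( typeof( sizes ) )
print( codes )
```

Here "light" is stored as 2 because it is the second level, regardless of where it appears in the data.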

In experimental design, a ''factor'' is an independent variable to test (an input) in a controlled experiment. A controlled experiment is used to establish ''causation'', not just ''association''. For example, one could notice that an increase in hot chocolate sales is associated with an increase in skiing accidents, yet that association alone does not show that one causes the other.

An experimental unit is an item that an experiment is being performed upon. If the ''experimental unit'' is a person, then it is known as a ''subject''. A response variable (also known as a ''dependent variable'') is a possible outcome from an experiment. A ''factor level'' is a characteristic of a factor. A ''treatment'' is an environment consisting of a combination of one level (characteristic) from each of the input factors. A replicate is the execution of a ''treatment'' on an ''experimental unit'' and yields ''response variables''.

This example builds two ''R'' programs to model an experiment to increase the growth of a species of cactus. Two ''factors'' are tested:
# water levels of none, light, or medium
# superabsorbent polymer levels of not used or used

''R'' program to set up the design:

# Step 1 is to establish the levels of a factor.
# Vector of the water levels:
waterLevel <-
    c(
        "none",
        "light",
        "medium" )

# Step 2 is to create the factor.
# Vector of the water factor:
waterFactor <-
    factor(
        # Although a subset is possible, use all of the levels.
        waterLevel,
        levels = waterLevel )

# Vector of the polymer levels:
polymerLevel <-
    c(
        "notUsed",
        "used" )

# Vector of the polymer factor:
polymerFactor <-
    factor(
        polymerLevel,
        levels = polymerLevel )

# The treatments are the Cartesian product of both factors.
treatmentCartesianProduct <-
    expand.grid(
        waterFactor,
        polymerFactor )

message( "Water factor:" )
print( waterFactor )

message( "\nPolymer factor:" )
print( polymerFactor )

message( "\nTreatment Cartesian product:" )
print( treatmentCartesianProduct )



Output:

Water factor:
[1] none   light  medium
Levels: none light medium

Polymer factor:
[1] notUsed used   
Levels: notUsed used

Treatment Cartesian product:
    Var1    Var2
1   none notUsed
2  light notUsed
3 medium notUsed
4   none    used
5  light    used
6 medium    used


''R'' program to store and display the results:

experimentalUnit <- c( "cactus1", "cactus2", "cactus3" )

replicateWater <- c( "none", "light", "medium" )
replicatePolymer <- c( "notUsed", "used", "notUsed" )
replicateInches <- c( 82L, 83L, 84L )

response <-
    data.frame(
        experimentalUnit,
        replicateWater,
        replicatePolymer,
        replicateInches )

print( response )


Output:

  experimentalUnit replicateWater replicatePolymer replicateInches
1          cactus1           none          notUsed              82
2          cactus2          light             used              83
3          cactus3         medium          notUsed              84


  Data frame 

A ''data frame'' stores a two-dimensional table. Each column is a vector, and each row aligns the elements found at the same position in those vectors. It is the most useful structure for data analysis. ''Data frames'' are created using the data.frame() function. The input is a list of vectors (of any data type). Each vector becomes a column in a table. The elements in each vector are aligned to form the rows in the table.
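As a rough illustration of this layout, the sketch below builds a small data frame and inspects its dimensions and the storage type of each column. The data frame and its column names are made up for the example; stringsAsFactors = FALSE is passed explicitly so the result does not depend on the ''R'' version.

```r
measurements <-
    data.frame(
        id = c( 1L, 2L, 3L ),
        label = c( "a", "b", "c" ),
        passed = c( TRUE, FALSE, TRUE ),
        stringsAsFactors = FALSE )

rowCount <- nrow( measurements )
columnCount <- ncol( measurements )

# Each column keeps its own storage type.
columnTypes <- sapply( measurements, typeof )

print( columnTypes )
```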

''R'' program:

integer <- c( 82L, 83L )
string <- c( "hello", "world" )
data.frame <- data.frame( integer, string )
print( data.frame )
message( "Data type:" )
class( data.frame )


Output:

  integer string
1      82  hello
2      83  world
Data type:
[1] "data.frame"


''Data frames'' can be deconstructed by providing a vector's name between double brackets. 
This returns the original vector. Each element in the returned vector can be accessed by its index number.

''R'' program to extract the word "world". It is stored in the second element of the "string" vector:

integer <- c( 82L, 83L )
string <- c( "hello", "world" )
data.frame <- data.frame( integer, string )
vector <- data.frame[[ "string" ]]
print( vector[ 2 ] )
message( "Data type:" )
typeof( vector )


Output:

[1] "world"
Data type:
[1] "character"


  Vectorized coding 

''Vectorized coding'' is a method to produce quality ''R'' computer programs that take advantage of ''R'''s strengths. The ''R'' language is designed to be fast at logical testing, subsetting, and element-wise execution. On the other hand, an explicit for loop in ''R'' is comparatively slow. For example, ''R'' can search-and-replace faster using ''logical vectors'' than by using a for loop.

  For loop 

A for loop repeats a block of code for a specific number of iterations.

Example to search-and-replace using a for loop:

vector <- c( "one", "two", "three" )

for ( i in 1:length( vector ) )
{
    if ( vector[ i ] == "one" )
    {
        vector[ i ] <- "1"
    }
}

message( "Replaced vector:" )
print( vector )


Output:

Replaced vector:
[1] "1"     "two"   "three"


  Subsetting 

''R'''s syntax allows for a logical vector to be used as an index to a vector. This method is called ''subsetting''.

''R'' example:

vector <- c( "one", "two", "three" )

print( vector[ c( TRUE, FALSE, TRUE ) ] )


Output:

[1] "one"   "three"


  Change a value using an index number 

''R'' allows for the assignment operator <- to overwrite an existing value in a vector by using an index number. 

''R'' example:

vector <- c( "one", "two", "three" )
vector[ 1 ] <- "1"

print( vector )


Output:

[1] "1"     "two"   "three"


  Change a value using subsetting 

''R'' also allows for the assignment operator <- to overwrite an existing value in a vector by using a ''logical vector''.

''R'' example:

vector <- c( "one", "two", "three" )
vector[ c( TRUE, FALSE, FALSE ) ] <- "1"

print( vector )


Output:

[1] "1"     "two"   "three"


  Vectorized code to search-and-replace 

Because a ''logical vector'' may be used as an index, and because the ''logical operator'' returns a vector, a search-and-replace can take place without a for loop.
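The two ingredients can be sketched separately: the comparison produces a logical vector, and that vector then selects the elements to overwrite. The sketch also checks that the result matches an explicit loop. All variable names here are illustrative.

```r
words <- c( "one", "two", "three", "one" )

# Step 1: the comparison is evaluated for every element at once.
matches <- words == "one"

# Step 2: the logical vector selects which elements to overwrite.
vectorized <- words
vectorized[ matches ] <- "1"

# The same replacement written as an explicit loop.
looped <- words
for ( i in seq_along( looped ) )
{
    if ( looped[ i ] == "one" )
    {
        looped[ i ] <- "1"
    }
}

print( matches )
print( identical( vectorized, looped ) )
```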

''R'' example:

vector <- c( "one", "two", "three" )
vector[ vector == "one" ] <- "1"

print( vector )


Output:

[1] "1"     "two"   "three"


  Functions 

A function is an object that stores computer code instead of data. The purpose of storing code inside a function is to be able to reuse it in another context.

  Native functions 

''R'' comes with over 1,000 native functions to perform common tasks. To execute a function:
# type in the function's name
# type in an open parenthesis (
# type in the data to be processed
# type in a close parenthesis )

This example rolls a  die one time. The native function's name is sample. The data to be processed are:
# a  numeric integer vector from one to six
# the size parameter instructs sample to execute the roll one time


sample( 1:6, size=1 )


Possible output:

[1] 6


The ''R''  interpreter provides a help screen for each native function. The help screen is displayed after typing in a question mark followed by the function's name:

?sample


Partial output:

Description:
     ‘sample’ takes a sample of the specified size from the elements of
     ‘x’ using either with or without replacement.

Usage:
     sample(x, size, replace = FALSE, prob = NULL)


 = Function parameters =
The sample function has four input parameters. ''Input parameters'' are pieces of information that control the function's behavior. ''Input parameters'' may be communicated to the function in a combination of three ways:
# by position separated with commas
#  by name separated with commas and the equal sign
# left empty

For example, each of these calls to sample will roll a die one time:

sample( 1:6, 1, F, NULL )
sample( 1:6, 1 )
sample( 1:6, size=1 )
sample( size=1, x=1:6 )


Every ''input parameter'' has a name. If a function has many parameters, setting name = data will make the source code more readable. If the parameter's name is omitted, ''R'' will match the data in the position order. Usually, parameters that are rarely used will have a default value and may be omitted.
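The same three calling styles apply to programmer-written functions. The sketch below defines rollDice, a hypothetical helper that is not part of base ''R'', with a default value for each parameter.

```r
# rollDice is a hypothetical helper, not part of base R.
# Both parameters have default values, so either may be omitted.
rollDice <- function( count = 1, sides = 6 )
{
    sample( seq_len( sides ), size = count, replace = TRUE )
}

byDefault <- rollDice()                        # one six-sided die
byPosition <- rollDice( 3, 20 )                # three 20-sided dice
byName <- rollDice( sides = 20, count = 3 )    # same request, by name

print( length( byName ) )
```

When names are given, the order of the arguments no longer matters, which is why the last two calls make the same request.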

 = Data coupling =
The output from a function may become the input to another function. This is the basis for  data coupling.

This example executes the function sample and sends the result to the function sum. It simulates the roll of two dice and adds them up.

sum( sample( 1:6, size=2, replace=TRUE ) )


Possible output:

[1] 7
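A nested call is equivalent to storing the intermediate result in a variable first. The sketch below compares the two forms; it fixes the random seed with set.seed() (the seed value is arbitrary) so that both draw the same rolls.

```r
# Fix the random seed so both versions draw the same two rolls.
set.seed( 1 )
nested <- sum( sample( 1:6, size = 2, replace = TRUE ) )

set.seed( 1 )
rolls <- sample( 1:6, size = 2, replace = TRUE )
stepwise <- sum( rolls )

print( identical( nested, stepwise ) )
```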


 = Functions as parameters =
A function has parameters typically to input data. Alternatively, a function (A) can use a parameter to input another function (B). Function (A) will assume responsibility to execute function (B).

For example, the function replicate has an input parameter that is a placeholder for an expression, such as a call to another function. This example will execute replicate once, and replicate will execute sample five times. It will simulate rolling a die five times:

replicate( 5, sample( 1:6, size=1 ) )


Possible output:

[1] 2 4 1 4 5
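For comparison, base ''R'''s sapply() accepts a function object itself as an argument and calls it once per element. A small sketch, with illustrative variable names:

```r
# sqrt is passed as an object; sapply() decides when to call it.
roots <- sapply( c( 1, 4, 9 ), sqrt )

# An anonymous function can be passed in the same way.
rolls <- sapply( 1:5, function( i ) sample( 1:6, size = 1 ) )

print( roots )
```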


 = Uniform distribution =
Because each face of a die is equally likely to appear on top, rolling a die many times generates the uniform distribution. This example displays a histogram of a die rolled 10,000 times:

hist( replicate( 10000, sample( 1:6, size=1 ) ) )


The output is likely to have a flat top:
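The flatness can also be checked numerically. The sketch below draws the 10,000 rolls in a single call to sample() and tabulates the share of each face; the seed value is arbitrary and only makes the run reproducible.

```r
set.seed( 2024 )    # arbitrary seed, only for reproducibility
rolls <- sample( 1:6, size = 10000, replace = TRUE )

# Share of rolls landing on each face; each should be near 1/6.
proportions <- table( rolls ) / length( rolls )

print( round( proportions, 3 ) )
```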


 = Central limit theorem =
A numeric data set may or may not have a central tendency. Nonetheless, a data set of the arithmetic means of many samples will have a central tendency, converging to the population's mean. The arithmetic mean of a sample is called the ''sample mean''. The central limit theorem states that, as the sample size grows, the distribution of the ''sample mean'' ( $\bar{x}$ ) approaches a normal distribution, regardless of the distribution of the variable under consideration ( $x$ ), provided that its variance is finite. A common rule of thumb treats a sample size of 30 or more as large enough for the approximation. A histogram displaying the frequency of the sample averages will show that the distribution of the ''sample mean'' resembles a ''bell-shaped curve''.

For example, rolling one die many times generates the uniform distribution. Nonetheless, rolling 30 dice and calculating each average ( $\bar{x}$ ) over and over again generates an approximately normal distribution.

''R'' program to roll 30 dice 10,000 times and plot the frequency of averages:

hist(
    replicate(
        10000,
        mean(
            sample(
                1:6,
                size=30,
                replace=T ) ) ) )


The output is likely to have a bell shape:
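The simulated averages can also be compared with theory. A fair die has mean 3.5 and variance 35/12, so the mean of 30 dice should have a standard deviation of about $\sqrt{35/12}/\sqrt{30} \approx 0.31$. A minimal sketch, with an arbitrary seed and illustrative variable names:

```r
set.seed( 30 )    # arbitrary seed, only for reproducibility

sampleMeans <-
    replicate(
        10000,
        mean( sample( 1:6, size = 30, replace = TRUE ) ) )

# Theory: a fair die has mean 3.5 and variance 35/12.
expectedMean <- 3.5
expectedSd <- sqrt( 35 / 12 ) / sqrt( 30 )

print( c( mean( sampleMeans ), expectedMean ) )
print( c( sd( sampleMeans ), expectedSd ) )
```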


  Programmer created functions 

To create a function object, execute the function() statement and assign the result to a name. A function receives input both from global variables and input parameters (often called arguments). Objects created within the function body remain local to the function.

''R'' program to create a function:

# The input parameters are x and y.
# The return value is a numeric double vector.
f <- function(x, y)
{
    # Reconstructed body (an assumption): consistent with f(1, 2) returning 8.
    x * 2 + y * 3
}


Usage output:

> f(1, 2)
[1] 8


Function arguments are passed in by  value.
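A short sketch of what passing by value means in practice: changing a parameter inside a function leaves the caller's object untouched. The helper appendZero is hypothetical and exists only for this illustration.

```r
# appendZero is a hypothetical helper used only for illustration.
appendZero <- function( v )
{
    v[ length( v ) + 1 ] <- 0    # modifies the local copy only
    v
}

original <- c( 1, 2, 3 )
extended <- appendZero( original )

print( original )
print( extended )
```

After the call, original still holds three elements, while extended holds four.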

 = If statements =
''R'' program illustrating  if statements:

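minimum <- function( a, b )
{
    if ( a < b ) a else b
}

maximum <- function( a, b )
{
    if ( a > b ) a else b
}

# The range is the largest value minus the smallest value.
range <- function( a, b, c )
{
    maximum( a, maximum( b, c ) ) - minimum( a, minimum( b, c ) )
}

range( 10, 4, 7 )


Output:

[1] 6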


 = Generic functions =
''R'' supports generic functions. They act differently depending on the class of the argument passed in. The process is to dispatch the method specific to the class. A common implementation is ''R'''s print() function. It can print almost every class of object. For example, print(objectName).
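Dispatch can be sketched with a programmer-defined class. In the example below, "cactus" is a hypothetical class name; defining a function named print.cactus is enough for print() to select it for objects of that class.

```r
# "cactus" is a hypothetical class used only for illustration.
plant <-
    structure(
        list( name = "cactus1", inches = 82L ),
        class = "cactus" )

# print() dispatches to print.cactus() for objects of class "cactus".
print.cactus <- function( x, ... )
{
    cat( x$name, "is", x$inches, "inches tall\n" )
    invisible( x )
}

print( plant )
```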

  Normal distribution 

If a numeric data set has a central tendency, it also may have a symmetric-looking histogram, a shape that resembles a bell. If a data set has an approximately bell-shaped histogram, it is said to have a ''normal distribution''.

  Chest size of Scottish militiamen data set 

In 1817, a Scottish army contractor measured the chest sizes of 5,732 members of a militia unit. The frequency of each size was:





  Create a comma-separated values file 

''R'' has the write.csv function to convert a data frame into a CSV file.

''R'' program to create chestsize.csv:

chestsize <-
c( 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48 )

frequency <-
c( 3, 19, 81, 189, 409, 753, 1062, 1082, 935, 646, 313, 168, 50, 18, 3, 1 )

dataFrame <- data.frame( chestsize, frequency )

write.csv(
    dataFrame,
    file="chestsize.csv",
    # By default, write.csv() creates the first column as the row number.
    row.names = FALSE )


  Import a data set 

The first step in data science is to import a data set.

''R'' program to import chestsize.csv into a data frame:

dataFrame <- read.csv( "chestsize.csv" )
print( dataFrame )


Output:

   chestsize frequency
1         33         3
2         34        19
3         35        81
4         36       189
5         37       409
6         38       753
7         39      1062
8         40      1082
9         41       935
10        42       646
11        43       313
12        44       168
13        45        50
14        46        18
15        47         3
16        48         1
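The write and read steps can be checked together as a round trip. The sketch below uses a shortened copy of the data and a temporary file (via base ''R'''s tempfile()) so that nothing in the working directory is overwritten; the variable names are illustrative.

```r
original <-
    data.frame(
        chestsize = c( 33, 34, 35 ),
        frequency = c( 3, 19, 81 ) )

# A temporary file avoids overwriting anything in the working directory.
path <- tempfile( fileext = ".csv" )
write.csv( original, file = path, row.names = FALSE )

imported <- read.csv( path )
unlink( path )

# Whole numbers are read back as integers rather than doubles.
print( sapply( imported, typeof ) )
```

The values survive the round trip, although read.csv() infers the integer type for columns that contain only whole numbers.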


  Transform a data set 

The second step in ''data science'' is to transform the data into a format that the functions expect. The chest-size data set is summarized to frequency; however, ''R'''s ''normal distribution'' functions require a numeric double vector.

''R'' function to convert a data frame summarized to frequency into a vector:

# Filename: frequencyDataFrameToVector.R

frequencyDataFrameToVector <-
    function(
        dataFrame,
        dataColumnName,
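        frequencyColumnName = "frequency" )
{
    # Repeat each data value as many times as its frequency,
    # and return the result as a double vector.
    as.numeric(
        rep(
            dataFrame[[ dataColumnName ]],
            times = dataFrame[[ frequencyColumnName ]] ) )
}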

''R'' has the source() function to include another ''R'' source file into the current program.

''R'' program to load and display a summary of the 5,732-member data set:

source( "frequencyDataFrameToVector.R" )

dataFrame <- read.csv( "chestsize.csv" )

chestSizeVector <-
    frequencyDataFrameToVector(
        dataFrame,
        "chestsize" )

message( "Head:" )
head( chestSizeVector )

message( "\nTail:" )
tail( chestSizeVector )

message( "\nCount:" )
length( chestSizeVector )

message( "\nMean:" )
mean( chestSizeVector )

message( "\nStandard deviation:" )
sd( chestSizeVector )


Output:

Head:
[1] 33 33 33 34 34 34

Tail:
[1] 46 46 47 47 47 48

Count:
[1] 5732

Mean:
[1] 39.84892

Standard deviation:
[1] 2.073386
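The same summary statistics can be computed directly from the frequency table without expanding it into a vector. The sketch below uses the frequency-weighted mean $\bar{x} = \sum f_i x_i / n$ and the sample standard deviation $\sqrt{\sum f_i (x_i - \bar{x})^2 / (n - 1)}$, where $n = \sum f_i$; the variable names are illustrative.

```r
chestsize <- 33:48
frequency <-
    c( 3, 19, 81, 189, 409, 753, 1062, 1082,
       935, 646, 313, 168, 50, 18, 3, 1 )

count <- sum( frequency )

# Frequency-weighted mean.
average <- sum( chestsize * frequency ) / count

# Sample standard deviation (divisor count - 1), matching sd().
deviation <-
    sqrt( sum( frequency * ( chestsize - average )^2 ) / ( count - 1 ) )

print( c( count, average, deviation ) )
```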


  Visualize a data set 

The third step in ''data science'' is to visualize the data set. If a histogram of a data set resembles a bell shape, then it is approximately ''normally distributed''.

''R'' program to display a histogram of the data set:

source( "frequencyDataFrameToVector.R" )

dataFrame <- read.csv( "chestsize.csv" )

chestSizeVector <-
    frequencyDataFrameToVector(
        dataFrame,
        "chestsize" )

hist( chestSizeVector )


Output:


  Standardized variable 

Any variable ( $x_i$ ) in a data set can be converted into a standardized variable ( $z_i$ ). The ''standardized variable'' is also known as a z-score. To calculate the z-score, subtract the ''mean'' and divide by the ''standard deviation''.

:Let  $x$  = a set of data points.
:Let  $\bar{x}$  = the mean of the data set.
:Let  $\sigma$  = the standard deviation of the data set.
:Let  $x_i$  = the  $i^{th}$  element in the set.
:Let  $z_i$  = the z-score of the  $i^{th}$  element in the set.
: $z_i = \frac{x_i - \bar{x}}{\sigma}$ 

''R'' function to convert a measurement to a z-score:

# Filename: zScore.R

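zScore <- function( measurement, mean, standardDeviation )
{
    # Subtract the mean, then divide by the standard deviation.
    ( measurement - mean ) / standardDeviation
}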

''R'' program to convert a chest size measurement of 38 to a z-score:

source( "zScore.R" )

print( zScore( 38, 39.84892, 2.073386 ) )


Output:

[1] -0.8917394


''R'' program to convert a chest size measurement of 42 to a z-score:

source( "zScore.R" )

print( zScore( 42, 39.84892, 2.073386 ) )


Output:

[1] 1.037472
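Base ''R'' also provides scale(), which by default centers a numeric vector on its mean and divides by its standard deviation. The sketch below checks that it agrees with the formula above on a small made-up vector.

```r
measurements <- c( 36, 38, 40, 42, 44 )

manual <- ( measurements - mean( measurements ) ) / sd( measurements )

# scale() returns a one-column matrix; as.vector() drops that structure.
builtIn <- as.vector( scale( measurements ) )

print( all.equal( manual, builtIn ) )
```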


  Standardized data set 

A ''standardized data set'' is a data set in which each member of an input data set was run through the zScore function.

''R'' function to convert a  numeric vector into a z-score vector:

# Filename: zScoreVector.R

source( "zScore.R" )

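zScoreVector <- function( vector )
{
    # Arithmetic in R is element-wise, so one call converts every element.
    zScore( vector, mean( vector ), sd( vector ) )
}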

  Standardized chest size data set 

''R'' program to standardize the chest size data set:

source( "frequencyDataFrameToVector.R" )
source( "zScoreVector.R" )

dataFrame <- read.csv( "chestsize.csv" )

chestSizeVector <-
    frequencyDataFrameToVector(
        dataFrame,
        dataColumnName = "chestsize" )

zScoreVector <-
    zScoreVector(
        chestSizeVector )

message( "Head:" )
head( zScoreVector )

message( "\nTail:" )
tail( zScoreVector )

message( "\nCount:" )
length( zScoreVector )

message( "\nMean:" )
round( mean( zScoreVector ) )

message( "\nStandard deviation:" )
sd( zScoreVector )

hist( zScoreVector )


Output:

Head:
[1] -3.303253 -3.303253 -3.303253 -2.820950 -2.820950 -2.820950

Tail:
[1] 2.966684 2.966684 3.448987 3.448987 3.448987 3.931290

Count:
[1] 5732

Mean:
[1] 0

Standard deviation:
[1] 1




  Standard normal curve 


A histogram of a ''normally distributed'' data set that is converted to its ''standardized data set'' also resembles a bell-shaped curve. The curve is called the ''standard normal curve'' or the ''z-curve''. The four basic properties of the ''z-curve'' are:

# The total area under the curve is 1.
# The curve extends indefinitely to the left and right. It never touches the horizontal axis.
# The curve is symmetric and centered at 0.
# Almost all of the area under the curve lies between -3 and 3.
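These properties can be checked numerically with base ''R'''s dnorm() (the curve's height), pnorm() (the area to the left of a z-score), and integrate(). A minimal sketch with illustrative variable names:

```r
# dnorm() is the density of the standard normal curve.
totalArea <- integrate( dnorm, lower = -Inf, upper = Inf )$value

# pnorm() gives the area to the left of a z-score.
areaWithinThree <- pnorm( 3 ) - pnorm( -3 )

# Symmetry about 0: the height at -z equals the height at z.
symmetric <- isTRUE( all.equal( dnorm( -1.5 ), dnorm( 1.5 ) ) )

print( totalArea )
print( areaWithinThree )
```

The area between -3 and 3 comes to roughly 0.997, which is the sense in which almost all of the area lies in that interval.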

  Area under the standard normal curve 

The probability that a future measurement will fall within a designated range is equal to the area under the ''standard normal curve'' between the ''z-scores'' of the range's two endpoints.

For example, suppose the Scottish militia's quartermaster wanted to stock up on uniforms. What is the probability that the next recruit will need a size between 38 and 42?

''R'' program:

library( tigerstats )
source( "frequencyDataFrameToVector.R" )
source( "zScore.R" )

dataFrame <- read.csv( "chestsize.csv" )

chestSizeVector <-
    frequencyDataFrameToVector(
        dataFrame,
        dataColumnName = "chestsize" )

zScore38 <-
    zScore( 38, mean( chestSizeVector ), sd( chestSizeVector ) )

zScore42 <-
    zScore( 42, mean( chestSizeVector ), sd( chestSizeVector ) )

areaLeft38 <- tigerstats::pnormGC( zScore38 )
areaLeft42 <- tigerstats::pnormGC( zScore42 )

areaBetween <- areaLeft42 - areaLeft38

message( "Probability:" )
print( areaBetween )


Output:

Probability:
[1] 0.6639757


The pnormGC() function can compute the probability over a range without first calculating the z-scores.

''R'' program:

library( tigerstats )
source( "frequencyDataFrameToVector.R" )

dataFrame <- read.csv( "chestsize.csv" )

chestSizeVector <-
    frequencyDataFrameToVector(
        dataFrame,
        dataColumnName = "chestsize" )

areaBetween <-
    tigerstats::pnormGC(
        c( 38, 42 ),
        mean = mean( chestSizeVector ),
        sd = sd( chestSizeVector ),
        region = "between",
        graph = TRUE )

message( "Probability:" )
print( areaBetween )



Output:

Probability:
[1] 0.6639757
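
The same area can also be computed with base R's pnorm() function, without the ''tigerstats'' package. The sketch below is illustrative only: chestMean and chestSd are placeholder values assumed for the example, not values computed from chestsize.csv, so the result is not expected to reproduce the output above exactly.

```r
# Illustrative placeholder values; not computed from chestsize.csv.
chestMean <- 39.85
chestSd <- 2.07

# Route 1: convert each bound to a z-score, then use the standard
# normal curve (pnorm() defaults to mean = 0 and sd = 1).
z38 <- ( 38 - chestMean ) / chestSd
z42 <- ( 42 - chestMean ) / chestSd
areaViaZ <- pnorm( z42 ) - pnorm( z38 )

# Route 2: pass the mean and standard deviation to pnorm() directly.
areaDirect <-
    pnorm( 42, mean = chestMean, sd = chestSd ) -
    pnorm( 38, mean = chestMean, sd = chestSd )

print( all.equal( areaViaZ, areaDirect ) )
```

Both routes give the same area, which mirrors the two pnormGC() programs above.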




  Packages 



R packages are collections of functions, documentation, and data that expand R. For example, packages add report features such as RMarkdown, knitr and Sweave. Easy package installation and use have contributed to the language's adoption in data science.

The Comprehensive R Archive Network (CRAN) was founded in 1997 by Kurt Hornik and Fritz Leisch to host ''R'''s source code, executable files, documentation, and user-created packages. Its name and scope mimic the Comprehensive TeX Archive Network and the Comprehensive Perl Archive Network. CRAN originally had three mirrors and 12 contributed packages. As of December 2022, it had 103 mirrors and 18,976 contributed packages.
Packages are also available on repositories such as R-Forge, Omegahat, and GitHub.


The Task Views on the CRAN website list packages in fields such as finance, genetics, high-performance computing, machine learning, medical imaging, meta-analysis, social sciences, and spatial statistics.

The Bioconductor project provides packages for genomic data analysis, complementary DNA, microarray, and high-throughput sequencing methods.

Packages add the capability to implement various statistical techniques such as linear, generalized linear and nonlinear modeling, classical statistical tests, spatial analysis, time-series analysis, and clustering.
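
Several of these techniques are already available in the ''stats'' package that ships with R, and contributed packages extend them. The following is a brief sketch using the bundled mtcars and AirPassengers data sets; the choice of variables and the variable names are assumptions made for illustration only.

```r
# mtcars and AirPassengers are data sets bundled with R.

# Linear model: fuel economy as a function of weight.
linearFit <- lm( mpg ~ wt, data = mtcars )

# Generalized linear model: logistic regression of transmission type.
logisticFit <- glm( am ~ wt, data = mtcars, family = binomial )

# Classical statistical test: Welch two-sample t-test.
tTest <- t.test( mpg ~ am, data = mtcars )

# Time-series analysis: autocorrelation of a monthly series.
autocorrelation <- acf( AirPassengers, plot = FALSE )

# Clustering: k-means with two clusters.
set.seed( 1 )
clusters <- kmeans( mtcars[ , c( "mpg", "wt" ) ], centers = 2 )

print( coef( linearFit ) )
print( tTest$p.value )
print( table( clusters$cluster ) )
```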

The tidyverse package is organized to have a common interface. Each function in the package is designed to work together with all the other functions in the package.

Installing a package occurs only once. To install ''tidyverse'':

> install.packages( "tidyverse" )


To load the functions, data, and documentation of a package into a session, execute the library() function. To load ''tidyverse'':

> library( tidyverse )
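
The install-once, load-per-session pattern can be sketched as a small helper. The name loadPackage is hypothetical and not part of R; the sketch relies only on the base functions requireNamespace(), install.packages(), and library(). It is demonstrated on the bundled ''stats'' package so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing,
# then attach it to the current session.
loadPackage <- function( packageName )
{
    if ( !requireNamespace( packageName, quietly = TRUE ) )
    {
        install.packages( packageName )
    }

    # character.only = TRUE lets library() accept the name as a string.
    library( packageName, character.only = TRUE )
}

loadPackage( "stats" )

# search() lists the attached packages.
print( "package:stats" %in% search() )
```

Calling loadPackage( "tidyverse" ) would follow the same path, installing from the configured repository on first use.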


  Interfaces 

R comes installed with a command line console. Various integrated development environments (IDEs) are available for installation. IDEs for R include R.app (OSX/macOS only), Rattle GUI, R Commander, RKWard, RStudio, and Tinn-R.


General purpose IDEs that support R include Eclipse via the StatET plugin and Visual Studio via R Tools for Visual Studio.

Editors that support R include Emacs, Vim via the Nvim-R plugin, Kate, LyX via Sweave, WinEdt, and Jupyter.

Scripting languages that support R include Python, Perl, Ruby, F#, and Julia.

General purpose programming languages that support R include Java via the Rserve socket server, and .NET C#.

Statistical frameworks which use R in the background include Jamovi and JASP.

  Community 

The R Core Team was founded in 1997 to maintain the ''R'' source code. The R Foundation for Statistical Computing was founded in April 2003 to provide financial support. The R Consortium is a Linux Foundation project to develop ''R'' infrastructure.

''The R Journal'' is an open-access academic journal that features short to medium-length articles on the use and development of R. It includes articles on packages, programming tips, CRAN news, and foundation news.

The R community hosts many conferences and in-person meetups. These groups include:

* UseR!: an annual international R user conference
* Directions in Statistical Computing (DSC)
* R-Ladies: an organization to promote gender diversity in the R community
* SatRdays: R-focused conferences held on Saturdays
* R Conference
* Posit::conf (formerly known as Rstudio::conf)

  Implementations 

The main R implementation is written primarily in C, Fortran, and R itself. Other implementations include:

* pretty quick R (pqR), by Radford M. Neal, attempts to improve memory management.
* Renjin is an implementation of ''R'' for the Java Virtual Machine.
* CXXR and Riposte are implementations of ''R'' written in C++.
* Oracle's FastR is an implementation of ''R'', built on GraalVM.
* TIBCO Software, creator of S-PLUS, wrote TERR, an ''R'' implementation to integrate with Spotfire.

Microsoft R Open (MRO) was an ''R'' implementation. As of 30 June 2021, Microsoft started to phase out MRO in favor of the CRAN distribution.

  Commercial support 

 

Although R is an open-source project, some companies provide commercial support:
* Revolution Analytics provides commercial support for Revolution R.
* Oracle provides commercial support for the ''Big Data Appliance'', which integrates R into its other products.
* IBM provides commercial support for in-Hadoop execution of R.

  See also 

* Comparison of numerical-analysis software
* Comparison of statistical packages
* List of numerical-analysis software
* List of statistical software
* Rmetrics



  External links 


* R Technical Papers
* Free Software Foundation




  