List of Useful R Commands
Hrishikesh D. Vinod*
August 22, 2018
Abstract
These commands are very basic and are intuitive in most cases.
They are adequate for a beginning statistics course. The material in
red font in this document can be copied and pasted to your R-GUI
(graphical user interface). The material in blue font is the output from
R.
1 R Preliminaries and Descriptive Statistics
Getting R and R-Studio
Visit the website for R, https://www.r-project.org, and download R. Once
there, look at the top of the left column and click on ‘CRAN’ under Download. Then choose
a mirror site, say https://cloud.r-project.org. Go to download R for
Windows (say), pick “base”, and accept the defaults.
You may also want to (optionally) download an integrated development environment (IDE) called RStudio at https://www.rstudio.org for easy manipulation of R code and useful automatic hints while typing R code.
Using R
First we clean up R memory.
rm(list=ls()) #this cleans R memory
* This note was prepared for my Statistics I students at Fordham.
Some useful commands are listed next. The most important is “c”, which
allows the user to combine/store a list of numbers/things in a vector. Within R
commands, note that # means comment. Everything after this symbol in a
line is ignored by R.
c(1,7,"name")
[1] "1"    "7"    "name"
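The above output shows that numbers and words can be included in a vector.
Note that the numbers are converted to character strings (shown in quotes),
because all elements of a vector must be of the same type.
Of course, the words must be placed in simple quotes (not smart quotes of
MS Word).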
R is an object-oriented language. Almost everything is an object with
a name. “x =” or “x <-” are both assignment operations to create an R
object named x. R purists do not like = as the assignment operator, but I do. I
like = because it requires less space and less typing.
The R object names are almost arbitrary, except that they cannot start
with a number or contain symbols other than the period and underscore. It is not advisable to use common commands as names of R objects (e.g., sum, mean, sd, c, sin, cos, pi, exp, etc.,
described later). Everything in R, including object names, is case-sensitive.
Note that 3x is not a valid name of an R object.
3x=1:4
The object name ‘3x’ in the above code returns an ERROR
Error: unexpected symbol in "3x"
For example, x=5 means 5 is stored under the name x. Also x <- c(1,2,3,4)
defines the variable x as the vector (1,2,3,4). Alternatively use x=1:4.
x=1:4
x #typing the name of an R object asks R to print it to the screen.
> x
[1] 1 2 3 4
sum(..., na.rm = FALSE) shows that sum is a function always available in
R, where the vector x to be summed is passed as its argument. ‘na.rm=FALSE’ is an optional argument with
default value FALSE, meaning that if there are missing values (NA’s or not-available data values) the sum will also be NA. This is a useful warning.
sum(x) # Calculates the sum of elements in vector x.
Now the output of sum command is:
> sum(x)
[1] 10
Now we illustrate the use of sum in the presence of missing data or NA’s.
We create a vector x with four numbers and one NA. To compute the sum
correctly, we need to use the option ‘na.rm=TRUE’. Otherwise the sum is
NA. This is a useful warning that there are missing data, as can happen
unknowingly.
x=c(1:3,NA,4);x
sum(x)
sum(x,na.rm=TRUE)
> x=c(1:3,NA,4)
> x
[1] 1 2 3 NA 4
> sum(x)
[1] NA
> sum(x,na.rm=TRUE)
[1] 10
The above output shows that the sum(x) is NA if we do not recognize the
presence of NA and explicitly ask R to remove it (na.rm means remove NAs)
before computing the sum.
The option ‘na.rm=TRUE’ is available for computation of mean, median,
standard deviation, variance, etc. Less sophisticated software can give an incorrect
number of observations and wrong answers in the presence of NA’s.
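As a small sketch of the na.rm option with other summary functions (the vector below is made up for illustration and is not one of the data sets used elsewhere in this note):

```r
# Illustrative data with one missing value (NA)
x = c(2, 4, NA, 12, 7)

mean(x)                 # NA, because of the missing value
mean(x, na.rm=TRUE)     # mean of the four non-missing values
median(x, na.rm=TRUE)   # median of the four non-missing values
sd(x, na.rm=TRUE)       # sample standard deviation after removing NA
sum(!is.na(x))          # number of non-missing observations
```

The last line counts the observations actually used, which is worth reporting alongside any statistic computed with na.rm=TRUE.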
q() #quits a session.
If R is expecting the continuation of a command, it prompts
with "+". It may be an indication that something is wrong, and it may be
better to press the escape key to get out. It can happen because parentheses do not
match or because of other syntax errors.
pi
exp(1)
print(c(pi,exp(1))) #prints to screen values of pi and e symbols
Note that exp is a function in R and exp(1) means e raised to the power 1. Note
also that the ‘c’ function of R defines a catalog or list of two or more values.
R does not understand a mere list of things without the c command. The print
command of R needs the ‘c’ above, because we want to print more than one
thing from a list.
> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> print(c(pi,exp(1))) #prints to screen values of pi and e symbols
[1] 3.141593 2.718282
Thus the transcendental numbers ‘e’ and π are already defined in R as exp(1)
and pi.
x=123*(10^(-9)) # multiplication is with * and
#raise to power is with the ^ symbol in R
x=123*(10^(-9)); x #semicolon allows two commands on the same line
Now the output of above commands is:
> x=123*(10^(-9)) # multiplication is with * and
#raise to power is with the ^ symbol in R
> x=123*(10^(-9)); x #semicolon allows two commands on the same line
[1] 1.23e-07
Printing x as 1.23e-07 uses scientific notation. If you do not want that,
use ‘format’ instead of print with the option scientific=FALSE as below:
format(x, scientific=FALSE) # print it as "0.000000123"
#this avoids the scientific notation
x #without the option, it prints 1.23e-07 or scientific notation.
Note that ‘format’ returns the number as a character string, hence the quotes
in the output below. Simply typing x will print x in scientific notation (the default).
> format(x, scientific=FALSE) # print it as "0.000000123"
[1] "0.000000123"
> #this avoids the scientific notation
> x #without the option, it prints 1.23e-07 or scientific notation.
[1] 1.23e-07
2 Measures of Central Tendency: mean, median, mode
x=c(2,4,0,12,7,2,7,2);x
length(x)# how many items in x?
mean(x) # Calculates the mean of elements in vector x.
median(x) # Calculates the median of x elements.
> x=c(2,4,0,12,7,2,7,2);x
[1] 2 4 0 12 7 2 7 2
> length(x)# how many items in x?
[1] 8
> mean(x) # Calculates the mean of elements in vector x.
[1] 4.5
> median(x) # Calculates the median of x elements.
[1] 3
Sorting and Tabulation
Now we consider R functions for sorting and tabulation (mode computation).
sort(x)# orders from the smallest to the largest
#useful for median, etc
table(x) # Calculates the number of repetitions of x values.
#Use in finding the mode of x.
The output of sort and table commands is:
> sort(x)
[1] 0 2 2 2 4 7 7 12
> table(x)
x
0 2 4 7 12
1 3 1 2 1
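Base R has no function that returns the statistical mode directly, so here is a minimal sketch of one way to read it off the table. The object names tab and modes are chosen only for illustration.

```r
x = c(2,4,0,12,7,2,7,2)
tab = table(x)                        # counts of each distinct value
modes = names(tab)[tab == max(tab)]   # label(s) with the highest count
as.numeric(modes)                     # convert the labels back to numbers
```

If two or more values tie for the highest count, this returns all of them.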
Now we consider R functions for percentile calculation. Note that the median is the 50th
percentile, the first quartile is the 25th percentile and the third quartile is the 75th percentile.
3 Additional Descriptive Statistics
Percentiles are designated as quantiles in the statistical literature. The probs argument of the R function quantile is not in percent terms but a proportion. Thus
the median proportion is 0.5, and it means that a 0.50 proportion of the data are below
the median of the data. Since statisticians do not agree on which method of
computing quantiles is the best, R provides the option to use any one of 9
methods. The default is type=7, which is a bit too sophisticated for Hawkes
Learning since the latter uses hand calculations.
#computes 5%, 45% and 95% percentiles.
x=c(2,4,0,12,7,2,7,2);x
quantile(x, probs=c(0.05, 0.45, 0.95), type=1)
quantile(x, probs=c(0.05, 0.45, 0.95), type=6)
Its output is
> x=c(2,4,0,12,7,2,7,2);x
[1] 2 4 0 12 7 2 7 2
> quantile(x, probs=c(0.05, 0.45, 0.95), type=1)
 5% 45% 95%
  0   2  12
> quantile(x, probs=c(0.05, 0.45, 0.95), type=6)
5% 45% 95%
0.0 2.1 12.0
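To see how much the choice of method matters on a small sample, the following sketch compares the default with the other types. It assumes only the quantile function already used above; R's help page for quantile lists types 1 through 9.

```r
x = c(2,4,0,12,7,2,7,2)
# type=7 is the default, so these two calls give the same answer
quantile(x, probs=0.45)
quantile(x, probs=0.45, type=7)
# the 45% quantile under each of the types 1 to 9, side by side
sapply(1:9, function(k) quantile(x, probs=0.45, type=k))
```

For large samples the types usually agree closely, so the choice matters mainly for short data sets like this one.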
In general the spread of the data is of interest. The simplest measure is the
overall range. For example, the volatility of stock returns is an important
indicator of the risk associated with that investment. Besides the range, “deviations from the mean” (x − x̄) provide information regarding the spread of
the data with respect to its own mean. However, since Σ(x − x̄) = 0 always
holds, the sum of deviations from the mean is useless for distinguishing between different data sets. Hence we can compare the mean of absolute deviations
(MAD) from the mean. Statisticians prefer the variance and standard deviation
(sd) of elements of vector x over MAD since they have convenient mathematical
properties (e.g., the derivative is easy to compute).
n=length(x);n #count how many items in x
sqrt(16) #should be 4; the square root function is sqrt
max(x)-min(x) #defines the range
dev=x-mean(x);dev#vector of deviations from the mean of x
sum(dev)#must be zero
sum(dev^2)/(n-1)# sample variance definition
var(x) # Calculates the sample variance of x.
#standard deviation is the square root of the variance
sqrt(var(x))
sd(x) # direct calculation of sample standard deviation of x
sum(dev^2)/n# population variance definition
#indirect calculation of population variance from var(x)
popvar=var(x)*(n-1)/n;popvar
sqrt(popvar) #computes the square root of population variance
popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd
Output of above code is next.
> sum(dev^2)/(n-1)# sample variance definition
[1] 15.42857
> var(x) # Calculates the sample variance of x.
[1] 15.42857
> #standard deviation is the square root of the variance
> sqrt(var(x))
[1] 3.927922
> sd(x) # direct calculation of sample standard deviation of x
[1] 3.927922
> sum(dev^2)/n# population variance definition
[1] 13.5
> #indirect calculation of population variance from var(x)
> popvar=var(x)*(n-1)/n;popvar
[1] 13.5
> sqrt(popvar) #computes the square root of population variance
[1] 3.674235
> popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd
[1] 3.674235
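The mean of absolute deviations (MAD) from the mean mentioned earlier has no dedicated function in base R, but it can be sketched in one line from the deviations. Note as a caution that R's built-in mad function computes a scaled median absolute deviation, which is a different quantity.

```r
x = c(2,4,0,12,7,2,7,2)
dev = x - mean(x)     # deviations from the mean of x
mean(abs(dev))        # mean of absolute deviations from the mean
mad(x)                # built-in: scaled MEDIAN absolute deviation, not the same
```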
factorial(n) # Calculates the factorial of integer n.
fivenum(x) # Calculates the five number summary of x.
summary(x) #prints six number summary Min, Q1, median, mean, Q3, Max
choose(x,y) # Calculates the combination: xCy ways of choosing y out of x
plot(x) # Plots x on a linear graph.
length(x) #counts the number of items in x e.g. sample size
cumsum(x)#computes cumulative sum e.g., cumulative frequency
cbind(x,y,z) #binds column vectors of identical length into a matrix table
rbind(x,y,w)#binds row vectors of identical length into a matrix
plot(x,y, type="l") #plots x against y as a line plot; type is lower case EL
sum(x*y) #computes the sum of products of corresponding values of x and y vectors
set.seed(123) #sets the seed of the random number generator
boxplot(x) #creates a box plot for a single variable
boxplot(x,y)#creates box plots of x and y for side-by-side comparison
hist(x,breaks=5)# creates a histogram with about 5 bins (R treats breaks as a suggestion)
Brackets are used in R for subsetting and parentheses are used for listing
arguments of various functions. Most of the items above are functions. It
is fun to use brackets for subsetting. Do not use brackets to give arguments of
functions.
If x is a vector, typing x[3:5] will select the subset of items at
locations 3 to 5 inclusive.
If x is a matrix, x[-2,-2] will select the entire matrix x except for the second row and second column.
The minus means exclude that row or column.
x[-3, c(-1,-3)] means exclude row 3 and columns 1 and 3.
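A minimal sketch of these subsetting rules follows. The vector v and the 3 by 4 matrix m are made up for illustration only.

```r
v = c(10, 20, 30, 40, 50, 60)
v[3:5]                    # items at locations 3 to 5 of a vector

m = matrix(1:12, nrow=3)  # 3 rows, 4 columns, filled column by column
m                         # print the whole matrix
m[-2, -2]                 # everything except row 2 and column 2
m[-3, c(-1,-3)]           # exclude row 3 and columns 1 and 3
m[2, ]                    # a blank after the comma means all columns of row 2
```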
Numerical integration is done in R as follows. We integrate the N(0,1) density
from -1.96 to 1.96 to get the answer 0.95 (approximately) in the code below. More sophisticated integrations are
also available in various packages.
Any sequence is created easily by the seq function shown below.
integrate(dnorm, -1.96,1.96)
sample(1:25) #prints the first 25 numbers in random order.
seq(1,9,by=2)# creates 1,3,5,7,9.
seq(from=1,to=9,by=2)
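As a quick check on the numerical integration above, the following sketch compares it with the same area computed from cumulative probabilities. The name area is chosen for illustration; the value component holds the numerical estimate.

```r
# Area under the N(0,1) density between -1.96 and 1.96 by numerical integration
area = integrate(dnorm, -1.96, 1.96)
area$value                    # the numerical estimate of the integral
# The same area from the cumulative probability function pnorm
pnorm(1.96) - pnorm(-1.96)
```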
‘read.table’, ‘read.DIF’, etc. are commands for reading data, but they can be
hard to use. It may be just as good to copy the numbers from an MS Word file or
text file and read them with x=c(.., ..,)
The rounding in R by using the R command round is too sophisticated for
Hawkes Learning, which uses the biased method we learned in High School.
For example, the R command round(c(-0.5,0.5,1.5,2.5)) rounds halves to the
nearest even number, giving (0, 0, 2, 2), to avoid bias. This is different from the rounding we learned in High School, which would give (−1, 1, 2, 3).
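If the High School rule is needed, for example to match hand calculations, one possible sketch is the small helper below. The name round_half_up is hypothetical and is not a base R function.

```r
# Hypothetical helper (not part of base R): round halves away from zero
round_half_up = function(x) {
  sign(x) * floor(abs(x) + 0.5)
}
x = c(-0.5, 0.5, 1.5, 2.5)
round(x)           # R's rule: halves go to the nearest even number
round_half_up(x)   # High School rule: halves go away from zero
```

This sketch rounds to whole numbers only; rounding to a given number of decimals would need extra scaling and is subject to the usual floating point caveats.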
4 Probability Distributions
4.1 Uniform Distribution
How to create random numbers from the uniform density? In R ‘unif’ means
uniform and prefix:
d means density,
p means cumulative probability
q means quantile
r means random numbers from that density. Thus,
plot(dunif) #range is 0 to 1 as default
x=runif(10)#creates 10 uniform random numbers in x
x #print x
punif(1)#area under uniform between 0 to 1
punif(0.5)#area 0 to 0.5
qunif(0.5)# given area=0.5, the quantile of uniform
> x=runif(10)#creates 10 uniform random numbers in x
> x #print x
[1] 0.22820188 0.01532989 0.12898156 0.09338193 0.23688501 0.79114741
[7] 0.59973157 0.91014771 0.56042455 0.75570477
> punif(1)#area under uniform between 0 to 1
[1] 1
> punif(0.5)#area 0 to 0.5
[1] 0.5
> qunif(0.5)# given area=0.5, the quantile of uniform
[1] 0.5
4.2 Binomial Distribution
Just like uniform, the code name for Binomial is ‘binom’ and all the same
prefixes (d,p,q,r) mean the same thing as they did for the uniform density.
dbinom gives the Binomial probability of exactly x successes when the probability of
one success in one trial is p and when the number of trials is n. Note that n
also equals the largest number of successes.
p=0.5; n=3; x=0:n
db=dbinom(x,prob=p,size=n);db
names(db)=x #show labels on x axis
barplot(db, xlab="x")
The Binomial probabilities (1/8, 3/8, 3/8, 1/8) are correctly produced by
dbinom. The graphical output is omitted for brevity.
> db=dbinom(x,prob=p,size=n);db
[1] 0.125 0.375 0.375 0.125
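The dbinom output can be checked against the textbook formula nCx p^x (1−p)^(n−x). The sketch below does this with the same p and n; the name byformula is chosen for illustration.

```r
p = 0.5; n = 3; x = 0:n
byformula = choose(n, x) * p^x * (1 - p)^(n - x)    # nCx p^x (1-p)^(n-x)
byformula
all.equal(byformula, dbinom(x, size=n, prob=p))     # TRUE if they agree
sum(byformula)                                      # probabilities add up to 1
```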
4.3 Poisson Distribution
Again, the code name for Poisson is ‘pois’ and all the same prefixes (d,p,q,r)
mean the same thing as they did for the uniform density.
lambda=1; x=0:5
dp=dpois(x,lambda);dp
names(dp)=x
barplot(dp, xlab="x")
The Poisson probabilities exp(−λ)λ^x/x! are correctly calculated by dpois.
> lambda=1; x=0:5
> dp=dpois(x,lambda);dp
[1] 0.367879441 0.367879441 0.183939721 0.061313240 0.015328310 0.003065662
The plot is omitted for brevity.
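As a sketch of how the d and p prefixes relate for the Poisson, the code below checks dpois against the formula exp(−λ)λ^x/x! and then obtains a cumulative probability in two ways. The name byformula is chosen for illustration.

```r
lambda = 1; x = 0:5
byformula = exp(-lambda) * lambda^x / factorial(x)
all.equal(byformula, dpois(x, lambda))   # TRUE if they agree
# P(X <= 2): add the first three probabilities, or use ppois directly
sum(dpois(0:2, lambda))
ppois(2, lambda)
```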
4.4 Hypergeometric Distribution
The code name for Hypergeometric is ‘hyper’ and all the same prefixes
(d,p,q,r) mean the same thing as they did for the uniform density. Unfortunately the notation used in R is less suitable for certain types of word
problems. It is easy to write one’s own function to compute these probabilities applicable when Binomial-type trials are dependent (e.g., pulling marbles
without replacement).
$P(x) = \dfrac{{}_{k}C_{x}\;{}_{N-k}C_{n-x}}{{}_{N}C_{n}}$, where the range of the discrete random variable
is $x = 0, 1, \ldots, \min(k, n)$.
N=10;n=4;k=3;minnk=min(n,k)
c(N,n,k,minnk)#check the values
x=0:minnk
px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n);px
Although the plot is omitted for brevity, one can check the computations and
also that the probabilities add up to unity: ΣP(x) = 1.
> N=10;n=4;k=3;minnk=min(n,k)
> c(N,n,k,minnk)
[1] 10 4 3 3
> x=0:minnk
> px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
> px
[1] 0.16666667 0.50000000 0.30000000 0.03333333
> names(px)=x;barplot(px)
> sum(px)
[1] 1
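For comparison, the same probabilities can be obtained from R's built-in dhyper once its argument names are mapped to ours. In the sketch below, the mapping in the comment is the part to check carefully, because R reuses the letters n and k with different meanings.

```r
N = 10; n = 4; k = 3
x = 0:min(n, k)
px = choose(k, x) * choose(N - k, n - x) / choose(N, n)
# dhyper(x, m, n, k) in R's notation:
#   m = number of "success" items in the population (our k)
#   n = number of "failure" items in the population (our N - k)
#   k = number of items drawn                        (our n)
pr = dhyper(x, m = k, n = N - k, k = n)
all.equal(px, pr)   # TRUE if the two computations agree
```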
4.5 Normal Density
Note that the standard Normal variable z ∼ N(0, 1) has mean
zero and variance 1. z is defined over the range (−∞, ∞). However, the
practical range is [−4, 4]. As above, the code name for Normal is ‘norm’
and all the same prefixes (d,p,q,r) mean the same thing as they did for the
uniform density.
z=seq(-4,4,by=.5)
dn=dnorm(z)
plot(z,dn,typ="l",main="Normal Density")
The output of the above code is in the attached figure. The option typ="l" has
the letter el, not the number 1, and requests a line plot (typ is accepted as an
abbreviation of the argument name type). The option main labels the
graph.
set.seed(25)
y=rnorm(10)#creates 10 standard normal random numbers N(0,1)
y
> set.seed(25)
> y=rnorm(10)#creates 10 standard normal random numbers N(0,1)
> y
[1] -0.21183360 -1.04159113 -1.15330756 0.32153150 -1.50012988 -0.44553326
[7] 1.73404543 0.51129562 0.09964504 -0.05789111
The cumulative probability of standard Normal density (z) from z=−∞ to
z=1 is given by pnorm. Given the area from z=−∞ to z as say 0.5, qnorm
gives the value of z. This is illustrated next.
pnorm(1) #area under N(0,1) always from minus infinity
qnorm(0.5)#gives the quantile z from cumulative probability
The output of above code is next.
Figure 1: Normal density (plot of dn against z, titled “Normal Density”)
> pnorm(1) #area under N(0,1) always from minus infinity
[1] 0.8413447
> qnorm(0.5)#gives the quantile z from cumulative probability
[1] 0
4.6 Student’s t density and degrees of freedom
Again, the code name for t-density is ‘t’ and all the same prefixes (d,p,q,r)
mean the same thing as they did for the uniform density. For example, if
degrees of freedom is specified to be 2 (df=2), the cumulative probability of
student’s t from t=−∞ to t=1 is given by the code:
pt(1, df=2) #gives the cumulative probability of t
qt(0.05,df=2)# similar to qnorm, gives the quantile of t
Note that the function ‘qt’ needs to know the cumulative probability in left
tail and yields the value of t at which this cumulative probability holds.
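The relation between pt and qt can be sketched as follows. The particular probabilities and degrees of freedom are chosen only for illustration.

```r
pt(1, df=2)                 # cumulative probability from -Inf to t=1
qt(pt(1, df=2), df=2)       # qt undoes pt and returns t=1
qt(0.05, df=2)              # left-tail 5% point, a negative number
-qt(0.95, df=2)             # same value by symmetry of the t density
qt(0.05, df=c(2, 10, 30))   # moves toward qnorm(0.05) as df grows
qnorm(0.05)
```

The last two lines show in numbers why the Normal table is often used in place of the t table when the degrees of freedom are large.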
The key point to remember is that once we have R, we do not need
standard statistical tables for various densities. One can compute all needed
probabilities on the fly, as needed, without looking up any tables.
5 Writing your own functions in R
This section may be skipped by readers in early stages of learning R. There
are literally thousands of functions written by thousands of programmers
around the world and freely available in R packages.
Here is a function to compute the permutations of n things taken k at a
time. The formula is nPk = n!/(n − k)!. Let the function name be permute.
Note that the word ‘function’ must be present and that it represents an algorithm needing some inputs (arguments) and returning some output, denoted
as ‘out’. A function can be a very long command extending over many
lines. Any long command in R can be entered on several lines simply by using
curly braces. Thus curly braces have a special meaning in R. The arguments
of functions are always in simple parentheses (). We have already mentioned
that brackets [] are used for subsetting.
permute=function(n,k){
out=factorial(n)/factorial(n-k)
return(out)
}
#example
permute(4,2)
When you write your own function, it is important to check it against a
known answer. For example, we know 4P2 is 12 and it checks out.
> permute=function(n,k){
+ out=factorial(n)/factorial(n-k)
+ return(out)
+ }
> #example
> permute(4,2)
[1] 12
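Checks against known answers can also be written into the code itself. The sketch below repeats the permute function and uses stopifnot, which stops with an error if any check fails and prints nothing otherwise. The identity nPk = nCk × k! is used as an extra check.

```r
permute = function(n, k) {
  out = factorial(n) / factorial(n - k)
  return(out)
}
# nPk should equal nCk times k!, since each combination can be ordered k! ways
permute(4, 2) == choose(4, 2) * factorial(2)
# silent if all checks pass, an error otherwise
stopifnot(permute(4, 2) == 12, permute(5, 5) == factorial(5))
```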
Now we illustrate a function to compute an alternative version of hypergeometric distribution as follows.
myhyper=function(N,n,k){
x=0:min(n,k)
px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
return(px)}
#example
myhyper(N=10,n=4,k=3)
When you write your own function, it is important to check it against a
known answer, given above for N=10, n=4, k=3. Abridged output is:
> myhyper(N=10,n=4,k=3)
[1] 0.16666667 0.50000000 0.30000000 0.03333333
6 Final Remarks
We show that R is far more convenient than a calculator, allowing us to give
names to our calculations and implement vast sets of calculations without the
tedium. We can also avoid the use of most probability distribution tables.
Some useful references are Kerns (2013), Vinod (2008), Verzani (2009), and
Kleiber and Zeileis (2008), among others.
References
Kerns, G. J. (2013), IPSUR: Introduction to Probability and Statistics Using R, R package version 1.5, URL https://CRAN.R-project.org/package=IPSUR.
Kleiber, C. and Zeileis, A. (2008), Applied Econometrics with R, New York: Springer-Verlag, ISBN 978-0-387-77316-2, URL http://CRAN.R-project.org/package=AER.
Verzani, J. (2009), UsingR: Data sets for the text "Using R for Introductory Statistics", R package version 0.1-12, URL https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
Vinod, H. D. (2008), Hands-on Intermediate Econometrics Using R: Templates for Extending Dozens of Practical Examples, Hackensack, NJ: World Scientific, ISBN 10-981-281-885-5, URL http://www.worldscibooks.com/economics/6895.html.
User manuals: The most official Introduction to R is at:
http://colinfay.me/intro-to-r/
This is the guide to base R for programmers new to the language. It covers
the basic syntax and data types, base graphics, and the built-in statistical
modeling functions. (I have a special fondness for this one).
The latest version of this file is available at the url:
http://www.fordham.edu/economics/vinod/R-commands-1.pdf