Using R Project to plot a histogram and a polygon

Anonymous
Asked: Feb 4th, 2019

Question description

Monthly data on Air Passengers is always available in R and can be accessed by the following command.

AP=as.numeric(AirPassengers)

This will create an R object called AP in the memory of R. Your job is to first find how many data items there are, and then to create a class-interval frequency table with 7 intervals and plot the information on a histogram and a polygon after including two dummy intervals.

Hints: n=length(AP)

Use the width formula to determine a suitable round number.

I suggest width wid=75 and LLFCI=100
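Following the hints, one possible solution sketch (this code is ours, assuming the two-dummy-interval approach described in the notes below; the right=FALSE option requests left-closed bins of the form [a,b)):

```r
AP = as.numeric(AirPassengers)  # built-in monthly airline passenger data
n = length(AP); n               # number of data items (144)
k = 7
(max(AP) - min(AP)) / k         # width formula value; wid=75 is a round number above it
wid = 75; LLFCI = 100
# k+3 break points give the k intervals plus two dummy intervals
brk = seq(from = LLFCI - wid, length.out = k + 3, by = wid)
h1 = hist(AP, breaks = brk, right = FALSE,
          main = "Frequency histogram and polygon", xlab = "Passengers")
lines(x = h1$mids, y = h1$counts, col = 2, lty = 2)  # frequency polygon
```

The two dummy intervals [25,100) and [625,700) carry zero frequency, so the polygon closes at both ends.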

Frequency Distribution, Histogram, Polygon and Ogives

Hrishikesh D. Vinod
January 26, 2019

Abstract: These class notes on elementary descriptive statistics teach students how to use R with the help of simple notions of frequency distributions, histograms and frequency polygons. They teach some basic statistical tools for a deeper understanding of available data. The material in the red font can be copied to your R console, and the material in the blue font can be verified as the R output. Learning the skill of using R allows students to apply the methods repeatedly to very large data sets in real-world situations. (This note was prepared for my Statistics I students at Fordham.)

1 R preliminaries

First we clean up R memory and generate illustrative data called 'xr' for rounded x. The # symbol in R means comment; everything after the symbol on that line is ignored by R.

rm(list=ls()) #this cleans R memory
options(prompt = " ", continue = " ", width = 68,
  useFancyQuotes = FALSE)
#these option settings remove R's `>' prompt and the `+'
#continuation-of-commands prompt, allowing easy copy and
#paste of my code below
print(date())
set.seed(22) #set seed for random number generator in R
x=runif(20,min=1,max=10) #get 20 uniform random numbers in [1, 10]
xr=round(x,1);xr #round to one place

1.1 R function to create a sequence of numbers

R has a very convenient function to create a sequence of numbers from the starting value (called from), approximate ending value (called to) and the increment (called by). The simplest example is given next.

seq(from=1,to=11, by=2)

Its output is

[1] 1 3 5 7 9 11

Another option available is 'length.out', which specifies how many items we want in the sequence (instead of specifying the end value with the 'to' value in the 'seq' command).
seq(from=5,length.out=6, by=1.5)

Its output is

> seq(from=5,length.out=6, by=1.5)
[1] 5.0 6.5 8.0 9.5 11.0 12.5

The R function 'seq' will be used for constructing k+1 class interval limits when making k classes or groups. The increment will be the width of the interval chosen by the formula given below in (1).

1.2 R function 'range' to get Xmin and Xmax from a set of numbers

R has a very convenient function called 'range' to get the smallest and largest values in a set of numbers. It is illustrated as follows.

x=c(3,12,4,8,1)
range(x)

The output of the code is two numbers, 1 and 12, in that order. We shall use the range function in specifying the two x coordinates (xlim=range(brk)) when plotting certain lines on our ogive plots.

2 Typical Descriptive Statistics Questions

Now we are ready to state a typical problem in descriptive statistics. Use the following data to construct a frequency distribution table with five classes:

3.7 5.3 9.9 5.7 8.6 7.5 6.5 7.7 4.8 4.4
9.7 6.5 5.3 8.7 4.9 1.7 4.8 2.6 6.6 4.3

1. What should be the width of a class interval if we want to group these data into k=5 classes?
2. What is the lower limit of the first class interval when constructing k classes of equal width?
3. What are the class boundaries?
4. What is the frequency in each class interval?
5. What is the relative frequency in each interval? Who proved that it is related to the probability of a randomly selected measurement falling in that interval? (Ans: Lord Keynes)
6. What is the top-to-bottom cumulative frequency?
7. What is the bottom-to-top cumulative frequency?
8. Plot both cumulative frequencies on a plot and bring it to class.

The real power of using R is in doing descriptive statistics for very large data sets with thousands of numbers, quickly and accurately. I am attempting to teach this skill in this note.

3 Frequency Histogram and Polygon

We want to group the data containing n numbers into k classes.
We define an R object containing the data, then define n and k with the following R code. We use the 'c' or catalogue command and insert commas between the numbers as shown in the following code (in red font).

xr=c(3.7, 5.3, 9.9, 5.7, 8.6, 7.5, 6.5, 7.7, 4.8, 4.4,
  9.7, 6.5, 5.3, 8.7, 4.9, 1.7, 4.8, 2.6, 6.6, 4.3)
xr #this prints the contents of xr to the screen
n=length(xr);n #semicolon allows more R commands on the same line
k=5;k #this defines and prints chosen value of k

The output of the above R code (blue font) is as follows.

xr #this prints the contents of xr to the screen
[1] 3.7 5.3 9.9 5.7 8.6 7.5 6.5 7.7 4.8 4.4 9.7 6.5 5.3 8.7 4.9 1.7
[17] 4.8 2.6 6.6 4.3
n=length(xr);n #semicolon allows more R commands on the same line
[1] 20
k=5;k #this defines and prints chosen value of k
[1] 5

Thus we have n = 20 items to classify into k = 5 intervals. The interval width must be chosen according to the formula:

width >= (Xmax - Xmin) / k,    (1)

where Xmax means the largest value in the data to be classified and Xmin means the smallest value in the data to be classified. For our data, the following code translates the formula into R to help determine the width. The actual width chosen must exceed this quantity and be a nice round number. This involves judgment and may not be a unique choice for every researcher.

Xmax=max(xr);Xmax
Xmin=min(xr);Xmin
width=(Xmax-Xmin)/k;width

The output of the above code is next.

Xmax=max(xr);Xmax
[1] 9.9
Xmin=min(xr);Xmin
[1] 1.7
width=(Xmax-Xmin)/k;width
[1] 1.64

We use personal taste to choose a round number for the width as wid=2, exceeding the 1.64 that the width formula gave above. We begin the actual classification with the 'lower limit of the first class interval' (LLFCI), which should be no larger than the smallest data value: LLFCI <= Xmin. We pick the round number 0 as LLFCI for our example.

wid=2;wid #chosen width 2>1.64, must be larger than what formula gave
LLFCI=0;LLFCI
brk=seq(from=0,length.out=k+1,by=wid);brk

The last command above uses the rather useful R function 'seq' to create a sequence of 'length.out=k+1' class break points or class limits of k class intervals, starting with 0 (from=0) in sequential increments of the chosen width (by=wid). These are called break points for classification.

3.1 Half-open class interval limits common in Economics

For continuous data the class intervals are constructed as half-open intervals [a,b), where the left limit 'a' is included but the right limit 'b' is not included in the interval expressed this way. The upper range is a value infinitesimally below b. This is a standard mathematical notation and concept. The bracket-side value is actually reached, but the parenthesis-side value is never reached. The interval ends an infinitesimal value below the limit identified by the ')'. For example: [15,21) is half-open, wherein 15 does belong in the interval but 21 does not. The interval [20,30] is closed, with both limits in. (100,200] is half-open, where the left limit 100 is excluded but the right limit 200 is included.

In classification jobs it is customary to use left-closed and right-open intervals. In general, the first interval is [LLFCI, LLFCI+wid). In our example, where LLFCI=0 and the chosen width is 2, the first class interval is [0, 2). The next interval is [2, 4), the next is [4, 6), and so on, with the understanding that binning is done so that the number 2 will be in the second class interval [2,4) and the number 4 will be placed in the third group or third interval [4, 6).

3.2 Using R to get break points for half-open intervals

We need to make k=5 half-open class intervals of width 'wid=2'. How many break points do we need? Verify that we need k+1 numbers to define all limits of all class intervals.
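As a check on this binning convention, base R's 'cut' function (a sketch of ours, not used in the notes) can tabulate the same left-closed intervals directly; the option right=FALSE requests the [a,b) convention:

```r
xr = c(3.7, 5.3, 9.9, 5.7, 8.6, 7.5, 6.5, 7.7, 4.8, 4.4,
       9.7, 6.5, 5.3, 8.7, 4.9, 1.7, 4.8, 2.6, 6.6, 4.3)
brk = seq(from = 0, length.out = 6, by = 2)  # break points 0 2 4 6 8 10
# right = FALSE makes each interval left-closed, i.e. [a, b)
bins = cut(xr, breaks = brk, right = FALSE)
table(bins)  # frequencies for [0,2) [2,4) [4,6) [6,8) [8,10)
```

The tabulated counts agree with the histogram frequencies obtained below.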
Using R and our personal judgment in choosing the round number LLFCI=0:

wid=2;wid #define the chosen width exceeding formula value
[1] 2
LLFCI=0;LLFCI #lower lim of first class interval
[1] 0
brk=seq(from=0,length.out=k+1,by=wid);brk
[1] 0 2 4 6 8 10

3.3 Using R to create a histogram and a frequency table

Now we use the R command 'hist' to draw a histogram of the data according to the breaks in the R object called 'brk' defined above. Actually we create an R object called 'h1' to hold the histogram.

Naming and accessing R outputs. The hist function in R classifies the data xr by binning it. (Note that hist's default is right-closed intervals (a,b]; the option right=FALSE requests the left-closed [a,b) convention described above. With these data no observation falls exactly on a break point, so the counts are the same either way.) The left side of our R command using the 'hist' function has been named by us to be 'h1'. The choice of the name is completely arbitrary. It is an R object which holds the entire output of the 'hist' command. What the parts of the output are, and how to access them with the dollar symbol, is a worthy skill to learn. The object h1 has several components with self-explanatory names found by the 'names' command. All the components are accessible without having to retype their values. For example, the frequencies are named 'counts' and are accessed by the command fr=h1$counts.

h1=hist(xr,breaks=brk)
names(h1)
plot(h1)

The output of the 'plot(h1)' command is temporarily suppressed, since it is convenient to plot both the frequency histogram and the polygon on the same plot later. Of course, a disadvantage of such a plot is that we will need two dummy intervals and two more break points. I think it is worth the trouble.

names(h1)
[1] "breaks" "counts" "density" "mids" "xname"
[6] "equidist"

Of the various names, we need 'counts' for frequency counts and 'mids' for midpoints of the various intervals. We access them by the R object name (h1 here) followed by the dollar symbol, which in turn is followed by a suitable name from the above list.
For example, 'fr=h1$counts' will give us the frequencies in the form of an R object we called 'fr', extracted from the histogram object we named 'h1' in earlier commands. For another example, 'mid=h1$mids' will give us the interval midpoints extracted from the object 'h1' and named by us 'mid'.

3.4 Using R to create a table for the frequency polygon using additional dummy intervals

A polygon is a figure with several sides joined. It cannot have hanging sides. So it is customary to insert a dummy class interval with zero frequency at the lower end and another dummy class interval with zero frequency at the top end. This means the augmented sequence of k+3 break points must start at newLLFCI = (LLFCI - wid) = -2 for our example. Since the total number of break points is k+3, it will include the two dummy intervals in addition to the k intervals specified. The new R object containing the break points inclusive of dummy intervals is also named 'brk', with two extra break points, and we will redefine the R object 'h1' for the histogram with these new break points.

#revised break points with two extra dummy intervals
brk=seq(from=LLFCI-wid,length.out=k+3,by=wid);brk
h1=hist(xr,breaks=brk)#breaks with dummy intervals
fr=h1$counts;fr
sum(fr)#verify that this is n
mid=h1$mids;mid

The output of the above code is:

brk=seq(from=LLFCI-wid,length.out=k+3,by=wid);brk
[1] -2 0 2 4 6 8 10 12
h1=hist(xr,breaks=brk)#breaks with dummy intervals
fr=h1$counts;fr
[1] 0 1 2 8 5 4 0
sum(fr)
[1] 20
mid=h1$mids;mid
[1] -1 1 3 5 7 9 11

3.5 Plotting in R from tables

Now we are ready to plot the frequency histogram and polygon. 'plot(h1)' will plot the simple histogram, but we may want to add a title to it. Titles are given by the option (main="Title"). Axis labels on the x and y axes are optionally indicated as (xlab="my X", ylab="My y").
The frequency polygon joins the consecutive midpoints at the tops of the histogram boxes. Note that if we did not have the two dummy intervals with zero frequency, the polygon would be left hanging (not complete). Thus the extra dummy intervals are indeed needed for a sensible polygon.

The R command 'lines' plots lines on top of a plot defined earlier. The horizontal or X coordinate is always the midpoint, defined earlier in a vector of values we have chosen to call 'mid' in our code above. The vertical or Y coordinate for the frequency polygon is the frequency, defined earlier in a vector of values called 'fr'. The plot line color 'col=2' (where 2 means red) and the plot line type 'lty=2' (the second line type, programmed to be a dashed line) yield red dashed lines for the frequency polygon in the R code below.

plot(h1,main="Frequency histogram and polygon",xlab="Classes",
  ylab="Frequencies")
lines(x=mid, y=fr, col=2, lty=2)
up.lims=brk[2:length(brk)];up.lims
lo.lims=brk[1:(length(brk)-1)];lo.lims
mtx=cbind(lo.lims,up.lims,mid,fr);mtx

Constructing a table (matrix) by binding columns of vectors. Recall that we have called 'lo.lims' the vector of lower limits of our classes. Also, we have called 'up.lims' the vector of upper limits of our classes, the 'mid' vector contains our midpoints, and 'fr' contains the frequencies. As long as these vectors are of exactly the same length, we can column-bind them into a matrix by the R command 'cbind' used above: 'mtx=cbind(lo.lims,up.lims,mid,fr)'. The mtx is the name of the R object containing these vectors in the order we choose.
brk=seq(from=LLFCI-wid,length.out=k+3,by=wid);brk
[1] -2 0 2 4 6 8 10 12
up.lims=brk[2:length(brk)];up.lims
[1] 0 2 4 6 8 10 12
lo.lims=brk[1:(length(brk)-1)];lo.lims
[1] -2 0 2 4 6 8 10
mtx=cbind(lo.lims,up.lims,mid,fr);mtx
     lo.lims up.lims mid fr
[1,]      -2       0  -1  0
[2,]       0       2   1  1
[3,]       2       4   3  2
[4,]       4       6   5  8
[5,]       6       8   7  5
[6,]       8      10   9  4
[7,]      10      12  11  0

[Figure 1: Histogram and Polygon. Plot title: "Frequency histogram and polygon"; x axis: Classes, y axis: Frequencies.]

4 Median and Two Ogives

Classified-data frequencies, when divided by the total frequency n, are defined as relative frequencies. Thus the relative frequency in the j-th class interval is fj/n, where n = sum over j=1,...,k of fj. We use the example from the previous section and assume that all code used above is in the memory of R. The R command 'rev', which reverses a sequence, is used below twice to get the 'greater than' ogive.

Less than ogive: cu.tb = cumulate 'fr' top to bottom. The "less than" ogive refers to cumulative frequency from the top to the bottom, denoted 'cu.tb' in the code below. Note that we have two extra dummy classes at the two ends with zero frequencies in each. In the following code we use the 'cumsum' function to compute cumulative frequencies. The interpretation of the "less than" ogive is that it represents cumulated frequency "less than" the upper limit of each class interval. Hence it has to be plotted against the upper limits of each class. We list these upper limits in the R object named 'up.lims'. They are the x coordinates when plotting the less-than ogive.

Greater than ogive: cu.bt = cumulate 'fr' bottom to top. The "greater than" ogive uses cumulative frequency computed from the bottom to the top. This requires applying the 'cumsum' function to the frequencies 'fr' in reverse order and again reversing the order of the resulting list. Order reversal is achieved by the R function 'rev'. For example, rev applied to (1,3,6) gives (6,3,1).
After computing the cumulative frequencies, we have to use the 'rev' function of R a second time to arrange the cu.bt values correctly in our table. The interpretation of the "greater than" ogive is that it represents cumulated frequency "greater than" the lower limit of each class interval. Hence it has to be plotted against the lower limits of each class. We list these lower limits in the R object named 'lo.lims', which become our x coordinates in plotting the greater-than ogive. When using the 'plot' function of R, remember that the option (typ="l") has a lower-case L, not the number one, and means we want the points joined by lines.

cu.tb=cumsum(fr);cu.tb#cumulative freq. top to bottom
m=length(cu.tb);m #how many cumulative frequencies?
cu.bt=rev(cumsum(rev(fr)));cu.bt#cum freq bottom to top
up.lims=brk[2:length(brk)];up.lims
lo.lims=brk[1:(length(brk)-1)];lo.lims
plot(up.lims,cu.tb, typ="l", xlim=range(brk), ylim=c(0,n),
  xlab="Class Range",ylab="cumulative frequencies",
  main="Median and 'Less than' / 'Greater than' Ogives",col=1)
lines(lo.lims,cu.bt,typ="l",lty=2,col=2)
abline(v=median(xr))
nam=c("Less than", "Greater than")
legend(x=-1,y=13,legend=nam, lty=c(1:2),col=1:2)

The plot command first plots the less-than ogive in the range defined by the x and y axis limits denoted 'xlim' and 'ylim'. The option 'lty' chooses the line type and 'col' chooses the color of the line. The command 'abline(v=median(xr))' plots a vertical line (option v=) at the exact median of the original data. The legend indicating what the lines in the graph mean is produced by the two commands 'nam=c("Less than", "Greater than")', which defines the names in the legend, and 'legend(x=-1,y=13,legend=nam, lty=c(1:2),col=1:2)', where x and y are the coordinates on the plot where the legend sits, with line types and colors chosen by the options 'lty' and 'col'.
The option for line type in the command 'plot(up.lims,cu.tb, typ="l")' specifies the letter lower-case L (not the number one), where lower-case L stands for lines joined in the plot. The output of the above code is as follows.

cu.tb=cumsum(fr);cu.tb#cumulative freq. top to bottom
[1] 0 1 3 11 16 20 20
m=length(cu.tb);m #how many cumulative frequencies?
[1] 7
cu.bt=rev(cumsum(rev(fr)));cu.bt#cum freq bottom to top
[1] 20 20 19 17 9 4 0
up.lims=brk[2:length(brk)];up.lims
[1] 0 2 4 6 8 10 12
lo.lims=brk[1:(length(brk)-1)];lo.lims
[1] -2 0 2 4 6 8 10
plot(up.lims,cu.tb, typ="l", xlim=range(brk), ylim=c(0,n),
  xlab="Class Range",ylab="cumulative frequencies",
  main="Median and 'Less than' / 'Greater than' Ogives",col=1)
lines(lo.lims,cu.bt,typ="l",lty=2,col=2)
abline(v=median(xr))
nam=c("Less than", "Greater than")
legend(x=-1,y=13,legend=nam, lty=c(1:2),col=1:2)
mtx2=cbind(mtx,cu.tb,cu.bt);mtx2
     lo.lims up.lims mid fr cu.tb cu.bt
[1,]      -2       0  -1  0     0    20
[2,]       0       2   1  1     1    20
[3,]       2       4   3  2     3    19
[4,]       4       6   5  8    11    17
[5,]       6       8   7  5    16     9
[6,]       8      10   9  4    20     4
[7,]      10      12  11  0    20     0

The point of intersection of the two ogives has the interpretation that it helps us find, on the horizontal scale, the value of the measurement where the frequency 'less than' this value is exactly the same as the frequency 'greater than' it. This is also the definition of the median as the middlemost measurement, with an equal number of points above and below.

[Figure 2: Median and Intersecting Ogives. Plot title: "Median and 'Less than' / 'Greater than' Ogives"; x axis: Class Range, y axis: cumulative frequencies.]

The graphically obtained median suffers from the fact that we have classified the data into class intervals and in effect pretended that all data within an interval equal the midpoint of that interval. This causes some loss of accuracy. Figure 2 has the exact median at the vertical line (at 5.5), which is slightly below the point of intersection of the two ogives.
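The gap between the grouped and exact medians can be quantified with the standard interpolation formula for the median of classified data (a textbook formula we add here, not part of the notes): median is approximately L + ((n/2 - F)/f) * wid, where L is the lower limit of the median class, F the cumulative frequency below it, and f its frequency.

```r
fr = c(0, 1, 2, 8, 5, 4, 0)        # class frequencies incl. dummy intervals
lo.lims = c(-2, 0, 2, 4, 6, 8, 10) # lower class limits
wid = 2
n = sum(fr)
cu = cumsum(fr)
j = which(cu >= n/2)[1]            # median class: first class whose cum freq reaches n/2
L = lo.lims[j]                     # lower limit of the median class
F.below = if (j > 1) cu[j - 1] else 0  # cumulative frequency below the median class
med.grouped = L + (n/2 - F.below) / fr[j] * wid
med.grouped                        # 5.75, near the ogive intersection; exact median is 5.5
```

The grouped value 5.75 sits slightly above the exact median 5.5, matching what Figure 2 shows graphically.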
The less-than ogive gives the frequency of occurrence of values less than the indicated value on the horizontal axis. By contrast, the greater-than ogive gives the frequency of occurrence of values larger than the indicated value on the horizontal axis.

5 Empirical Cumulative Distribution Function (ecdf) and the 'Less Than' Ogive

We have studied the frequency distribution and its graphical representation using the histogram and polygon. We have seen that there is a loss of information in using classified (grouped) data, evidenced by the fact that the true median does not coincide with the measurement at the intersection of the less-than ogive with the greater-than ogive. How can we avoid this loss of information? This is an old problem in Statistics, related to efficient and sufficient statistics. The details are beyond the scope of elementary statistics. The idea is to use the sorted data and treat it as a less-than ogive or empirical cumulative distribution function. It is easiest to explain the idea with the help of our example.

set.seed(22)
x=runif(20,min=1,max=10)#20 uniform random numbers in [1, 10]
xr=round(x,1);xr#round to one place
n=length(xr)
sxr=sort(xr);sxr #sorted rounded x data
fr1=rep(1/n,n)# rep=repeat, frequency for each observation=1/n
fr2=c(0,fr1,0)#inserting dummy intervals at start and at end
sxr2=c(0,sxr,sxr[n]) #after adding two dummy intervals
cumtb=cumsum(fr2)#cumulate top to bottom =tb for less than ogive
cumbt=rev(cumsum(rev(fr2)))#cumulate bottom to top=bt
cbind(sxr2,fr2,cumtb,cumbt)

Now we have constructed n = 20 classes, with a suitable frequency of 1/n for each class, after adding the dummy intervals and defining the cumulative sum of the frequencies in the column entitled 'cumtb'. The contents of this column are obtained by adding the frequencies denoted in the code as fr2 from top to bottom, as before, for the less-than ogive.
sxr2=c(0,sxr,sxr[n]) #after adding two dummy intervals
cumtb=cumsum(fr2)
cumbt=rev(cumsum(rev(fr2)))
cbind(sxr2,fr2,cumtb,cumbt)
      sxr2  fr2 cumtb cumbt
 [1,]  0.0 0.00  0.00  1.00
 [2,]  1.7 0.05  0.05  1.00
 [3,]  2.6 0.05  0.10  0.95
 [4,]  3.7 0.05  0.15  0.90
 [5,]  4.3 0.05  0.20  0.85
 [6,]  4.4 0.05  0.25  0.80
 [7,]  4.8 0.05  0.30  0.75
 [8,]  4.8 0.05  0.35  0.70
 [9,]  4.9 0.05  0.40  0.65
[10,]  5.3 0.05  0.45  0.60
[11,]  5.3 0.05  0.50  0.55
[12,]  5.7 0.05  0.55  0.50
[13,]  6.5 0.05  0.60  0.45
[14,]  6.5 0.05  0.65  0.40
[15,]  6.6 0.05  0.70  0.35
[16,]  7.5 0.05  0.75  0.30
[17,]  7.7 0.05  0.80  0.25
[18,]  8.6 0.05  0.85  0.20
[19,]  8.7 0.05  0.90  0.15
[20,]  9.7 0.05  0.95  0.10
[21,]  9.9 0.05  1.00  0.05
[22,]  9.9 0.00  1.00  0.00

R already has the capability to plot the empirical cumulative distribution function (ecdf) by using the function 'ecdf' to create the suitable calculations, depicting that the frequency value remains constant between consecutive ordered values while rising in a step of size 1/n at each observation. The option 'verticals=TRUE' gives the step function as needed for depicting the ecdf in Figure 3. The cumulative probability obviously goes from 0 to 1. The reverse cumulative probability goes from 1 to 0.

plot(ecdf(xr),verticals=TRUE)

Note that there is no loss of information, since all data points are used in the construction of the ecdf. One way to verify this is to find the median from the intersection of the "less than ogive" (the ecdf) and the "greater than ogive" (a reversed ecdf). The intersection of the two ogives occurs at the exact median of the data in Figure 4. The code abline(v=median(xr)) shows how to draw a vertical line at the median of the data. That is, obtaining the median from these ogives involves no loss of information (accuracy). Let us use the above example with the n = 20 artificial data used above. The notion of 'no loss of information' is formally described by saying that the ecdf is a sufficient statistic.
plot(sxr2,cumtb,typ="l",xlab="sorted x",
  ylab="cumulative probabilities")
lines(sxr2,cumbt,typ="l",lty=2,col=2)
abline(v=median(xr))#plot vertical axis at the median
title("Median from Intersecting Ogives")
nam=c("Less than Ogive", "Greater than Ogive")
legend(x=0,y=0.8,legend=nam, lty=c(1:2),col=1:2)

[Figure 3: Empirical Cumulative Distribution Function. Plot title: ecdf(xr); x axis: x, y axis: Fn(x).]

The histogram does involve some loss of information, and the creation of class intervals is somewhat arbitrary. The frequency polygon is an approximation to the relative frequencies at each measurement. It can be formally proved that the frequency polygon in terms of relative frequencies represents the probability distribution associated with the data, where some smoothing is involved when we join consecutive midpoints. Using the power of computers, this smoothing can be done with the more advanced tool called kernel smoothing. It begins with the ecdf (no loss of information in the ecdf) and views the density function as a numerical derivative of the ecdf. R allows us to see the smoothed probability distribution with simple commands as follows, depicted in Figure 5.

plot(density(xr),main="Kernel density")

In conclusion, we have learned how to use an arbitrary data set and understand the hidden probability structure behind it.

[Figure 4: Intersection of two ogives from the empirical cumulative distribution function and its reverse, after adding two dummy values at the two ends of the data set. Plot title: "Median from Intersecting Ogives"; x axis: sorted x, y axis: cumulative probabilities.]

[Figure 5: Density function from smoothed derivative of ecdf with kernel weights. Plot title: "Kernel density"; N = 20, Bandwidth = 1.051.]
List of Useful R Commands

Hrishikesh D. Vinod
August 22, 2018

Abstract: These commands are very basic and are intuitive in most cases. They are adequate for a beginning statistics course. The material in red font in this document can be copied and pasted to your R-GUI (graphical user interface). The material in blue font is the output from R. (This note was prepared for my Statistics I students at Fordham.)

1 R preliminaries and Descriptive Statistics

Getting R and R-Studio. Visit the website for R and download R: https://www.r-project.org. Once there, see the left column top and click on 'CRAN' under Download. Then choose a mirror site, say https://cloud.r-project.org. Go to "download R for Windows" (say), pick "base", and accept the defaults. You may also want to (optionally) download the integrated development environment (IDE) called RStudio at https://www.rstudio.org for easy manipulation of R code and useful automatic hints while typing R code.

Using R. First we clean up R memory.

rm(list=ls()) #this cleans R memory

Some useful commands are listed next. The most important is "c", which allows the user to combine/store a list of numbers/things in a vector. Within R commands, note that # means comment; everything after this symbol in a line is ignored by R.

c(1,7,"name")
[1] "1" "7" "name"

The above output shows that numbers and words can be included in a vector. Of course, the words must be placed in simple quotes (not the smart quotes of MS Word).

R is an object-oriented language. Almost everything is an object with a name. "x =" and "x <-" are both assignment operations to create an R object named x. R purists do not like = as the assignment operator, as I do. I like = because it requires less space and less typing. R object names are almost arbitrary, except that they cannot start with numbers or contain symbols. It is not advisable to use common commands as names of R objects (e.g. sum, mean, sd, c, sin, cos, pi, exp, etc., described later).
Everything in R, including object names, is case-sensitive. Note that 3x is not a valid name of an R object.

3x=1:4

The object name '3x' in the above code returns an error:

Error: unexpected symbol in "3x"

For example, x=5 means 5 stored under the name x. Also x <- c(1,2,3,4) defines the variable x as (1,2,3,4). Alternatively, use x=1:4.

x=1:4
x #typing the name of an R object asks R to print it to the screen
> x
[1] 1 2 3 4

The signature sum(..., na.rm = FALSE) shows that sum is a function always available in R. 'na.rm=FALSE' is an optional argument with default value FALSE, meaning that if there are missing values (NA's, or not-available data values) the sum will also be NA. This is a useful warning.

sum(x) # Calculates the sum of elements in vector x.

Now the output of the sum command is:

> sum(x)
[1] 10

Now we illustrate the use of sum in the presence of missing data or NA's. We create a vector x with four numbers and one NA. To compute the sum correctly, we need to use the option 'na.rm=TRUE'. Otherwise the sum is NA. This is a useful warning that there are missing data, as can happen unknowingly.

x=c(1:3,NA,4);x
sum(x)
sum(x,na.rm=TRUE)

> x=c(1:3,NA,4)
> x
[1] 1 2 3 NA 4
> sum(x)
[1] NA
> sum(x,na.rm=TRUE)
[1] 10

The above output shows that sum(x) is NA if we do not recognize the presence of NA and explicitly ask R to remove it (na.rm means remove NA's) before computing the sum. The option 'na.rm=TRUE' is available for the computation of the mean, median, standard deviation, variance, etc. Less sophisticated software gives an incorrect number of observations and wrong answers in the presence of NA's.

q() #quits a session

If R is expecting a continuing command, it prompts with "+". This may be an indication that something is wrong, and it may be better to press the escape key to get out. It can happen because parentheses do not match, or because of other syntax errors.
pi
exp(1)
print(c(pi,exp(1))) #prints to screen values of pi and e symbols

Note that exp is a function in R and exp(1) means e raised to the power 1. Note also that the 'c' function of R defines a catalog or list of two or more values. R does not understand a mere list of things without the c command. The print command of R needs the 'c' above because we want to print more than one thing from a list.

> pi
[1] 3.141593
> exp(1)
[1] 2.718282
> print(c(pi,exp(1))) #prints to screen values of pi and e symbols
[1] 3.141593 2.718282

Thus the transcendental numbers e and pi are already defined in R as exp(1) and pi.

x=123*(10^(-9)) # multiplication is with * and
#raise to power is with the ^ symbol in R
x=123*(10^(-9)); x #semicolon allows two commands on the same line

Now the output of the above commands is:

> x=123*(10^(-9)); x
[1] 1.23e-07

Printing x as 1.23e-07 is scientific notation. If you do not want that, use 'format' instead of print, with the option scientific=FALSE as below:

format(x, scientific=FALSE) # print it as "0.000000123"
#this avoids the scientific notation
x #without the option, it prints 1.23e-07 or scientific notation

Note that 'format' here means print. Plain x will print x in scientific notation (the default).

> format(x, scientific=FALSE) # print it as "0.000000123"
[1] "0.000000123"
> x #without the option, it prints 1.23e-07 or scientific notation.
[1] 1.23e-07

2 Measures of Central Tendency: mean, median, mode

x=c(2,4,0,12,7,2,7,2);x
length(x)# how many items in x?
mean(x) # Calculates the mean of elements in vector x.
median(x) # Calculates the median of x elements.

> x=c(2,4,0,12,7,2,7,2);x
[1] 2 4 0 12 7 2 7 2
> length(x)# how many items in x?
[1] 8
> mean(x) # Calculates the mean of elements in vector x.
[1] 4.5
> median(x) # Calculates the median of x elements.
[1] 3

Sorting and Tabulation. Now we consider R functions for sorting and tabulation (mode computation).

sort(x)# orders from the smallest to the largest
#useful for median, etc
table(x) # Calculates the number of repetitions of x values.
#Use in finding the mode of x.

The output of the sort and table commands is:

> sort(x)
[1] 0 2 2 2 4 7 7 12
> table(x)
x
 0  2  4  7 12
 1  3  1  2  1

Now we consider R functions for percentile calculation. Note that the median is the 50th percentile, the first quartile is the 25th percentile and the third quartile is the 75th percentile.

3 Additional Descriptive Statistics

Percentiles are designated as quantiles in the statistical literature. The R function quantile takes its arguments not in percent terms but as proportions. Thus the median proportion is 0.5, meaning that 0.50 of the data lie below the median of the data. Since statisticians do not agree on which method of computing quantiles is best, R provides the option to use any one of nine methods. The default, type=7, is a bit too sophisticated for Hawkes Learning, since the latter uses hand calculations.

#computes 5%, 45% and 95% percentiles.
x=c(2,4,0,12,7,2,7,2);x
quantile(x, probs=c(0.05, 0.45, 0.95), type=1)
quantile(x, probs=c(0.05, 0.45, 0.95), type=6)

Its output is

> x=c(2,4,0,12,7,2,7,2);x
[1] 2 4 0 12 7 2 7 2
> quantile(x, probs=c(0.05, 0.45, 0.95), type=1)
5% 45% 95%
 0   2  12
> quantile(x, probs=c(0.05, 0.45, 0.95), type=6)
 5% 45% 95%
0.0 2.1 12.0

In general the spread of the data is of interest. It is measured by the overall range. Measurement of the volatility of stock returns is an important indicator of the risk associated with that investment. Besides the range, deviations from the mean provide information regarding the spread of the data with respect to its own mean. However, since the deviations from the mean always sum to zero, their sum is useless for distinguishing between different data sets. Hence we can compare the mean of absolute deviations (MAD) from the mean.
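Base R has no built-in function for the statistical mode; a small sketch (the helper logic is ours) picks the most repeated value out of the table() output:

```r
x = c(2, 4, 0, 12, 7, 2, 7, 2)
tab = table(x)                                  # counts of each distinct value
mode.x = as.numeric(names(tab)[which.max(tab)]) # value with the largest count
mode.x                                          # 2, which occurs three times
```

Note that which.max returns only the first maximum, so for multimodal data this sketch reports just one mode.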
Statisticians prefer the variance and standard deviation (sd) of the elements of vector x over MAD since these have convenient mathematical properties (e.g., the derivative is easy to compute).

n=length(x);n #count how many items in x
sqrt(16)#should be 4; the square root function is sqrt
max(x)-min(x) #defines the range
dev=x-mean(x);dev#vector of deviations from the mean of x
sum(dev)#must be zero
sum(dev^2)/(n-1)# sample variance definition
var(x) # Calculates the sample variance of x.
#standard deviation is square root of variance
sqrt(var(x))
sd(x) # direct calculation of sample standard deviation of x
sum(dev^2)/n# population variance definition
#indirect calculation of population variance from var(x)
popvar=var(x)*(n-1)/n;popvar
sqrt(popvar) #computes the square root of population variance
popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd

Output of the above code is next.

> sum(dev^2)/(n-1)# sample variance definition
[1] 15.42857
> var(x) # Calculates the sample variance of x.
[1] 15.42857
> #standard deviation is square root of variance
> sqrt(var(x))
[1] 3.927922
> sd(x) # direct calculation of sample standard deviation of x
[1] 3.927922
> sum(dev^2)/n# population variance definition
[1] 13.5
> #indirect calculation of population variance from var(x)
> popvar=var(x)*(n-1)/n;popvar
[1] 13.5
> sqrt(popvar) #computes the square root of population variance
[1] 3.674235
> popsd=sd(x)*sqrt(n-1)/sqrt(n);popsd
[1] 3.674235

factorial(n) # Calculates the factorial of integer n.
fivenum(x) # Calculates the five number summary of x.
summary(x) #prints six number summary Min, Q1, median, mean, Q3, Max
choose(x,y) # Calculates the combination xCy: ways of choosing y out of x
plot(x) # Plots x on a linear graph.
length(x) #counts the number of items in x e.g.
#sample size
cumsum(x)#computes cumulative sum e.g., cumulative frequency
cbind(x,y,z) #binds column vectors of identical length into a matrix (table)
rbind(x,y,w)#binds row vectors of identical length into a matrix
plot(x,y, typ="l") #plots x against y as a line; plot type is lower case el
sum(x*y) #computes the sum of products of corresponding values of x and y vectors
set.seed(123) #sets the seed of the random number generator
boxplot(x) #creates a box plot for a single variable
boxplot(x,y)#creates box plots of x and y for side-by-side comparison
hist(x,breaks=5)# creates a histogram with 5 boxes

Brackets are used in R for subsetting and parentheses are used for listing the arguments of various functions. Most of the items above are functions. It is fun to use brackets for subsetting, but do not use brackets to give the arguments of functions. If x is a matrix, typing x[3:5] will select the subset of items at locations 3 to 5 inclusive. x[-2,-2] will select the entire matrix x except for the second row and second column; the minus means exclude that row or column. x[-3, c(-1,-3)] means exclude row 3 and columns 1 and 3.

Numerical integration is done in R as follows. We integrate the N(0,1) density to give the 0.95 answer in the code below. More sophisticated integrations are also available in various packages. Any sequence is created easily by the seq function shown below.

integrate(dnorm, -1.96,1.96)
sample(1:25) #prints a random sample of first 25 numbers.
seq(1,9,by=2)# creates 1,3,5,7,9.
seq(from=1,to=9,by=2)

The 'read.table', 'read.DIF', etc. commands are for reading data, but they can be hard to use. It may be just as good to copy the numbers from an MS Word file or text file and read them with x=c(.., ..,)

The rounding in R by the R command round is too sophisticated for Hawkes Learning, which uses the biased method we learned in high school. For example, the R command round(c(-0.5,0.5,1.5,2.5)) rounds to the nearest even number as (0, 0, 2, 2) to avoid bias.
This is different from the rounding we learned in high school, which would give (−1, 1, 2, 3).

4 Probability Distributions

4.1 Uniform Distribution

How do we create random numbers from the uniform density? In R 'unif' means uniform, and the prefix d means density, p means cumulative probability, q means quantile, and r means random numbers from that density. Thus,

plot(dunif) #range is 0 to 1 as default
x=runif(10)#creates 10 uniform random numbers in x
x #print x
punif(1)#area under uniform between 0 to 1
punif(0.5)#area 0 to 0.5
qunif(0.5)# given area=0.5, the quantile of uniform

> x=runif(10)#creates 10 uniform random numbers in x
> x #print x
[1] 0.22820188 0.01532989 0.12898156 0.09338193 0.23688501 0.79114741
[7] 0.59973157 0.91014771 0.56042455 0.75570477
> punif(1)#area under uniform between 0 to 1
[1] 1
> punif(0.5)#area 0 to 0.5
[1] 0.5
> qunif(0.5)# given area=0.5, the quantile of uniform

4.2 Binomial Distribution

Just like the uniform, the code name for the Binomial is 'binom' and the same prefixes (d,p,q,r) mean the same thing as they did for the uniform density. The Binomial probability is for exactly x successes when the probability of one success in one trial is p and the number of trials is n. Note that n also equals the largest number of successes.

p=0.5; n=3; x=0:n
db=dbinom(x,prob=p,size=n);db
names(db)=x #show labels on x axis
barplot(db, xlab="x")

The Binomial probabilities (1/8, 3/8, 3/8, 1/8) are correctly produced by dbinom. The graphical output is omitted for brevity.

> db=dbinom(x,prob=p,size=n);db
[1] 0.125 0.375 0.375 0.125

4.3 Poisson Distribution

Again, the code name for the Poisson is 'pois' and the same prefixes (d,p,q,r) mean the same thing as they did for the uniform density.

lambda=1; x=0:5
dp=dpois(x,lambda);dp
names(dp)=x
barplot(dp, xlab="x")

The Poisson probabilities exp(−λ)λ^x/x! are correctly calculated by dpois.
> lambda=1; x=0:5
> dp=dpois(x,lambda);dp
[1] 0.367879441 0.367879441 0.183939721 0.061313240 0.015328310 0.003065662

The plot is omitted for brevity.

4.4 Hypergeometric Distribution

The code name for the Hypergeometric is 'hyper' and the same prefixes (d,p,q,r) mean the same thing as they did for the uniform density. Unfortunately the notation used in R is less suitable for certain types of word problems. It is easy to write one's own function to compute these probabilities, applicable when Binomial-type trials are dependent (e.g., pulling marbles without replacement).

P(x) = kCx · (N−k)C(n−x) / NCn, where the range of the discrete random variable is x = 0, 1, . . . , min(k, n).

N=10;n=4;k=3;minnk=min(n,k)
c(N,n,k,minnk)#check the values
x=0:minnk
px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n);px

Although the plot is omitted for brevity, one can check the computations and also that the probabilities add up to unity, ΣP(x) = 1.

> N=10;n=4;k=3;minnk=min(n,k)
> c(N,n,k,minnk)
[1] 10  4  3  3
> x=0:minnk
> px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
> px
[1] 0.16666667 0.50000000 0.30000000 0.03333333
> names(px)=x;barplot(px)
> sum(px)
[1] 1

4.5 Normal Density

Note that the standard Normal variable z ∼ N(0, 1) has mean zero and variance 1. z is defined over the range (−∞, ∞); however, the practical range is [−4, 4]. As above, the code name for the Normal is 'norm' and the same prefixes (d,p,q,r) mean the same thing as they did for the uniform density.

z=seq(-4,4,by=.5)
dn=dnorm(z)
plot(z,dn,typ="l",main="Normal Density")

The output of the above code is in the attached figure. The option typ="l" has the letter el, not the number 1, and requests a line plot. The option main labels the graph.
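The claim that [−4, 4] is the practical range can be verified numerically with the p prefix (cumulative probability) for the Normal. A quick sketch (the helper name inside is mine, not a base R function):

```r
# Sketch: probability mass of N(0,1) inside symmetric intervals [-a, a],
# computed from the cumulative probability function pnorm.
inside = function(a) pnorm(a) - pnorm(-a) # P(-a < z < a)
inside(c(1, 2, 3)) # about 0.683, 0.954, 0.997 (the empirical rule)
inside(4)          # about 0.99994, so [-4, 4] captures nearly all the area
```

This also reproduces the familiar empirical rule for one, two and three standard deviations.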
set.seed(25)
y=rnorm(10)#creates 10 standard normal random numbers N(0,1)
y

> set.seed(25)
> y=rnorm(10)#creates 10 standard normal random numbers N(0,1)
> y
[1] -0.21183360 -1.04159113 -1.15330756 0.32153150 -1.50012988 -0.44553326
[7] 1.73404543 0.51129562 0.09964504 -0.05789111

The cumulative probability of the standard Normal density (z) from z = −∞ to z = 1 is given by pnorm. Given the area from z = −∞ to z as, say, 0.5, qnorm gives the value of z. This is illustrated next.

pnorm(1) #area under N(0,1) always from minus infinity
qnorm(0.5)#gives the quantile z from cumulative probability

The output of the above code is next.

Figure 1: Normal density

> pnorm(1) #area under N(0,1) always from minus infinity
[1] 0.8413447
> qnorm(0.5)#gives the quantile z from cumulative probability
[1] 0

4.6 Student's t density and degrees of freedom

Again, the code name for the t-density is 't' and the same prefixes (d,p,q,r) mean the same thing as they did for the uniform density. For example, if the degrees of freedom are specified to be 2 (df=2), the cumulative probability of Student's t from t = −∞ to t = 1 is given by the code:

pt(1, df=2)#gives the cumulative probability of t
qt(0.05,df=2)# similar to qnorm, gives the quantile of t

Note that the function 'qt' needs to know the cumulative probability in the left tail and yields the value of t at which this cumulative probability holds. The key point to remember is that once we have R, we do not need standard statistical tables for the various densities. One can compute all needed probabilities on the fly, as needed, without looking up any tables.

5 Writing your own functions in R

This section may be skipped by readers in the early stages of learning R. There are literally thousands of functions written by thousands of programmers around the world and freely available in R packages. Here is a function to compute the permutations of n things taken k at a time. The formula is nPk = n!/(n − k)!.
Let the function name be permute. Note that the word 'function' must be present and that it represents an algorithm needing some inputs (arguments) and returning some output, denoted here as 'out'. A function can be a very long command extending over many lines. Any long command in R can be entered on several lines simply by using curly braces; thus curly braces have a special meaning in R. The arguments of functions are always in simple parentheses (). We have already mentioned that brackets [] are used for subsetting.

permute=function(n,k){
out=factorial(n)/factorial(n-k)
return(out)
}
#example
permute(4,2)

When you write your own function, it is important to check it against a known answer. For example, we know 4P2 is 12, and it checks out.

> permute=function(n,k){
+ out=factorial(n)/factorial(n-k)
+ return(out)
+ }
> #example
> permute(4,2)
[1] 12

Now we illustrate a function to compute an alternative version of the hypergeometric distribution as follows.

myhyper=function(N,n,k){
x=0:min(n,k)
px=choose(k,x)*choose((N-k),(n-x)) /choose(N,n)
return(px)}
#example
myhyper(N=10,n=4,k=3)

When you write your own function, it is important to check it against a known answer, given above for N=10, n=4, k=3. Abridged output is:

> myhyper(N=10,n=4,k=3)
[1] 0.16666667 0.50000000 0.30000000 0.03333333

6 Final Remarks

We show that R is far more convenient than a calculator, allowing us to give names to our calculations and implement vast sets of calculations without the tedium. We can also avoid the use of most probability distribution tables. Some useful references are Kerns (2013), Vinod (2008), Verzani (2009), and Kleiber and Zeileis (2008), among others.

References

Kerns, G. J. (2013), IPSUR: Introduction to Probability and Statistics Using R, R package version 1.5, URL https://CRAN.R-project.org/package=IPSUR.

Kleiber, C. and Zeileis, A. (2008), Applied Econometrics with R, New York: Springer-Verlag, ISBN 978-0-387-77316-2, URL http://CRAN.R-project.org/package=AER.
Verzani, J. (2009), UsingR: Data sets for the text "Using R for Introductory Statistics", R package version 0.1-12, URL https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.

Vinod, H. D. (2008), Hands-on Intermediate Econometrics Using R: Templates for Extending Dozens of Practical Examples, Hackensack, NJ: World Scientific, ISBN 10-981-281-885-5, URL http://www.worldscibooks.com/economics/6895.html.

User manuals: The most official Introduction to R is at: http://colinfay.me/intro-to-r/

This is the guide to base R for programmers new to the language. It covers the basic syntax and data types, base graphics, and the built-in statistical modeling functions. (I have a special fondness for this one.)

The latest version of this file is available at the url: http://www.fordham.edu/economics/vinod/R-commands-1.pdf
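As an application of the tools above, the AirPassengers exercise posed at the outset can be sketched as follows. This is one possible approach, not the only one, and it assumes the hinted values wid=75 and LLFCI=100; the object names brk, freq, mids and polyfreq are mine:

```r
# Sketch: class-interval frequency table, histogram and frequency polygon
# for the monthly AirPassengers data shipped with R.
AP = as.numeric(AirPassengers)   # the monthly data, as a plain numeric vector
n = length(AP); n                # how many data items are there?

wid = 75; LLFCI = 100            # suggested width and lower limit of first class
brk = seq(from = LLFCI, by = wid, length.out = 8)  # 8 break points = 7 intervals
freq = table(cut(AP, breaks = brk, right = FALSE)) # class-interval frequency table
freq

hist(AP, breaks = brk, right = FALSE, main = "Air Passengers")  # histogram

# Frequency polygon: frequencies plotted at class midpoints, after adding
# one dummy interval with zero frequency at each end.
mids = seq(from = LLFCI - wid/2, by = wid, length.out = 9)  # includes 2 dummies
polyfreq = c(0, as.numeric(freq), 0)
plot(mids, polyfreq, type = "l", xlab = "passengers", ylab = "frequency")
```

Since the data range from 104 to 622, the seven intervals [100, 175), ..., [550, 625) cover every observation, and the two zero-frequency dummy classes bring the polygon down to the horizontal axis at both ends.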
