Harvard Stats Homework Using R Lab 5 Help
Homework - Lab 5:
# The central limit theorem (CLT) states that as the sample size gets sufficiently large,
# the distribution of the sample means will be approximately normally distributed.
# In addition, the CLT has been used to justify the fact that for many of our statistics
# we rely upon computing the mean (not the median or trimmed mean) of our samples.
# There are a few problems with the CLT:
# 1) How large a sample is needed?
# 2) Our experiments with the contaminated normal seem to contradict this.
# In this homework assignment you will investigate the CLT further.

# PART 1 - The Central Limit Theorem under Normality
# 1.1) Simulate a standard normal population of 1 million people called pop1.
# 1.2) Draw 5000 samples of size 20 and put these in sam20.
#      Draw 5000 samples of size 50 and put these in sam50.
# 1.3) Create variables called sam20means and sam50means that contain the means of the samples.
#      Use a density plot to show the sampling distributions of sam20means and sam50means together.
# 1.4) Compare the Standard Error (SE) of the sampling distributions.
#      Which sample size creates better estimates of the population mean (i.e. has the lowest SE)?

# PART 2 - The Central Limit Theorem under Non-Normality
# 2.1) Simulate a contaminated normal population of 1 million people called pop2 using cnorm(),
#      where 30% (epsilon=0.3) of the data have an SD of 30 (k=30).
# 2.2) Draw 5000 samples of size 30 and put these in sam30.
#      Draw 5000 samples of size 100 and put these in sam100.
# 2.3) Create variables called sam30means, sam30tmeans, sam100means, and sam100tmeans
#      that represent the means AND trimmed means for the samples.
# 2.4) Use a density plot to show the sampling distribution of the means and trimmed means
#      for these variables.
# 2.5) Compare the Standard Error (SE) of the sampling distributions.
# 2.6) Which would be better here: a larger sample size using the mean as the location estimator
#      OR a smaller sample using the trimmed mean?
# 2.7) Which location estimator performs the best, regardless of sample size?
---------------------------------------------------------------------------------
Lab 5 lecture notes:
# Lab 5 - Contents
#  1. Sampling Distribution of the Mean,
#     Median, and Trimmed Mean under Normality
#  2. Sampling Distribution of the Mean,
#     Median, and Trimmed Mean under Non-Normality
#  3. The Central Limit Theorem

# Last week we saw that when we had a Normal or Uniform population,
# the means of random samples taken from that population
# were approximately normally distributed.
# Today we are going to investigate the distributions of the mean,
# median, and trimmed mean from samples coming from normal
# and non-normal populations.

#---------------------------------------------------------------------------------
# 1. Sampling Distribution of the Mean, Median,
#    and Trimmed Mean under Normality
#---------------------------------------------------------------------------------

# Let's start by generating a standard normal distribution (mean=0, SD=1) for 1 million subjects
pop1 = rnorm(1000000, mean=0, sd=1)
# We will use this as our population from a normal distribution

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-1:
# A) Find the mean, median, trimmed mean (using tmean()), and sd of pop1
# B) Draw a density plot of pop1
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# A)
mean(pop1); median(pop1); tmean(pop1); sd(pop1)

# B)
plot(density(pop1))

# Like we did last week, we are going to take random samples
# from our population, compute a measure of central tendency
# (e.g. mean, median, trimmed mean) for each sample, and examine
# the distribution of this measure.
# We are going to take 5000 samples of 20 subjects.
# Start with an empty (NA-filled) matrix: one column per sample, one row per subject.
sam1 = matrix(NA, ncol=5000, nrow=20)

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-2: Use a loop to draw 5000 samples of size 20 from pop1
# and place the samples in sam1
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
for (ii in 1:5000) {
  sam1[ , ii] = sample(pop1, 20, replace=TRUE)
}

# Now that we have our data containing all 5000 samples (i.e. sam1),
# we can begin to create variables for each of our location measures.
# I'll start us off with the mean
sam1means = apply(sam1, 2, mean)  # the 2 means: work over the columns rather than the rows

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-3: Use the apply function to generate
# the variables sam1meds (medians) and sam1tmeans (trimmed means)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
sam1meds = apply(sam1, 2, median)
sam1tmeans = apply(sam1, 2, tmean)

# Let's look at the distributions of each of these location estimators
plot(density(sam1means))
lines(density(sam1meds), col="red")
lines(density(sam1tmeans), col="blue")
abline(v = mean(pop1), lty=2)  # Add in a line for the pop1 mean

#??????????????????????????????????????????????????????????????#
# Thought Question 1: Which location estimator performs the best
# for data coming from a normal population? Why?
#??????????????????????????????????????????????????????????????#

# One of the ways we can determine which location estimator
# performs the best is by looking at the standard deviation
# of the estimator across all the samples.
# The estimator with the lowest SD will have the least amount
# of variability across the samples.
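The comparison just described can also be sketched without a loop. The block below is a minimal, self-contained illustration under stated assumptions: it uses a smaller stand-in population (`pop`, not the lab's `pop1`), fewer samples, and base R's `mean(x, trim = 0.2)` as a 20% trimmed mean instead of `tmean()`. The names `pop`, `sam`, `est`, and `spread` are hypothetical.

```r
# Minimal sketch; names and sizes are assumptions, not the lab's exact setup.
set.seed(1)                                 # arbitrary seed for reproducibility
pop <- rnorm(100000)                        # smaller stand-in for pop1

# 2000 samples of size 20; replicate() returns one sample per column
sam <- replicate(2000, sample(pop, 20, replace = TRUE))

# One location estimate per sample for each estimator
est <- list(
  mean   = apply(sam, 2, mean),
  median = apply(sam, 2, median),
  tmean  = apply(sam, 2, mean, trim = 0.2)  # base-R 20% trimmed mean
)

# SD of each estimator across the samples: lower means less variability
spread <- sapply(est, sd)
print(round(spread, 3))
```

Using `replicate()` avoids pre-allocating the matrix, at the cost of hiding the per-sample step that the loop in Exercise 1-2 makes explicit.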
# A more common name for the standard deviation of a location
# estimator is the Standard Error, or SE.

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-4: Find the Standard Error of the sample means,
# medians, and trimmed means. Based upon the SE, which
# location estimator is the best for samples coming from
# a normal population?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
sd(sam1means); sd(sam1meds); sd(sam1tmeans)
# The mean performs the best.

# In real life, we generally cannot go out and collect multiple samples
# from a population, so we compute the Standard Error of the mean using a formula:
#   SE = sd(sample) / sqrt(sample N)

#---------------------------------------------------------------------------------
# 2. Sampling Distribution of the Mean, Median,
#    and Trimmed Mean under Non-Normality
#---------------------------------------------------------------------------------

# Normal distributions generally have very few outliers.
# However, when outliers begin to occur more frequently, some of the
# basic assumptions about normal distributions are no longer true
# (as we are about to see).
# One distribution that is like a normal distribution
# but with more outliers is called a mixed or contaminated
# normal distribution, and it is the result of two populations mixing together.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXAMPLE 1: "a" will be a mix of TWO populations, 1: with SD=1 and 2: with SD=2
a = c(rnorm(5000, 0, 1), rnorm(5000, 0, 2))
# Let's compare this to b, which is from ONE population but with the same mean and SD as a
b = rnorm(10000, mean(a), sd(a))
plot(density(a))
lines(density(b), col="red")
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

#??????????????????????????????????????????????????????????????#
# Thought Question 2: How are a and b from Example 1 different?
#??????????????????????????????????????????????????????????????#

# Thankfully, rather than having to create contaminated normal distributions
# the hard way, we can just use a function provided to us by Dr. Wilcox called cnorm()

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
# Contaminated/Mixed Normal Distribution: cnorm(n, epsilon=0.1, k=10)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#

# Let's look at the options for the contaminated normal distribution.
# cnorm() combines two normal distributions:
# 1) A standard normal (mean=0, sd=1) for a proportion 1-epsilon of the data
# 2) A normal with mean=0 and sd=k for a proportion epsilon of the data

# If we were trying to re-create the variable a we made in Example 1 we would do:
z = cnorm(10000, epsilon=0.5, k=2)
plot(density(a))
lines(density(z), col="blue")
# Which looks very similar to a!
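To make the description of cnorm() concrete, here is a minimal base-R sketch of a function that mixes the two normals as described above. It is a hypothetical stand-in: the name `cnorm_sketch` and its body are assumptions written from the description in these notes, not Dr. Wilcox's actual code. The check at the end compares the simulated SD with the SD implied by the mixture, sqrt((1 - epsilon) * 1 + epsilon * k^2).

```r
# Hypothetical stand-in for cnorm(); an assumption based on the notes above,
# not Dr. Wilcox's actual implementation.
cnorm_sketch <- function(n, epsilon = 0.1, k = 10) {
  stopifnot(epsilon >= 0, epsilon <= 1, k > 0)
  x <- rnorm(n)                            # start with standard normal draws
  contaminated <- runif(n) < epsilon       # flag roughly a proportion epsilon
  x[contaminated] <- k * x[contaminated]   # rescale flagged draws to sd = k
  x
}

set.seed(2)                                # arbitrary seed for reproducibility
z2 <- cnorm_sketch(100000, epsilon = 0.5, k = 2)

# Both components have mean 0, so the mixture variance is the weighted
# average of the component variances.
sd_theory <- sqrt(0.5 * 1 + 0.5 * 2^2)
print(c(simulated = sd(z2), theory = sd_theory))
```

Unlike Example 1, which takes exactly 5000 draws from each population, this sketch assigns each draw to the wide component at random, so the split is only approximately epsilon in any one run.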
# Let's create a second population called pop2 from a contaminated normal distribution
pop2 = cnorm(1000000, epsilon=0.1, k=10)
# The mean, sd, and plot of which are:
mean(pop2); sd(pop2); plot(density(pop2))

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 2:
# A) Create an empty matrix called sam2 to contain 5000 samples
#    of 20 observations each
# B) Populate sam2 with 5000 random samples of size 20 from pop2
# C) Compute the mean (sam2means), median (sam2meds),
#    and trimmed mean (sam2tmeans) for each sample
# D) Create an overlaid density plot of each estimator WITH the pop2
#    mean as a vertical line
# E) Find the SE of each location estimator
# F) Based upon the SE, which location estimator is the best
#    for samples coming from a contaminated normal distribution?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#

# A)

# B)

# C)

# D)

# E)

# F)

#---------------------------------------------------------------------------------
# 3. The Central Limit Theorem
#---------------------------------------------------------------------------------

# We've discovered a few things today:
# 1) When a population comes from a normal distribution,
#    the mean will be the best location estimator of the samples
# 2) When a population comes from a mixed/contaminated normal distribution,
#    the trimmed mean is the best location estimator

# These observations are related to the Central Limit Theorem (CLT)
# that is discussed in Section 5.3 of the book (page 85).
# The CLT states that as the sample size gets sufficiently large,
# the distribution of the sample means will be approximately normally distributed.
# We saw a demonstration of this last week when we looked at the means
# from the uniform distribution.
# The CLT has been used to justify the fact that for many of our statistics
# we rely upon computing the mean (not the median or trimmed mean) of our samples.

# There are a few problems with the CLT:
# 1) How large a sample do we need?
# 2) Our experiments with the contaminated normal seem to contradict this.
# In the homework you will investigate this further.
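As a starting point for the sample-size question, the sketch below compares the empirical SE of the mean (the SD of many sample means) with the formula SE = sd / sqrt(N) at a few sample sizes. It is an illustration under stated assumptions: the population is a uniform stand-in (`pop`), the sample sizes and the number of replications are arbitrary, and the formula uses the population SD where in practice sd(sample) would be substituted. The names `pop`, `sizes`, and `result` are hypothetical.

```r
# Minimal sketch; population, sizes, and replication count are assumptions.
set.seed(3)                                  # arbitrary seed for reproducibility
pop <- runif(100000)                         # uniform stand-in population
sizes <- c(5, 30, 100)                       # sample sizes to compare

result <- data.frame(
  n = sizes,
  # SD of 2000 sample means at each sample size
  empirical_se = sapply(sizes, function(n) {
    sd(replicate(2000, mean(sample(pop, n, replace = TRUE))))
  }),
  # Formula value, using the population SD in place of sd(sample)
  formula_se = sd(pop) / sqrt(sizes)
)
print(round(result, 4))
```

Swapping `pop` for a contaminated normal population and adding a trimmed-mean column is one way to extend the same comparison to the homework's second part.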