Linear Regression and Probability Distribution Lab Report


Introduction 1

One of the most important tools in a scientist's repertoire is linear regression. This is the technique by which a straight line is found that best describes a set of data we assume should follow a linear trend. In trying to find a trend that describes a data set, we say that we are fitting the data with a trend. If the assumption that the data are linear is valid, the linear regression process produces the best fit.

Introduction 2

One of the most famous and important distributions in all of statistical analysis is the Gaussian distribution, also known as the normal distribution, or the colloquial bell curve. This distribution typically occurs if several measurements that represent the same value are made. The more measurements that are made, the more closely the data should reflect the 'bell'.


Lab 1 – Probability Distributions

Introduction

One of the most famous and important distributions in all of statistical analysis is the Gaussian distribution, also known as the normal distribution, or the colloquial bell curve. This distribution typically occurs if several measurements that represent the same value are made. The more measurements that are made, the more closely the data should reflect the 'bell'.

For example, suppose several students are tasked with measuring the length of a piece of paper with a ruler whose smallest increment between tick marks is the millimeter. The piece of paper has an exact length; however, with any finite measuring device, at some point each student's measurement becomes an estimate. Most students will obtain measurements close to some value, let's call it xave. Of course, some measurements will be a little larger and some a little smaller, and the further away from this value we go, the fewer the measurements in that regime should be. So, we'll have many measurements close to xave, and fewer and fewer the more we deviate from it.

The bell curve should also be obtained for systems with specific probabilities assigned to each measurement, provided these probabilities are not all equal (if they are all equal, after a sufficiently large number of measurements each value is equally likely, and thus a rectangle of probability will emerge as opposed to a bell).

Figure 1. A normal distribution, with the average, µ, and standard deviation, σ. These quantities are defined below.

One such system is a pair of dice. If I have a fair pair of six-sided dice, the most probable value to roll is a seven, with a probability of 1/6. This is six times greater than rolling either a two or a twelve, both coming in at a probability of 1/36. For the dice, we can calculate the probabilities fairly easily. There are 36 ways to arrange the outcomes of two dice. They are summarized in the two tables below: the first gives the total for each combination of the two dice, and the second collects the number of combinations and the probability for each total.

Dice 1 \ Dice 2    1    2    3    4    5    6
       1           2    3    4    5    6    7
       2           3    4    5    6    7    8
       3           4    5    6    7    8    9
       4           5    6    7    8    9   10
       5           6    7    8    9   10   11
       6           7    8    9   10   11   12

Total                     2     3     4     5     6     7     8     9    10    11    12
Number of Combinations    1     2     3     4     5     6     5     4     3     2     1
Probability             1/36  1/18  1/12   1/9  5/36   1/6  5/36   1/9  1/12  1/18  1/36

For the true bell shape to become evident, we need a large number of measurements. In the course of this lab, you will likely not make a sufficient number for the true bell to become evident. However, we will aggregate the data from all students in the class, and I will then send out a final histogram. Even if you do not see the true bell curve realized with your data alone, you should be able to make connections between statistical data and the conclusions we can and cannot draw from it. In general, certain parameters of your data should change as more measurements are taken. For example, if we stick with just two dice, the more times you roll, the closer the average ought to approach the value of seven. Also, the more times you roll, the closer a parameter called the 'standard deviation' ought to approach a fixed value (we'll discuss this below).
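This convergence is easy to check numerically. The following short Python sketch (not part of the original handout; it assumes numpy is available) rolls two virtual fair dice and watches the average and sample standard deviation of the totals settle toward their theoretical values.

```python
import numpy as np

# Roll two fair six-sided dice N times and watch the average and sample
# standard deviation of the totals settle down as N grows.
rng = np.random.default_rng(seed=0)

for n in (36, 360, 3600, 36000):
    totals = rng.integers(1, 7, size=n) + rng.integers(1, 7, size=n)
    print(f"N = {n:>5}:  average = {totals.mean():.3f},  "
          f"std. dev. = {totals.std(ddof=1):.3f}")

# With enough rolls the average should approach 7 and the standard
# deviation should approach the theoretical value of about 2.415.
```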
The average and the standard deviation

We now seek to define µ and σ from above. The quantity µ is known as the average, and for discrete data it is calculated in the usual way:

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i ,

where N is the total number of measurements, x_i is each individual measurement, and the index i runs from 1 to N. So, what this equation says to do is take the first measurement, add it to the second, add the result to the third, then the fourth, and so on until all of the measurements have been added. This total sum is then divided by the total number of measurements, N.

The standard deviation is a way to measure the 'spread' of the data. For normal distributions, the standard deviation represents the distance on both sides of the average within which 68.27% of all of the data are encompassed. This would be the area between the two red lines in Figure 1. Within two standard deviations, 95.45% of the data are encompassed. For discrete data, the standard deviation is given by the formula

\sigma = \sqrt{ \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)^2 } .

Excel will be able to handle this calculation, so you don't have to do it by hand.
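As an aside beyond the handout, both formulas are straightforward to reproduce outside of Excel. The Python sketch below (numpy assumed; the dice totals are made-up illustrative values) computes the same two numbers Excel's AVERAGE and STDEV.S functions would return.

```python
import numpy as np

# Ten hypothetical two-dice totals, for illustration only.
x = np.array([8, 5, 4, 7, 4, 10, 5, 7, 10, 8])
N = len(x)

mu = x.sum() / N                                  # the average, mu
sigma = np.sqrt(((x - mu) ** 2).sum() / (N - 1))  # the standard deviation, sigma

print(mu, sigma)  # 6.8 and about 2.251
# Excel's AVERAGE(...) and STDEV.S(...) return the same two numbers,
# as do np.mean(x) and np.std(x, ddof=1).
```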
Below are four sample distributions from a computer simulation of two dice being thrown. The number of rolls is 36 for the first figure, 360 for the second, 3600 for the third, and 36000 for the fourth. Note that there is little-to-no bell shape in the first plot, and only by the 36000-roll mark does the true bell start to become apparent. Note also that the theoretical average for rolls of two dice is 7 and the standard deviation is 2.415.

Figure 2. Sample distribution after 36 rolls.
Figure 3. Sample distribution after 360 rolls.
Figure 4. Sample distribution after 3600 rolls.
Figure 5. Sample distribution after 36000 rolls.

Procedure - Part 1: Processes with Differing Probability

Obtain two dice. These may be physical or virtual dice. If you opt for the virtual option, please use the following site: https://www.random.org/dice/.

You will perform three trials. In the first trial, roll your dice 10 times. Record the result of each roll in a column in Excel labeled 'N = 10'. Next, label a second column 'N = 30', and roll the dice 30 times, recording the result each time. Finally, repeat once again with a column labeled 'N = 60'. Roll the dice 60 times and record the results for each roll.

STOP: Make a prediction about what you suspect the general trend, if any, of your trials should be in regard to the average and the standard deviation.

Build a histogram for each set. If you're unsure about how to do this, there will be a tutorial video posted in Blackboard. Also, for each distribution, calculate the average and standard deviation using the formulas in Excel. Be sure to attach your histograms to your report.

Procedure - Part 2: Random Processes

Another system that generates a bell curve is a set of measurements in which a 'target' value is being measured. No measurement is perfect, and even the most careful researcher cannot correct or account for all of these imperfections. The best an experimentalist can hope for is to correct for as many of the errors caused by biases in equipment (systematic errors) as possible and to minimize the random errors with careful technique. However, we can never eliminate all uncertainty.

In this exercise, you will be given a situation in which making a completely accurate measurement is quite unlikely over several trials. What you are to do is stop a stopwatch as close to the 2.00 second mark as possible. Do this for 15 trials, and record the results in a column in Excel. Use Excel to compute the average and the standard deviation. Record your measurements in the shared Google Sheets. We will keep a running tally of the entire class.

Analysis and Discussion Questions

1. What was your prediction about the behavior of the averages for your data? What about the standard deviation?
2. Does your data support or reject those predictions? Or are your results uncertain?
3. The average and standard deviation are statistically two of the most used quantities to describe a set of data. Do they always tell the whole story?
4. Suppose you are playing Monopoly®. You know that the most commonly rolled number is a 7. But you notice that 13 rolls go by without a 7 being rolled. Should this surprise you? Does this indicate the dice may not be 'fair'?
5. Along with dice, do some research and find one example of a naturally occurring normal distribution. This can come from the hard sciences, social sciences, etc. An example would be shoe sizes of adults above the age of (so don't use shoe sizes as your example). List your example, and, if possible, see if its average and standard deviation are listed. Include a graph of the distribution if one is readily available.
6. From Part 2, what was the average, and what was the standard deviation?
7. If the standard deviation represents a reasonable estimate of the uncertainty, does the target of 2.00 fall within the bounds of your uncertainty? If so, you can claim your measurement was 'accurate'; if not, can you give any explanation as to why it might not be?

Lab 2 - Linear Regression

Introduction

One of the most important tools in a scientist's repertoire is linear regression. This is the technique by which a straight line is found that best describes a set of data we assume should follow a linear trend. In trying to find a trend that describes a data set, we say that we are fitting the data with a trend. If the assumption that the data are linear is valid, the linear regression process produces the best fit.

There are multiple important parameters associated with determining the best fit and with evaluating its effectiveness. One of the most important is the square of the correlation coefficient. If you're familiar with trendlines in Excel, then you may have seen or used the R² value. This is basically a mathematical attempt to quantify how good the fit is at describing the data. However, as we'll see, obtaining a best fit and achieving a high R² value alone may not be enough to gauge how useful the fit is.

The linear model is one in which the data are assumed to follow a trend of the form

y = mx + b,

where y is the dependent variable, m is the slope, x is the independent variable, and b is the y-intercept, sometimes called the vertical offset.

So how does the method work? If you make a plot in Excel and fit the data with a linear trendline, how does Excel choose the appropriate equation? It starts with analyzing what are called the residuals. Suppose the following data set is plotted and a trendline drawn through it. The residuals are the distances from each data point to the trend.

Figure 1. Data with a trendline drawn and the residuals shown.

One might be tempted to say that if we can find the trend for which the sum of the residuals is minimized, then this would be the correct trend. However, if we consider the quantity data − fit, where we take the y-value of each data point and subtract the fit, then for any reasonable fit this sum will be near zero from the start. Notice that some data are above the fit and some below, so just taking data − fit as a metric will yield some positive numbers and some negative. As a result, adding them up will give values close to zero by default. Another thought would be: instead of just the residuals, what if we took the absolute value? What if we found a fit for which the sum of the absolute values of the residuals was minimized? This would be better, but absolute values can be cumbersome to work with mathematically. So, what we do is square the residuals, making all quantities positive (so the expression has the form (data − fit)²), and find a fit that minimizes the sum of these squares. Fortunately, it is possible to derive a closed-form algorithm for this. We won't work through the derivation; all you need to know is that it is the result of this algorithm that Excel is using when you ask it to fit a data set with a linear trendline.
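The handout leaves the derivation out, but the closed-form result is compact enough to show. Here is a minimal Python sketch (not part of the original handout; numpy assumed, and Data Set 2 from the procedure below is used as input) of the recipe just described: choose the slope and intercept that minimize the sum of the squared residuals.

```python
import numpy as np

def linear_fit(x, y):
    """Slope m and intercept b minimizing sum((y - (m*x + b))**2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x.mean(), y.mean()
    m = ((x - xm) * (y - ym)).sum() / ((x - xm) ** 2).sum()
    b = ym - m * xm
    return m, b

# Data Set 2 from the procedure below:
x = [1, 3, 6, 7, 10, 11, 15]
y = [1, 4, 5, 9, 8, 13, 10]
m, b = linear_fit(x, y)
print(m, b)  # roughly 0.719 and 1.700 -- what Excel's trendline reports
```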
Procedure

Part 1 - PhET Simulation

Navigate to this website: https://phet.colorado.edu/en/simulation/least-squares-regression and click the play button. Once the simulation opens, play around with the buttons and controls, and try adding data to the plot. To add data to the plot, click on the bowl of dots in the lower left, and drag them onto the graph. Play with the sliders on the right side of the screen, labeled a and b. These represent the slope and intercept, respectively. Randomly throw some data on the screen and adjust the a and b sliders to see how the fit responds.

Once you are comfortable with the controls, plot the sets of data found below. For each set of data, perform the following procedure:

1) Be sure the Best-Fit Line box in the upper left is NOT checked. You may need to click on the plus sign to expand the menu.
2) Plot the data by dragging the points from the bowl onto the plot, and then adjust the a and b parameters until you believe you have achieved the best fit.
3) Record your values for a and b below in the appropriate spaces.
4) Now go up to the Best-Fit Line box and check Best-Fit Line. This will cause the true best fit to appear. Record the values of the slope and intercept from the best fit. Also record the correlation coefficient (r).
5) Somehow, whether with the Print Screen (Prt Sc) button on your computer and a program like Paint, or by taking a picture with your phone, record a picture of your plot with both the My Line and Best-Fit Line trends on it. Include these shots in your final report.

Data Set 1: (1,2), (2,4), (3,6), (4,8), (5,10), (6,12), (7,14)
My Line: a: __________ b: __________
Best-Fit: a: __________ b: __________ r: __________

Now uncheck Best-Fit while you enter the new data set.

Data Set 2: (1,1), (3,4), (6,5), (7,9), (10,8), (11,13), (15,10)
My Line: a: __________ b: __________
Best-Fit: a: __________ b: __________ r: __________

Now uncheck Best-Fit while you enter the new data set.

Data Set 3: (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (12,10), (15,15)
My Line: a: __________ b: __________
Best-Fit: a: __________ b: __________

Now uncheck Best-Fit while you enter the new data set.

Data Set 4: (0,5.5), (1,5), (0,4.5), (4,9), (5,10), (6,9), (8,7), (9,6), (9,12), (10,11), (11,11), (13,10), (15,11), (17,14)
My Line: a: __________ b: __________
Best-Fit: a: __________ b: __________

Now uncheck Best-Fit while you enter the new data set.

Data Set 5: (0.5,5.5), (1,4.5), (1,6.5), (2.5,4.5), (2.5,6.5), (3,5.5), (11,10.5), (10.5,11), (11,12), (12,12), (12.5,11.5), (11,10.5), (16,14), (16.5,15), (17,16), (18,15.5), (18.5,15), (19,15)
My Line: a: __________ b: __________
Best-Fit: a: __________ b: __________

Part 2 - Excel

Now transfer each of the data sets into Excel (or your spreadsheet of choice), and perform a linear regression. In the case of Excel, the preferred method is the LINEST function.
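A note beyond the handout: the extra row LINEST returns contains the standard errors of the slope and intercept, computed from the residuals of the best-fit line. If you want to cross-check Excel, the Python sketch below (numpy assumed) computes those uncertainties; running it on Data Set 2 should reproduce LINEST's second row.

```python
import numpy as np

def fit_with_uncertainties(x, y):
    """Least-squares slope/intercept plus their standard errors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    xm = x.mean()
    sxx = ((x - xm) ** 2).sum()
    m = ((x - xm) * (y - y.mean())).sum() / sxx
    b = y.mean() - m * xm
    residuals = y - (m * x + b)
    s2 = (residuals ** 2).sum() / (n - 2)         # residual variance
    se_m = np.sqrt(s2 / sxx)                      # uncertainty on the slope
    se_b = np.sqrt(s2 * (1 / n + xm ** 2 / sxx))  # uncertainty on the intercept
    return m, se_m, b, se_b

# Data Set 2 again: expect a slope uncertainty near 0.195 and an
# intercept uncertainty near 1.718, matching LINEST's second row.
print(fit_with_uncertainties([1, 3, 6, 7, 10, 11, 15],
                             [1, 4, 5, 9, 8, 13, 10]))
```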
Record the uncertainties on the slopes for the various data sets below.

Uncertainties on slope:
Data Set 1: __________
Data Set 2: __________
Data Set 3: __________
Data Set 4: __________
Data Set 5: __________

Discussion Questions

Part 1:
1. Comment on the overall agreement, or lack thereof, between the slope you chose and the one from the best fit. Can you account for the discrepancies?
2. Two of your data sets are not like the others; what is meant by this is that two of them seem reasonably unphysical, or, when doing an experiment, real data would not likely cluster in these ways. Which two are they?
3. The quality of a fit to a data set is usually quantified by the r value (or the R² in the case of Excel; the R² is just the square of the r value). The r value runs from 1 for a perfect positive linear trend to -1 for a perfect trend with a negative slope; a value of 0 means no correlation whatsoever. Given that, which trends had the highest r value? Can you explain why it was so high?
4. From the answers to 2 and 3, is it always possible to draw conclusions about the significance of a relationship by knowing the r value alone? Why or why not?
5. Explain this statement: correlation does not mean causation. Can you think of a real-life situation where two variables are correlated but without a causal relationship?

Part 2:
6. From the uncertainties determined from the linear regressions, can you draw any general conclusions about any correlation between the r value of a set of data and the uncertainty on its slope?
7. Does either metric alone tell the whole statistical story of the quality of a set of data? Explain.

Explanation & Answer

Here you go! In Lab 1, there is a highlighted sentence reminding you to add the data values to the Google Doc for your class. Make sure you delete that sentence before you turn in the lab! Let me know if you need any help in the future 😀

N = 10

Dice 1   Dice 2   Sum
2        6        8
4        1        5
3        1        4
6        1        7
2        2        4
4        6        10
1        4        5
2        5        7
6        4        10
5        3        8

AVERAGE: 6.8
STD. DEV. (sample): 2.250926
STD. DEV. (population): 2.135416

N = 30

Dice 1   Dice 2   Sum
3        6        9
4        6        10
2        1        3
1        6        7
5        6        11
2        3        5
6        3        9
5        4        9
2        3        5
1        2        3
1        2        3
4        2        6
6        1        7
1        3        4
6        5        11
2        2        4
3        2        5
5        6        11
6        6        12
5        3        8
2        3        5
5        1        6
5        4        9
1        4        5
3        6        9
5        6        11
3        5        8
3        3        6
1        2        3
1        6        7

AVERAGE: 7.033333
STD. DEV. (sample): 2.785224
STD. DEV. (population): 2.73841

N = 60

Dice 1   Dice 2   Sum
5        1        6
1        1        2
6        2        8
1        1        2
3        3        6
6        5        11
6        5        11
3        5        8
1        1        2
3        6        9
1        1        2
5        5        10
1        5        6
6        3        9
6        6        12
2        6        8
1        2        3
1        4        5
2        5        7
2        1        3
2        4        6
6        4        10
3        6        9
6        2        8
6        1        7
6        5        11
2        5        7
5        6        11
4        1        5
1        6        7
1        4        5
2        1        3
6        5        11
3        3        6
3        3        6
3        6        9
6        2        8
4        2        6
5        2        7
6        6        12
4        5        9
4        2        6
3        2        5
4        3        7
2        1        3
3        2        5
5        4        9
4        5        9
4        6        10
3        3        6
2        1        3
5        5        10
6        2        8
2        2        4
4        5        9
1        6        7
5        2        7
5        4        9
3        2        5
4        6        10

AVERAGE: 7.083333
STD. DEV. (sample): 2.726466
STD. DEV. (population): 2.70365

Part 2: Stopwatch Trials

TRIAL   TIME (s)
1       2.28
2       1.96
3       1.97
4       2.37
5       1.77
6       2.13
7       2.06
8       1.50
9       2.35
10      1.81
11      2.21
12      1.97
13      1.75
14      2.36
15      1.89

AVERAGE: 2.025333
STANDARD DEVIATION: 0.259033

Data Set 1

X   Y
1   2
2   4
3   6
4   8
5   10
6   12
7   14

LINEST output:
Slope: 2 (uncertainty: 0)
Y-intercept: 0 (uncertainty: 0)

[Chart: Data Set 1 scatter plot with linear trendline, y = 2x, R² = 1]

Data Set 2

X    Y
1    1
3    4
6    5
7    9
10   8
11   13
15   10

LINEST output:
Slope: 0.718813906 (uncertainty: 0.195383229)
Y-intercept: 1.700408998 (uncertainty: 1.717658786)

[Chart: Data Set 2 scatter plot with linear trendline, y = 0.7188x + 1.7004, R² = 0.7302]

Data Set 3

X    Y
1    1
1    2
1    3
2    1
2    2
2    3
12   10
15   15

LINEST output:
Slope: 0.75 (uncertainty: 0.100593477)
Y-intercept: 0.892857143 (uncertainty: 0.479423548)

[Chart: Data Set 3 scatter plot with linear trendline, y = 0.8851x + 0.6419, R² = 0.9563]

Data Set 4

X    Y
0    5.5
1    5
0    4.5
4    9
5    10
6    9
8    7
9    6
9    12
10   11
11   11
13   10
15   11
17   14

LINEST output:
Slope: 0.495215311 (uncertainty: 0.221985851)
Y-intercept: 5.444976077 (uncertainty: 0.999816837)

[Chart: Data Set 4 scatter plot with linear trendline, y = 0.4341x + 5.58, R² = 0.6454]

Data Set 5

X      Y
0.5    5.5
1      4.5
1      6.5
2.5    4.5
2.5    6.5
3      5.5
11     10.5
10.5   11
11     12
12     12
12.5   11.5
11     10.5
16     14
16.5   15
17     16
18     15.5
18.5   15
19     15

LINEST output:
Slope: 0.503629764 (uncertainty: 0.117818558)
Y-intercept: 4.667422868 (uncertainty: 0.535764542)

[Chart: Data Set 5 scatter plot with linear trendline, y = 0.599x + 4.5048, R² = 0.9613]



