Lab Olympic project
School of Mathematics and Science, Math 153 Introduction to Statistics, Fall 2015
Curve-Fitting Project: Linear Regression Model
A. Summary
For this assignment you will collect data, analyze whether the data exhibits a linear trend, find the line of best fit, plot the data and the line, interpret the slope, and use the linear equation to make a prediction. You will use r² (the coefficient of determination) and the p-value to evaluate the strength of your prediction. Finally, you will write a report discussing your findings.
B. Background
The modern-day Olympics began in 1896. You can read an overview of the history of the Olympics on Wikipedia, but, in short, the Summer and Winter Olympics have generally been held every 4 years since. There have been periods where war or politics reduced participation, including boycotts of the Olympics due to the Cold War in 1980 and 1984. The web site www.databaseolympics.com has a “comprehensive medal history of every Summer and Winter Olympics since 1896.” The web site includes all of the gold, silver, and bronze winners up to the Olympics of 2008.
Looking at the data on this web site, there is an interesting trend. Over the modern history of the Olympics, the performance of some athletes seems to be getting better and better. Your job will be to analyze the Olympic data for a particular event, see if a trend exists, and use the information to make a prediction about the 2012 Olympic games. The analysis you will be doing is called a “linear regression”.
A linear regression is a technique for examining real-world data to determine if the data follows a linear model. In other words, given some data points, can we reliably use a line to model the points and make predictions? There are tools available which will find the line that best approximates a set of data points. The tools also provide a measure of how well the line fits the data values. If a line exists that is a good fit, then we can use the line to make predictions for values we do not have.
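As a reminder of what these tools compute, the standard least-squares formulas can be sketched as follows. The symbols here are notation assumed for this sketch and may differ from your textbook: m is the slope, b is the intercept, and bars denote the sample means of the n data points.

```latex
\[
y = mx + b, \qquad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b = \bar{y} - m\,\bar{x}
\]
\[
r^2 = \frac{\left[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right]^2}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
```

A value of r² near 1 means the points lie close to the line; a value near 0 means the line explains little of the variation in y.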
There are a variety of reference materials available to help you complete the project.

Chapters 10.1-10.3 of your textbook have material on least-squares fitting of a line to data points.

The following YouTube video is an introduction to Linear Regression. This is background/motivation rather than how to actually compute a linear regression.
LINK: Introduction to Linear Regression

You can use Minitab as we did in class to determine the coefficient of determination (r²), the p-value, and the least-squares fit for your data points. You can also use Excel. Shown below are some tutorial videos on using Excel to compute a linear regression.
LINK: Computing r with Excel
LINK: Computing p-value with Excel
LINK: Least Squares Fitting with Excel
C. Instructions
The tasks required to complete the project are listed below.

1. Select an event from the Summer Olympic games, or an indoor event from the Winter Olympic games. The event you select must be a measured event where the winner is the one with the best time, lifts the most weight, jumps the highest, etc. It cannot be one of the scored events where the winner has the highest or lowest score. For example, ice hockey or figure skating cannot be used, as both of these events rely on scoring.

2. Write your purpose. This can be a one- or two-sentence summary of the goal of your research.

3. Go to the Olympics database and note the year and gold medal data for your event for at least 8 different Olympic years. You must have at least 8 data points for your project. Make a table which summarizes the data that you obtained, including labels with units. If the website gives you mixed units (such as “minutes:seconds”), make sure that you convert them to be solely in minutes or solely in seconds.

4. Plot the points (x, y) to obtain a scatterplot. Use an appropriate scale on the horizontal and vertical axes and be sure to label the axes carefully, including units.

5. Find and state the value of r² and the p-value. Discuss your findings in a few sentences. Is a line a reasonable model to fit to this data? Why or why not? Is the linear relationship very strong, moderately strong, weak, or nonexistent? Is it likely that your data came from solely random chance? If your p-value or r² is too poor, you may have to select a new event at this point. Use the criteria we have established in class.

6. Find the line of best fit (regression line) and graph it on the scatterplot. You can use Minitab to do this or any other software that does a least-squares fitting. The equation of the line must be included on the graph or in the text.

7. State the slope of the line of best fit. Carefully interpret the meaning of the slope in a sentence or two. This means more than simply writing “the slope is ___”. Give an example, for instance, of what your slope means.

8. Make a prediction about the gold medal winner in the 2012 Olympics using the line of best fit that you found above. Show calculation work.

9. Write a brief narrative of a paragraph or two. Summarize your topic and what you did as well as your findings. Be sure to mention any aspect of the linear model project (topic, data, scatterplot, line, r², or estimate, etc.) that you found particularly important or interesting. Do not just mimic what I have said in my sample project; thoughtfully describe your own project.
Note that while your project must meet our established minimum standard for a least-squares fitting, a successful project does not require a perfect fit. Instead, a successful project is one where students correctly interpret their findings and demonstrate an understanding of what their results mean. Interpret numbers. What does your slope mean? What does your r² mean? What does your p-value mean? What are the implications (good or bad) of these numbers on your prediction for the future?
Items #2-#9 constitute your project report. Be sure to include your name and a meaningful title at the beginning of your report. While mathematics can be handwritten, any descriptions, sentences, or paragraphs must be typed. Be sure your scatterplot has labels for the axes including units, and that the line of best fit is graphed in addition to your data. Your thoughts should be in complete sentences using proper English and punctuation. Projects are graded on the basis of completeness, correctness, and strength of the narrative portions.
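For the data-collection step, the conversion from mixed units to seconds can be sketched in Python as below. This is only an illustration: the function name to_seconds and the sample strings are hypothetical, and the actual format of the times on the website may differ.

```python
def to_seconds(mark: str) -> float:
    """Convert a time such as 'minutes:seconds' (or plain seconds) to seconds.

    Each colon-separated field is worth 60 times the field to its right,
    so 'hours:minutes:seconds' is handled by the same loop.
    """
    total = 0.0
    for field in mark.strip().split(":"):
        total = total * 60 + float(field)
    return total


# Hypothetical sample values, not taken from the Olympics database.
print(to_seconds("3:32.07"))   # 3 minutes plus 32.07 seconds
print(to_seconds("43.49"))     # already in seconds, returned unchanged
```

Whichever unit you choose, use it for every row of your table so that the slope has a single, clear unit.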
D. What to Turn In
Turn in your scatterplot, any work you did by hand, and a printout of your report.
Curve-Fitting Example Project: Men’s 400 Meter Dash
A. Purpose:
To analyze the winning times for the Olympic Men’s 400 Meter Dash using a linear model, and predict the winning time in the 2012 Summer Olympics.
B. Data:
The winning times were retrieved from www.databaseolympics.com. The winning times were gathered for the most recent 16 Summer Olympics, post-WWII. (More data was available, back to 1896.)
Year    Time (secs)
1948    46.20
1952    45.90
1956    46.70
1960    44.90
1964    45.10
1968    43.80
1972    44.66
1976    44.26
1980    44.60
1984    44.27
1988    43.87
1992    43.50
1996    43.49
2000    43.84
2004    44.00
2008    43.75
C. Scatterplot:
[Scatterplot: Time (secs) versus Year]
D. Coefficient of Determination and P-Value:
r² = 0.6991, p-value = 0.0002
The p-value of 0.0002 is less than 0.05, indicating that it is unlikely that this data came about from random chance. The moderate coefficient of determination (0.6991) means that the line of best fit is a reasonable model for this data. Thus the year can be used to make an approximate prediction of the winning time for the 2012 Olympics using the line of best fit. For a better prediction, the r² value should be closer to 0.85 or even 1.0 (perfect). Also note that at some point the physical limitations of the runners will make the model inaccurate.
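The r² value can be checked directly from the data table. The following is a minimal Python sketch rather than the Minitab or Excel procedure used in class; the function name r_squared is chosen here for illustration. It does not compute the p-value, which requires a t-distribution and is easier to obtain from Minitab or Excel.

```python
# Winning times (seconds) for the 16 Summer Olympics from 1948 to 2008,
# copied from the data table.
years = list(range(1948, 2012, 4))   # 1948, 1952, ..., 2008
times = [46.20, 45.90, 46.70, 44.90, 45.10, 43.80, 44.66, 44.26,
         44.60, 44.27, 43.87, 43.50, 43.49, 43.84, 44.00, 43.75]


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


print(round(r_squared(years, times), 4))   # 0.6991
```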
E. Line of Best Fit (Regression Line)
y = −0.0431x + 129.84, where x = Year and y = Winning Time (in seconds)
The slope is −0.0431; it is negative because the winning times are generally decreasing. The slope indicates that, in general, the winning time decreases by 0.0431 second per year, so the winning time decreases at an average rate of 4(0.0431) = 0.1724 second over each 4-year Olympic interval.
F. Prediction:
For the 2012 Summer Olympics, substitute x = 2012 to get y = −0.0431(2012) + 129.84 ≈ 43.1 seconds. The regression line predicts a winning time of 43.1 seconds for the Men’s 400 Meter Dash in the 2012 Summer Olympics in London.
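The regression line and the prediction can be reproduced from the data table with a short Python sketch. The function name fit_line is hypothetical, and this is one way to carry out a least-squares fit, not the Minitab output itself.

```python
# Winning times (seconds) for 1948 through 2008, from the data table.
years = list(range(1948, 2012, 4))
times = [46.20, 45.90, 46.70, 44.90, 45.10, 43.80, 44.66, 44.26,
         44.60, 44.27, 43.87, 43.50, 43.49, 43.84, 44.00, 43.75]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, times)
prediction = slope * 2012 + intercept

print(f"y = {slope:.4f}x + {intercept:.2f}")           # y = -0.0431x + 129.84
print(f"predicted 2012 time: {prediction:.1f} secs")   # predicted 2012 time: 43.1 secs
```

The sketch uses the unrounded slope and intercept, while the hand calculation uses the rounded ones; both give 43.1 seconds to one decimal place.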
G. Narrative:
The data consisted of the winning times for the men’s 400m event in the Summer Olympics, for 1948 through 2008. The data exhibit a moderately strong downward linear trend, looking overall at the 60-year period. The r² value and the p-value indicate that a line is a reasonable model for this data, giving me confidence in the prediction based on the regression line. The r² value was not near 1.0, however, which means that predictions are not expected to be extremely accurate.
The regression line predicts a winning time of 43.1 seconds for the 2012 Summer Olympics, which would be nearly 0.4 second less than the existing Olympic record of 43.49 seconds, quite a feat! Will the regression line’s prediction be accurate? In the last two decades, there appears to be more of a cyclical (up and down) trend. Could winning times continue to drop at the same average rate? Extensive searches for talented potential athletes and improved full-time training methods can lead to decreased winning times, but ultimately, there will be a physical limit for humans.
Note that there were some unusual data points of 46.70 seconds in 1956 and 43.80 seconds in 1968, which are far above and far below the regression line, respectively. I wondered if these values made the correlation less strong, so I fit a second line using only the 10 most recent winning times (1972 through 2008), which excludes both points. For that subset, the coefficient of determination is r² = 0.5351, which is not as strong as when we considered the time period going back to 1948 (the p-value is 0.01, which is still below the 0.05 threshold). The lower coefficient of determination means that the prediction will not be as good. Also, the most recent set of 10 winning times does not visually exhibit as strong a linear trend as the set of 16 winning times dating back to 1948.
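The comparison between the full data set and the 10 most recent winning times can be sketched in Python as follows. The names results, recent, and r_squared are chosen for this illustration only.

```python
# (year, winning time in seconds) pairs from the data table.
results = [(1948, 46.20), (1952, 45.90), (1956, 46.70), (1960, 44.90),
           (1964, 45.10), (1968, 43.80), (1972, 44.66), (1976, 44.26),
           (1980, 44.60), (1984, 44.27), (1988, 43.87), (1992, 43.50),
           (1996, 43.49), (2000, 43.84), (2004, 44.00), (2008, 43.75)]


def r_squared(points):
    """Coefficient of determination for a line fitted to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy ** 2 / (sxx * syy)


recent = results[-10:]   # the 10 most recent games, 1972 through 2008

print(f"all 16 games:  r^2 = {r_squared(results):.4f}")   # 0.6991
print(f"last 10 games: r^2 = {r_squared(recent):.4f}")    # 0.5351
```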
H. Conclusion:
I have examined two linear models, using different subsets of the Olympic winning times for the men’s 400 meter dash. The prediction from the model with the stronger coefficient of determination was 43.1 seconds for the 2012 Olympics. I checked on another website (olympic.org) and found that when the race was run in August 2012, the winning time was 43.94 seconds. This means my estimate was 0.84 second below the actual time.
Does this mean the trend of the last 50 years is finally coming to an end? It will be interesting to compare these results to the upcoming 2016 results to see what happens!