St Petersburg College Coding Economic R Programming Project


Question Description

I’m working on an Economics question and need guidance to help me study.

No graph needed, just answer the questions directly

Read note VERY carefully, before you start

Unformatted Attachment Preview

ECON 4445 Midterm Review Note

Ch2. Time Series Graphics

1. It is always a good idea in time series analysis to plot the data first to see how the data change over time.

2. To do that, a convenient way in R is to define a time series object (ts). Two important things to specify: (1) the starting period; (2) the frequency. Examples:

i. Annual data starting in 2012.

    y <- ts(c(123,39,78,52,110), start=2012)

ii. Monthly data starting in January 2003.

    y <- ts(c(123,39,78,52,110,321,432,54,42,56,77,88,22,11,232), start=2003, frequency=12)

In the above examples, c(.) is a made-up series. When dealing with actual data, just replace c(.) with the name of your data. The data in fpp2 are already stored as time series objects, so we can skip this step when we use them.

3. The most common plotting function we use in this class is autoplot. Example:

    autoplot(a10) + ggtitle("Antidiabetic drug sales") + ylab("$ million") + xlab("Year")

In the above program, we plot the data in a10, set the title of the figure to "Antidiabetic drug sales", and label the y axis "$ million" and the x axis "Year".

4. There are three patterns that we want to identify in a time series.
a. Trend: a long-term increase or decrease in the data.
b. Seasonal: when a time series is affected by seasonal factors such as the time of the year or the day of the week.
c. Cyclic: when the data exhibit rises and falls that are not of a fixed frequency.

5. To identify a seasonal pattern, two useful plots can be used: ggseasonplot and ggsubseriesplot. Examples:

    ggseasonplot(a10, year.labels=TRUE, year.labels.left=TRUE) + ylab("$ million") + ggtitle("Seasonal plot: antidiabetic drug sales")
    ggsubseriesplot(a10) + ylab("$ million") + ggtitle("Seasonal Subseries plot: antidiabetic drug sales")

Try to apply the above two programs. Can you identify a seasonal pattern based on these plots?

6. Two other very important plots for identifying time series patterns are the lag plot and the autocorrelation plot.
Example of lag plot:

    beer2 <- window(ausbeer, start=1992)
    gglagplot(beer2)

These plots are scatter plots of the current value (y axis) against a lagged value (x axis). For example, the first plot shows the current value against the value one period earlier. Four different colors represent the four different quarters. From these plots, we can see that the lag 2 and lag 6 plots show negative correlation, and the lag 4 and lag 8 plots show positive correlation. This is evidence of a seasonal pattern.

For the autocorrelation plot, we first need to define the autocorrelation:

$$ r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2} $$

k in the above formula can be equal to 1, 2, 3, …. For example, when k=1, we are calculating the correlation coefficient between the current value and the value one period earlier.

    ggAcf(beer2)

We find:
(1) $r_4$, $r_8$, $r_{12}$, $r_{16}$ are largely positive.
(2) $r_2$, $r_6$, $r_{10}$, $r_{14}$, $r_{18}$ are largely negative.

When we observe significant ups and downs in the autocorrelation function at a fixed spacing, we can conclude that the data have a seasonal pattern.

When data have a trend, the autocorrelations for small lags tend to be large and positive because observations nearby in time are also nearby in size.

    aelec <- window(elec, start=1980)
    ggAcf(aelec, lag.max=48)

We find two things in the above ACF plot. First, it decays at a really slow rate. This is evidence of a trend. Second, it has significant ups and downs at a fixed spacing. This is evidence of a seasonal pattern.

The blue dashed lines in the ACF figure represent critical values. We didn’t get into the details of hypothesis testing, so you are not required to write down a formal null hypothesis. Just understand the blue dashed lines as thresholds for deciding whether we have enough evidence (from our sample) to suggest that the k-th autocorrelation is not equal to 0. For example, in the first autocorrelation plot above (for beer2), the first vertical line shows a negative correlation at lag 1, but it doesn’t pass the blue dashed line.
This means that we don’t have enough evidence (from our sample) to suggest that the first autocorrelation is not equal to 0. Statistically, we say the first autocorrelation is insignificant. The second vertical line shows a negative correlation at lag 2, and this one passes the blue dashed line. This means that we have enough evidence (from our sample) to suggest that the second autocorrelation is not equal to 0. Statistically, we say the second autocorrelation is significant.

7. One important thing to define in time series analysis is White Noise, which is a time series that shows no autocorrelation. For example, suppose I generate a series of normal random variables:

    set.seed(30)
    y <- ts(rnorm(50))
    autoplot(y) + ggtitle("White noise")
    ggAcf(y)

The resulting ACF plot shows that the series we generated doesn’t have any significant autocorrelation. This is one important feature of a White Noise series.

Why do we care about White Noise? When a series is not White Noise, something that happened in the past can help us learn about the future (that is why we observe some nonzero autocorrelation in a non-White-Noise series). When we can learn from the past, we would like to incorporate that part into our model to improve the forecasting performance. Therefore, a really important criterion for a forecasting model is that the residuals (the difference between your forecast and the actual value) should look like White Noise. We will talk about this more when we start to introduce our forecasting models.

Ch3. The Forecasters’ Toolbox

1. Now we are ready to introduce our basic forecasting methods. We have four methods in this chapter:
a. Average method: using the average of the series as the forecast.
b. Naïve method: using the last observation as the forecast.
c. Seasonal naïve method: using the last observation from the same season (the same month if the data are monthly) as the forecast for each season (month).
d. Drift method: using the line extrapolated between the first and last observations as the forecast.

Each of them has its own advantage for some data sets. Suppose we are trying to forecast a series that does not change too much and doesn’t have a trend, seasonal, or cyclic pattern; then the average method is a good candidate. An example series is the US inflation rate from 2010 to 2019. The naïve method works well for short-run forecasts when the series has a trend. An example series is stock price data. The seasonal naïve method works best when the data show strong seasonality. An example series is ice cream or beer consumption. The drift method works well for long-run forecasts when the series has a trend.

2. Before we start to forecast our data, one thing people sometimes do is transform the data. The most important transformation we mentioned in class is the Box-Cox transformation. In the following formula, $y_t$ represents the original series, and $w_t$ represents the transformed series.

$$ w_t = \begin{cases} \log(y_t) & \text{if } \lambda = 0 \\ (y_t^{\lambda} - 1)/\lambda & \text{otherwise} \end{cases} $$

The purpose of this transformation is to make the variances of your time series at each time point approximately the same, or at least closer to each other. This is related to an issue in econometrics called heteroscedasticity. We didn’t get into detail about heteroscedasticity, so you don’t have to worry about it. Intuitively, we would like to make sure that our forecasting model has about the same uncertainty at each time point. This gives us a better measure of the overall uncertainty of the model, so that we can construct a more precise prediction interval (how uncertain our forecast is).

Several important points related to the Box-Cox transformation:

a. First and most important, the Box-Cox transformation is usually applied to time series where the variance (how large the fluctuation of the data is) changes with the level of the data. For example, suppose your data have a positive (negative) trend, so the level of the data is increasing (decreasing) over time.
If you observe that the variance of your data also increases (decreases) over time, applying the Box-Cox transformation is a good idea.

b. The Box-Cox transformation is driven by a single parameter $\lambda$. This parameter can be any real number. It can be selected automatically by R. For example, suppose I want to determine the optimal value of $\lambda$ for the data series elec:

    (lambda <- BoxCox.lambda(elec))

But usually the results are not too sensitive to the value of $\lambda$. Therefore, people sometimes just choose a $\lambda$ that is easier to interpret. For example, 0 is a common choice, as it represents the log transformation.

c. Once the value of $\lambda$ is determined, we can use the following code to plot the transformed series:

    autoplot(BoxCox(elec,lambda))

d. If we want to apply the Box-Cox transformation with any of our forecasting methods above, we can just specify lambda=x in the forecasting function, where x denotes the value you chose for $\lambda$. For example, the following code shows how you can specify lambda in the forecasting functions:

    fc <- meanf(eggs, lambda=0, h=50, level=80)
    fc <- naive(eggs, lambda=0, h=50, level=80)
    fc <- snaive(eggs, lambda=0, h=50, level=80)
    fc <- rwf(eggs, drift=TRUE, lambda=0, h=50, level=80)

e. What the above forecasting methods do with the Box-Cox transformation is use the transformed series to make the forecast, and then transform everything (including the forecast) back. However, the results that R provides by default are the medians of the forecast distributions, not the means. This is because the Box-Cox transformation is nonlinear, and a nonlinear transformation changes the mean. In the slides I provide a detailed derivation, involving a Taylor expansion, of how we can get the mean instead of the median. You are not required to understand it. Just remember that the default results are medians, not means.
If we want to get the mean forecast from R, we need to add biasadj=TRUE in the forecasting functions:

    fc <- meanf(eggs, lambda=0, h=50, level=80, biasadj=TRUE)
    fc <- naive(eggs, lambda=0, h=50, level=80, biasadj=TRUE)
    fc <- snaive(eggs, lambda=0, h=50, level=80, biasadj=TRUE)
    fc <- rwf(eggs, drift=TRUE, lambda=0, h=50, level=80, biasadj=TRUE)

3. As we mentioned in Chapter 2, we would like the residuals of our model (the difference between your forecast and the actual value) to be White Noise. Thus, the next step is to check whether this is true. Theoretically, we need our residuals to satisfy the following two assumptions:
a. Residuals are not autocorrelated. If residuals are autocorrelated, then there is information left in the residuals that we should use in the model.
b. Residuals have mean zero. If residuals have a nonzero mean, then our forecast is going to be biased. This means our forecast is not going to be the average of the future values.

In the best scenario, we would also like our residuals to have the following two useful (but not necessary) properties:
c. Residuals have constant variance. We will have a better way to measure the uncertainty if this is true.
d. Residuals are normally distributed. Our prediction intervals are calculated based on the assumption that residuals are normally distributed. So if residuals are normally distributed, the prediction intervals are going to be more precise.

4. To test for autocorrelation, we will use a test called the Ljung-Box test. This is a joint test of whether a set of autocorrelations are all equal to zero. Please understand the difference between using the blue dashed lines in the ACF plot and using the Ljung-Box test. When we use the blue dashed lines in the ACF plot, we are testing each autocorrelation separately. For example, using the first figure on page 3, we can test whether each autocorrelation is equal to zero or not by checking it against the blue dashed line.
But when we test them separately, each test ignores the other autocorrelations. For example, when we use the second vertical line in the first figure on page 3 to check whether the second autocorrelation is zero or not, we don’t check any other autocorrelations. This may lead to a different conclusion than the joint test.

5. To check all the above properties, we can use the checkresiduals function. For example, suppose we want to check the residuals from our basic forecasting methods; we can use the following code:

    checkresiduals(meanf(goog200))
    checkresiduals(naive(goog200))
    checkresiduals(snaive(goog200))
    checkresiduals(rwf(goog200, drift=TRUE))

6. We can evaluate the forecasting performance of our model by two different methods. The first method is used to evaluate long-term forecasting performance. The second method, which is also called the cross-validation method, is used to evaluate short-term forecasting performance.

a. To perform the first method, we need to split the data into two parts. The first part is called the training data, and the second part is called the test data. The intuition is that we want to create a situation where we can pretend to perform real forecasting. In reality, when we make a forecast, of course we don’t know the future value. We can only evaluate how far our forecast is from the actual observation after we observe the future value (which can only happen in the future). This makes evaluating forecasting performance very time consuming, as we always have to wait for the future observation to be realized. Therefore, an alternative is to split the data, use only the first part to construct the forecast of the second part, and then compare the forecast with the actual observations in the second part. Again, this is to mimic the situation when we perform a real forecast. The first method can be performed with the following procedure:

i. First, we split the data.
For example:

    beer2 <- window(ausbeer,start=1992,end=c(2007,4))
    beer3 <- window(ausbeer,start=2008)

The above two lines create two data sets. The first is called beer2; it starts in the first quarter of 1992 and ends in the fourth quarter of 2007. The second is called beer3; it starts in the first quarter of 2008 and runs to the end of the data set. We are going to use the first data set as the training data and the second data set as the test data.

ii. Second, we apply the forecasting methods to the first data set. For example, suppose I want to apply the average method, the naïve method, and the seasonal naïve method. I can use the following code to construct the forecasts of the next 10 periods after the end of the first data set (the fourth quarter of 2007).

    beerfit1 <- meanf(beer2,h=10)
    beerfit2 <- naive(beer2,h=10)
    beerfit3 <- snaive(beer2,h=10)

Note that we specify h=10 to get the next 10 periods of forecasts because our test data set has only 10 observations. If the test data set has a different number of observations, then we should change the value of h.

iii. Third, we compare the forecasts we got in step ii with the actual observations in the test data set. It can be done with the following code.

    accuracy(beerfit1, beer3)
    accuracy(beerfit2, beer3)
    accuracy(beerfit3, beer3)

iv. The accuracy function in the third step produces a lot of different measures of how far the forecast is from the actual observation. If you are interested, please take a look at the textbook for the definition of each measure. In this class, we will focus on a measure called root mean squared error (RMSE). The method which provides the smallest RMSE is the best.

b. For the second method, we still split the data into two parts. The first part is still called the training data, but we will use only the first observation in the second part as our test data.
And we are going to apply this method multiple times, until we use everything but the last observation as our training data and the last observation as our test data. This method focuses on the one-step-ahead forecast only (using observations up to today to forecast tomorrow). The code for the second method is much simpler than for the first one. For example, we can use the following code to perform this cross-validation method:

    e <- tsCV(goog200, rwf, drift=TRUE, h=1)
    sqrt(mean(e^2, na.rm=TRUE))
    #> [1] 6.233

The first line of the code specifies that we want to use the cross-validation method (CV) to perform model comparison. Inside the tsCV function, we first specify the data set. Notice that we don’t have to split the data ourselves in this method. The computer will do the splitting for us. So we can just specify goog200, which is our original data set. The second and third inputs of the tsCV function specify which method we want to use to forecast. In this example, we use the drift method. If we want to use the naïve method instead, for example, we can replace “rwf, drift=TRUE” with “naive”. We specify h=1 at the end of the first line because we want to focus on the one-step-ahead forecast only. The outcome of the first line is the forecast errors from this cross-validation method; “e” is going to be a vector. The second line of the code calculates the root mean squared error using the forecast errors we generated in the first line. The second input, na.rm=TRUE, is necessary because “e” contains missing values (NA) wherever a forecast error cannot be computed; please remember to include it. The outcome we get, 6.233, is the cross-validation RMSE from the drift method.

Example questions:

1. The pigs data shows the monthly total number of pigs slaughtered in Victoria, Australia, from Jan 1980 to Aug 1995. Use mypigs <- window(pigs, start=1990) to select the data starting from 1990. Use autoplot and ggAcf for the mypigs series and compare these to white noise.

Ans: Code and plots are on the next page. Time plot: Monthly data in thousands.
No features really jump out. Maybe a bit of a trend. The ACF shows significant spikes at lags 1, 2 and 3. Also note a large spike at the seasonal lag 12. If we had a longer series, the significance bounds would be tighter and this spike might also have been significant, indicating some seasonality. Definitely not a white noise series.

2. The arrivals data set comprises quarterly international arrivals (in thousands) to Australia from Japan, New Zealand, UK and the US.
a. Use autoplot, ggseasonplot and ggsubseriesplot to compare the differences between the arrivals from these four countries.
b. Can you identify any unusual observations?

Ans: We can apply the autoplot function directly, or add the input facets=TRUE to plot each country in its own panel.

The time plots show: a decrease in arrivals from Japan since the mid-1990s, an increase in arrivals from NZ, a downturn of arrivals from the UK since the mid-2000s and a flattening out of the arrivals from the US.

The seasonal plots show the difference in seasonal patterns from the four source countries. The peaks for the UK and the US happen in Q1 and Q4, which include the summer period in Australia and the Christmas and New Year’s holiday period, with Q2 and Q3 being the troughs. For Japan, peaks occur mostly in Q1 but also in Q3, reflecting peak arrivals in summer but also in winter, which possibly correspond to the winter skiing season or to visiting northern Australia during the dry season. The one source country that is noticeably different is New Zealand. Peak arrivals from New Zealand occur during Q3, followed by Q2 and Q4. Unlike all other source countries, the trough clearly occurs during Q1, the January (summer) quarter.

The seasonal plots are also useful for revealing anomalies or one-off events. For example, in the US plot, the peak arrivals for all July quarters occurred in 2000 during the Sydney Olympic games.
Unusual observations: 1991:Q3 is unusual for the US (Gulf war effect?); 2001:Q3-Q4 are unusual for the US (9/11 effect).

3. For the following two series, make a graph of the data. If transforming seems appropriate, do so and describe the effect. dole and bricksq.

Ans: The data was transformed using a Box-Cox transformation with parameter λ=0.33. The transformation has stabilized the variance. The time series was transformed using a Box-Cox transformation with λ=0.25 ...
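As a companion to question 3, here is a minimal base-R sketch of what a Box-Cox transformation does to a series whose fluctuations grow with its level. It assumes the standard definition (log when λ=0, (y^λ − 1)/λ otherwise) and positive data. The helper box_cox and the simulated series are made up for illustration; they are not part of fpp2, and the class itself uses BoxCox() from the forecast package.

```r
# Hypothetical helper (not from fpp2): Box-Cox transform for positive data.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Made-up monthly series: the level rises and the noise scales with the level.
set.seed(1)
level <- seq(10, 200, length.out = 120)
y <- ts(level * exp(rnorm(120, sd = 0.2)), start = 2003, frequency = 12)

# lambda = 0 is the log transformation.
w <- box_cox(y, lambda = 0)

# Compare the size of period-to-period changes in the second half of the
# sample with the first half, before and after the transformation.
sd_ratio_raw <- sd(diff(y[61:120])) / sd(diff(y[1:60]))
sd_ratio_bc  <- sd(diff(w[61:120])) / sd(diff(w[1:60]))
print(c(raw = sd_ratio_raw, transformed = sd_ratio_bc))
```

A ratio close to 1 means the fluctuations have a similar size in both halves, which is the sense in which the transformation "stabilizes the variance". For the actual dole and bricksq series, BoxCox.lambda() and BoxCox() from the notes would be used instead of this helper.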

Final Answer

let me know if you still need help!

R package
# Add the user's documents folder to the library search path
.libPaths(new = c("C:/Program Files/R/Libs", Sys.getenv("R_User")))
# Show that the documents folder was added to the search path (not needed for the workaround)
.libPaths()
# Install package "fpp2" into the documents folder
install.packages("fpp2", lib = Sys.getenv("R_User"))

Eco R studio Homework

1. Consider the data series, austourists, (available in fpp2 package, you can type
help(austourists) to see the description of this data)
(a) Plot the series. Do you observe any pattern (Trend/Seasonal/Cyclic) in this series?
The pattern is seasonal because the time series (quarterly visitor nights spent by international tourists to Australia) is affected by seasonal factors, in this case, the quarter of the year.
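A minimal sketch of how the series in part (a) could be plotted, assuming the fpp2 package is installed. The title and axis labels are illustrative choices, not part of the assignment.

```r
# Assumes fpp2 is installed; it provides austourists and attaches
# the forecast and ggplot2 functions used below.
library(fpp2)

# Time plot of the series, to look for trend, seasonal, or cyclic patterns.
p <- autoplot(austourists) +
  ggtitle("austourists") +
  xlab("Year") +
  ylab("Visitor nights (millions)")
print(p)

# Number of observations per year; this tells us what a "season" is here.
frequency(austourists)
```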

(b) Plot the two seasonal plots we discussed in class. Use the plots to support your
answer in part (a).
Using ggseasonplot, we can see the series plotted against the individual seasons; in this case, the season is the quarter, since austourists is quarterly data. It allows us to
see the seasonal pattern more c...
