To identify patterns and trends of monthly visitors to Yellowstone National Park and forecast next 12-month visitors
I was very lucky to have a chance to visit the Yellowstone National Park with a friend during the week of July 4th, 2019. I had such a wonderful experience visiting Park. I remember how much I enjoyed walking around the park and seeing those beautiful and colorful hot springs and geysers basins, and how excited I was when we run into our natural friends, bisons and deers, multiple times.
The park is open to the public 24/7 all year round. Every year there are more than 4 millions visitors to the Park, and number is growing year by year(might not held true for 2020 due to Covid). It would be great if we can make up-to-date forecast of visitor volumn to help the park service team to better plan and allocate resources. Being able to predict visitor volumn will help us not only plan for construction and/or maintaining of roads, lodges and campgrouds, but also reserve time for recovery of the Park. Ultmiately we hope to preserve the ecosystem (the natures and wildlife) and sustain the beauty of the Park to all visitors for many generations to come. I had a chance to work with a team to perform time series analysis on number of monthly visitors to Yellowstone National Park and build machine learning models to forecast next 12 month’s number of visitors.
Data Source: * Monthly Visitors to Yellowstone Nation Park: Integrated Resource Management Applications (IRMA) * Month Temperature: National Center for Environmental Information * Gas Price: U.S. Energy Information Administration
Load Necessary Libraries
library("fma")
library("fpp")
library("forecast")
library("TSA")
library("zoo")
library(lubridate)
library("ggplot2")
library("MLmetrics")
library("imputeTS")
library("gdata")
library("readxl")
library("GGally")
#library("MLmetrics")
library(dplyr)
library(Metrics)
library(dygraphs)
library("tseries")
accuracy <- forecast::accuracy
library("urca")
library(tidyverse)
library(tsibble)
library("vars")
Read the raw dataset and convert it to time series data
rawdata <- read.csv("FINAL_CLEAN_DATASET.csv")
head(rawdata)
dfts <- ts(rawdata$Visitors , start=c(1979, 1), end = c(2018,12),frequency = 12)
head(dfts,24)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1979 23605 31992 17813 34095 108952 313924 493902 493915 262786 76542
## 1980 24757 36373 28112 32466 101988 325211 557810 523549 229396 93240
## Nov Dec
## 1979 12698 22684
## 1980 25773 21594
save(dfts, file= "ts_yellow_stone_park.Rdata")
#load(file = "ts_yellow_stone_park.Rdata")
Split data into train and test set * Train Set : 1979-2017 (39 years : 468 months ) * Test Set : 2018 (1 year : 12 months)
train <- window(dfts, start = c(1979, 1), end = c(2017, 12), frequency = 12 )
test <- window(dfts, start = c(2018, 1), end = c(2018, 12), frequency = 12 )
First Impression of the data : * Compute mean and variance * Create time series plot for number of visitors (monthly)
We can see there is strong seasonality, the variance of data increases over time, and there is tiny upward trend in the data.
mean(train)
## [1] 245301.8
var(train)
## [1] 80991288221
autoplot(train, main = "Monthly Number of Visitors to Yellow Stone National Park",
xlab = "Year", ylab="Visitors")
Let’s take a closer look by breaking down the data and see what are the components. I created Additive and Multiplicative Decomposition plots. Both Decomposition plots show there is a upward linear trend(second component), strong seasonality (third component) and some randomnese (4th component) in the data. Compare between additive and multiplicative decomposition, I think the multiplicative one is a better representation of the data as the last component “random” looks more like a white nosie than that of the additive decomposition.
decom <- decompose(train)
plot(decom)
decom_add <- decompose(train, type = c("additive"))
plot(decom_add)
decom_mul <- decompose(train, type = c("multiplicative"))
plot(decom_mul)
Let’s take a look at the seasonality component of the data by creating seasonal plot Here is what we see: - Repeated seasonality within each year - November to April : low season - May to October : busy season with peak in July
seasonplot(train, type = "o",
main="Seasonal Plot : Visitors to Yellow Stone Park", ylab = "Visitors",
xlab = "Month", col = 4)
We noticed that the variance of the data increases over years. Let’s address this with Box-cox transformation.
lambda_t <- BoxCox.lambda(train)
lambda_t
## [1] 0.1140685
cbind("Before Transformation" = train,
"Box-cox transformed" = BoxCox(train, lambda = "auto")) %>%
autoplot(facets=TRUE) +
xlab("Year") + ylab("") +
ggtitle("Number of Visitors to Yellow Stone Park")
Here is the ADF and PACF plot of the data. We can see that the data is not stationary as there is sinusoid pattern(sine wave) in ACF plot and the values at lag-1 of lag-2 of PACF are significant. We will come back to this with differencing to remove/reduce non-stationarity.
# Display the data in time-series plot and ACF and PACF plot
ggtsdisplay(train,main='No.Of Visitors')
# Another way to display the data in time-series plot and ACF and PACF plot
tsdisplay(train)
# ACF plot
acf(train, 48)
# PACF plot
pacf(train, 48)
As the p-value of ADF test is smaller than 0.01 (<0.05), we reject the null hypothesis and accept the alternative hypothesis, meaning that the ADF test tells us that the data is level or trend stationary.
adf.test(train)
## Warning in adf.test(train): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: train
## Dickey-Fuller = -22.189, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
The p-value of the KPSS test is slightly greater than 0.05, which suggest that there is no enough evidence to reject the null hypothesis if we are following the p-value as of 0.05 threshold. The KPSS test suggests that data is level or trend stationary.
kpss.test(train,null = c("Level", "Trend"))
##
## KPSS Test for Level Stationarity
##
## data: train
## KPSS Level = 0.42703, Truncation lag parameter = 5, p-value =
## 0.06551
The p-value is smaller than 0.01 once we specify the long version of the truncation lag is used by changing lshort parameter from TRUE(default) to FALSE. Therefore, KPSS test with longer lag suggests the time series is not stationary.
kpss.test(train,null = c("Level", "Trend"), lshort='FALSE')
## Warning in kpss.test(train, null = c("Level", "Trend"), lshort = "FALSE"):
## p-value smaller than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: train
## KPSS Level = 0.84359, Truncation lag parameter = 17, p-value =
## 0.01
We see conflicting results from the three tests above(ADF test, KPSS test and KPSS test with longer lag). While the ACF and PACF plots shows there are signs of auto-coorelation with the data. There is a seasonal pattern which needs to be extracted from the data to improve modelling. We will proceed with differencing to remove the seasonal pattern.
Here we can see a sine-curve pattern at the ACF plot with significant spikes at lag-1, lag-12, 24, 36, 48 which indicates at least the time series has seasonal component and is Not stationary.
tsdisplay(train, lag.max = 48)
Here I only applied normal differencing with order equal to 1 on the data, however seasonality remains in the data after applying it.
dff.1 <- diff(train, lag = 1, differences =1 )
tsdisplay(dff.1, main="Differencing : nonseasonal order = 1 , seasonal order = 0")
Since seasonality is strong in this time series, it is recommended that seasonal differencing be done first then normal differencing, because sometimes the resulting series will be stationary and there will be no need for further first difference. Let’s try this.
Applying 1st order seasonal differencing removes the seasonality from the time series. We see there is a significant spikes at lag 1 in the PACF plot , none beyond lag 1 and the ACF is exponentially decayig after we applied 1st order seasonal differencing on the time series. This signals the data may contains an autoregressive component with order of 1.
dff.2 <- diff(train, lag = 12, differences =1 )
tsdisplay(dff.2, main="Differencing nonseasonal order = 0, seasonal order = 1")
Let’s apply another differencing, 1st order normal differencing on the top of 1st-order-seasonal-differenced data.
We see that the spike at lag-1 on the ACF plot actually points downward, autocorrelation of lag-1 changed from positive to negative, after applying the second differencing (1st order normal differencing), indicating that we might be over-differencing the time series. I decided to pursue only one differencing, first order seasonal differencing, as the differencing of choice.
# 1st order Seasonal Differencing followed by 1st order Normal Differencing
dff.3 <- diff(dff.2, lag = 1, differences = 1)
tsdisplay(dff.3 , main="Differencing : nonseasonal order = 1 , seasonal order = 1")
dff <- diff(train, lag = 12, differences = 1)
tsdisplay(dff, main="1st order Seasonal Differenced", lag.max = 48)
For this project, I choose the following metrics: * RMSE as the primary metrics * MAPE and AICc(if available) as complementary
Train Seasonal naive model
fc.snaive <- snaive(train, h=12)
fc.snaive$model
## Call: snaive(y = train, h = 12)
##
## Residual sd: 36371.7189
Forecast by the seasonal naive method
fc.snaive$mean
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 2018 29518 32275 23897 45160 419635 803652 962404 916166 640068 211987
## Nov Dec
## 2018 10468 21294
autoplot(fc.snaive)
check model fitness We forecast the number of visitors for next 12 month and compare the forecasted value with our test set.
#check forecast errors
Box.test(fc.snaive, lag = 12,type = c("Ljung-Box"))
##
## Box-Ljung test
##
## data: fc.snaive
## X-squared = Inf, df = 12, p-value < 2.2e-16
checkresiduals(fc.snaive, lag=12)
##
## Ljung-Box test
##
## data: Residuals from Seasonal naive method
## Q* = 210.29, df = 12, p-value < 2.2e-16
##
## Model df: 0. Total lags used: 12
accuracy(fc.snaive, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4876.351 36657.60 21023.11 -2.711473 17.332222 1.000000
## Test set -127.000 39734.67 22027.33 1.316248 8.099741 1.047768
## ACF1 Theil's U
## Training set 0.5400440 NA
## Test set -0.3172638 0.07413897
How was our base model(seasonal naive) performing ? As the p-value of Ljung_box test is smaller than 0.05, we rejects the null hypothesis that the time series isn’t autocorrelated, and therefore the residuals of snaive mode doesn’t qualified as white noise. Therefore we note that there is room for improvement because there are still patterns within the data that’s not captured by the seasonal naive model.
fit_ses <- ses(train, h = 12)
fit_ses$model
## Simple exponential smoothing
##
## Call:
## ses(y = train, h = 12)
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 185751.3911
##
## sigma: 182850.1
##
## AIC AICc BIC
## 14222.45 14222.50 14234.90
#check model fitness - ses
checkresiduals(fit_ses, lag=12)
##
## Ljung-Box test
##
## data: Residuals from Simple exponential smoothing
## Q* = 1230.7, df = 10, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 12
accuracy(fit_ses, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -351.4421 182459 130902.1 -106.32939 154.67501 6.226582
## Test set 321623.7473 484053 322864.4 53.41672 62.37515 15.357598
## ACF1 Theil's U
## Training set 0.6423947 NA
## Test set 0.7323945 1.053652
fit_holt <- holt(train, h=12)
fit_holt$model
## Holt's method
##
## Call:
## holt(y = train, h = 12)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.0067
##
## Initial states:
## l = -12474.4804
## b = 36708.3201
##
## sigma: 184245.2
##
## AIC AICc BIC
## 14231.55 14231.68 14252.29
checkresiduals(fit_holt, lag=12)
##
## Ljung-Box test
##
## data: Residuals from Holt's method
## Q* = 1229.6, df = 8, p-value < 2.2e-16
##
## Model df: 4. Total lags used: 12
accuracy(fit_holt, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -11777.05 183456.1 133244.9 -135.69544 171.11398 6.338022
## Test set 321904.37 484261.0 323065.9 54.04817 62.43508 15.367182
## ACF1 Theil's U
## Training set 0.6441791 NA
## Test set 0.7324382 1.054261
fit_hwa <- hw(train, seasonal ="additive", h=12)
fit_hwa$model
## Holt-Winters' additive method
##
## Call:
## hw(y = train, h = 12, seasonal = "additive")
##
## Smoothing parameters:
## alpha = 0.6892
## beta = 1e-04
## gamma = 0.2491
##
## Initial states:
## l = 185466.1856
## b = 395.5252
## s = -223719.6 -229754.7 -108230.4 174130.3 444516.3 532535.6
## 304503.9 -23480.65 -217854.5 -226139.9 -209993 -216513.3
##
## sigma: 42674.06
##
## AIC AICc BIC
## 12874.22 12875.58 12944.75
checkresiduals(fit_hwa)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' additive method
## Q* = 797.06, df = 8, p-value < 2.2e-16
##
## Model df: 16. Total lags used: 24
accuracy(fit_hwa, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -95.87312 41938.24 30673.87 -19.71565 70.10240 1.459055
## Test set 3763.37805 54314.22 43441.55 -44.00666 82.91316 2.066372
## ACF1 Theil's U
## Training set 0.4106732 NA
## Test set 0.4411813 0.3726546
fit_hwm <- hw(train, seasonal ="multiplicative", h=12)
fit_hwm$model
## Holt-Winters' multiplicative method
##
## Call:
## hw(y = train, h = 12, seasonal = "multiplicative")
##
## Smoothing parameters:
## alpha = 0.0259
## beta = 8e-04
## gamma = 0.4204
##
## Initial states:
## l = 178505.3378
## b = 2471.0002
## s = 0.211 0.1744 0.4579 1.3074 2.8496 3.1763
## 2.0406 0.7902 0.2277 0.2081 0.3821 0.1747
##
## sigma: 0.2239
##
## AIC AICc BIC
## 12240.96 12242.32 12311.48
checkresiduals(fit_hwm)
##
## Ljung-Box test
##
## data: Residuals from Holt-Winters' multiplicative method
## Q* = 149.52, df = 8, p-value < 2.2e-16
##
## Model df: 16. Total lags used: 24
accuracy(fit_hwm, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -822.4685 34492.88 19821.01 -9.562935 18.933591 0.9428201
## Test set -17003.4617 43541.78 25905.37 -6.710215 8.773936 1.2322334
## ACF1 Theil's U
## Training set 0.5876556 NA
## Test set 0.2041875 0.07354712
fit_hwad <- hw(train, seasonal ="additive", h=12, damped =TRUE)
fit_hwad$model
## Damped Holt-Winters' additive method
##
## Call:
## hw(y = train, h = 12, seasonal = "additive", damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.0881
## beta = 1e-04
## gamma = 0.5801
## phi = 0.98
##
## Initial states:
## l = 178757.1427
## b = 813.1658
## s = -223719.5 -229754.7 -108230.5 174130.7 444515.8 532535.5
## 304504.5 -23480.72 -218189.7 -226169.8 -209993.7 -216147.9
##
## sigma: 38113.43
##
## AIC AICc BIC
## 12769.40 12770.92 12844.07
checkresiduals(fit_hwad)
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' additive method
## Q* = 359.87, df = 7, p-value < 2.2e-16
##
## Model df: 17. Total lags used: 24
accuracy(fit_hwad, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 1920.405 37414.79 24882.90 -9.814040 40.94888 1.1835977
## Test set -1648.958 31452.33 19129.21 -5.523559 10.32900 0.9099136
## ACF1 Theil's U
## Training set 0.5697702 NA
## Test set -0.2911233 0.09409559
fit_hwmd <- hw(train, seasonal ="multiplicative", h=12, damped =TRUE)
fit_hwmd$model
## Damped Holt-Winters' multiplicative method
##
## Call:
## hw(y = train, h = 12, seasonal = "multiplicative", damped = TRUE)
##
## Smoothing parameters:
## alpha = 0.031
## beta = 1e-04
## gamma = 0.4954
## phi = 0.9722
##
## Initial states:
## l = 178505.7964
## b = 2470.5481
## s = 0.2111 0.0251 0.3669 1.4643 2.7496 3.1089
## 2.1438 0.8027 0.2369 0.1557 0.3581 0.3769
##
## sigma: 0.2607
##
## AIC AICc BIC
## 12360.60 12362.12 12435.27
checkresiduals(fit_hwmd)
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' multiplicative method
## Q* = 51.941, df = 7, p-value = 5.997e-09
##
## Model df: 17. Total lags used: 24
accuracy(fit_hwmd, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 2395.3967 33439.33 19288.06 -6.714892 18.197791 0.9174695
## Test set -972.6206 31175.87 19004.29 -2.013852 6.355149 0.9039715
## ACF1 Theil's U
## Training set 0.5620355 NA
## Test set -0.2160774 0.09439248
Compare Test RMSE of holt’s linear , holt-winters’ seasonal models with the seasonal naive model(baseline) * Noticed that the simple exponential smooothing(ses) model and holt’s linear trend model actually are not better than the base model as the RMSEs of ses and holt models are greater than the base model.
* Tow models outperforms the baseline model:Holt-Winters’ additive damped method and Holt-Winters’ multiplicative damped method
modelname <- c("snaive","ses","holt", "hw-add","hw-add-damp", "hw-mul","hw-mul-damp")
as.data.frame( cbind (modelname ,
"RMSE"= round(c(accuracy(fc.snaive, test)["Test set",2],
accuracy(fit_ses, test)["Test set",2],
accuracy(fit_holt, test)["Test set",2],
accuracy(fit_hwa, test)["Test set",2],
accuracy(fit_hwad, test)["Test set",2],
accuracy(fit_hwm, test)["Test set",2],
accuracy(fit_hwmd, test)["Test set",2]),2),
"MAPE(%)"= round(c(accuracy(fc.snaive, test)["Test set",5],
accuracy(fit_ses, test)["Test set",5],
accuracy(fit_holt, test)["Test set",5],
accuracy(fit_hwa, test)["Test set",5],
accuracy(fit_hwad, test)["Test set",5],
accuracy(fit_hwm, test)["Test set",5],
accuracy(fit_hwmd, test)["Test set",5]),2) ))
The best model from the 7 models above is Holt-Winters’ multiplicative damped method. However the residuals of this model doesn’t pass the Ljung-Box test, indicating the residuals is not white noise and there are still patterns remains in the residuals.
checkresiduals(fit_hwmd)
##
## Ljung-Box test
##
## data: Residuals from Damped Holt-Winters' multiplicative method
## Q* = 51.941, df = 7, p-value = 5.997e-09
##
## Model df: 17. Total lags used: 24
plot(fit_hwmd)
Let’s see what model will be selected by R ets() function.
The model selected by ETS() is multiplicative Error with multiplicative seasonality.
fit_ets <- ets(train)
summary(fit_ets)
## ETS(M,N,M)
##
## Call:
## ets(y = train)
##
## Smoothing parameters:
## alpha = 0.1055
## gamma = 0.2055
##
## Initial states:
## l = 221478.739
## s = 0.1026 0.109 0.4589 1.5244 3.0229 3.5129
## 1.9837 0.7561 0.1416 0.1114 0.1602 0.1165
##
## sigma: 0.2033
##
## AIC AICc BIC
## 12117.06 12118.12 12179.28
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 4920.755 38631.02 22640.88 -5.253336 17.35563 1.076952
## ACF1
## Training set 0.5785562
checkresiduals(fit_ets)
##
## Ljung-Box test
##
## data: Residuals from ETS(M,N,M)
## Q* = 61.677, df = 10, p-value = 1.743e-09
##
## Model df: 14. Total lags used: 24
Calculate train set lambda value
lambda_t <- BoxCox.lambda(train)
lambda_t
## [1] 0.1140685
The model selected by ETS() with box-cox tranformed training data is Additive Error with additive seasonality.
fit_ets_t <- ets(train, lambda = lambda_t)
summary(fit_ets_t)
## ETS(A,N,A)
##
## Call:
## ets(y = train, lambda = lambda_t)
##
## Box-Cox transformation: lambda= 0.1141
##
## Smoothing parameters:
## alpha = 0.0983
## gamma = 0.257
##
## Initial states:
## l = 23.157
## s = -5.1802 -6.0519 -0.2159 4.7843 7.6906 8.1356
## 6.0903 1.7034 -3.9509 -5.1951 -3.2089 -4.6013
##
## sigma: 0.7546
##
## AIC AICc BIC
## 2629.784 2630.846 2692.011
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 7037.772 34816.71 20324.62 -2.86916 16.02903 0.9667752
## ACF1
## Training set 0.5384073
checkresiduals(fit_ets_t)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,A)
## Q* = 49.684, df = 10, p-value = 3.051e-07
##
## Model df: 14. Total lags used: 24
Compare ETS models with and withoouot box-cox transformation: * the AICc value is much lower on the ets model when the data is Box-Cox transformed.
as.data.frame(cbind("ETS() model"=c("Not-transformed", "Box-Cox-transformed") , "AICc"= c(fit_ets$aicc, fit_ets_t$aicc)))
fc_ets<-forecast(fit_ets ,h=12)
fc_ets_t<-forecast(fit_ets_t ,h=12)
accuracy(fc_ets, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4920.755 38631.02 22640.88 -5.2533363 17.355628 1.076952
## Test set 4366.083 42872.09 28546.23 -0.8386341 8.274478 1.357850
## ACF1 Theil's U
## Training set 0.5785562 NA
## Test set -0.0326593 0.1846774
accuracy(fc_ets_t, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 7037.772 34816.71 20324.62 -2.8691600 16.029027 0.9667752
## Test set 6829.283 38256.02 25839.84 -0.5771483 8.057693 1.2291163
## ACF1 Theil's U
## Training set 0.53840729 NA
## Test set -0.07292969 0.1667021
Based on the Test RMSE, ETS model with Box-Cox transformation is better in forecasting next 12-month visitors than the ETS model without box-cox transformation. However the residual of both models doesn’t pass the white noise test.
** Between Seasonal Naive(base model), Holt-Winters’ Multiplicative Damped method and Exponential Smoothing State Space Model (Additive seasonal component and additive error), the best model is Holt-Winters’ Multiplicative Damped method so far as it has the lowest test error (RMSE and MAPE).**
as.data.frame(cbind("Model"=c( "Seasonal Naive (base model) ", "Holt-Winters' Multi Damped", "ETS(A,N,A) Box-Cox" ) ,
"AICc"= round( c( 0/0, fit_hwmd$model$aicc, fit_ets_t$aicc) , 2) ,
"RMSE" = round( c(accuracy(fc.snaive, test)["Test set",2] , accuracy(fit_hwmd, test)["Test set",2],
accuracy(fc_ets_t, test)["Test set", 2]), 2),
"MAPE(%)" = round( c(accuracy(fc.snaive, test)["Test set",5] , accuracy(fit_hwmd, test)["Test set",5], accuracy(fc_ets_t, test)["Test set", 5]), 2) ))
ACF and PACF plots can help us in choosing an appropriate Arima model
tsdisplay(dfts)
Confirm applying frist order of seasonal differencing gives us a staionary data
dff <- diff(train, lag = 12, differences = 1)
tsdisplay(dff, main="1st order Normal Differenced", lag.max = 48)
Interpreting the ACF and PACF plots above: * Applying 1st order seasonal differencing with lag=12, which means we have seasonal Arima model * Exponential series decaying at ACF plot and sharp cut off at lag=1 on PACF plot implying nonseasonal Auto Regressive AR(1). * Exponential decaying at lag=12, 24, 36 in PACF plot and sharp cut off at lag=12 on ACF plot implying seasonal MA(1) ACF and PACF plots imply a SARIMA (1,0,0)(0,1,1)[12] model
fit_arima_1 <-Arima(train,order=c(1,0,0),seasonal=list(order=c(0,1,1),period=12),lambda = lambda_t)
summary(fit_arima_1)
## Series: train
## ARIMA(1,0,0)(0,1,1)[12]
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ar1 sma1
## 0.3246 -0.6644
## s.e. 0.0468 0.0441
##
## sigma^2 estimated as 0.5716: log likelihood=-522.07
## AIC=1050.14 AICc=1050.2 BIC=1062.51
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 9346.253 31917.8 18485.82 -1.209553 15.24865 0.8793096
## ACF1
## Training set 0.4394491
checkresiduals(fit_arima_1)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(0,1,1)[12]
## Q* = 43.177, df = 22, p-value = 0.004487
##
## Model df: 2. Total lags used: 24
Model Goodness of Fit: Arima Model 1 : (1,0,0)(0,1,1)[12]
fit_arima_1_fc <- forecast(fit_arima_1 ,h = 12)
autoplot(fit_arima_1_fc)+ xlab("Year") + ylab("Number of Visitors")
accuracy(fit_arima_1_fc, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 9346.253 31917.80 18485.82 -1.209553 15.248653 0.8793096
## Test set 17951.883 39817.83 23685.51 4.281207 7.001718 1.1266418
## ACF1 Theil's U
## Training set 0.43944909 NA
## Test set -0.09280763 0.183352
mase(test,fit_arima_1_fc$mean)
## [1] 0.1393509
Then, let’s use auto.arima function
fit_autoarima <- auto.arima(train, stepwise = FALSE, approximation = FALSE, lambda = lambda_t)
summary(fit_autoarima)
## Series: train
## ARIMA(1,0,2)(1,1,1)[12] with drift
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 drift
## 0.9540 -0.6878 -0.1822 0.1603 -0.8040 0.0041
## s.e. 0.0295 0.0548 0.0476 0.0646 0.0419 0.0020
##
## sigma^2 estimated as 0.5405: log likelihood=-508.26
## AIC=1030.52 AICc=1030.77 BIC=1059.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5805.302 29971.91 16934.01 -3.817256 15.15294 0.805495
## ACF1
## Training set 0.4042169
auto.arima: (1,0,2)(1,1,1)[12]
checkresiduals(fit_autoarima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,2)(1,1,1)[12] with drift
## Q* = 25.475, df = 18, p-value = 0.1124
##
## Model df: 6. Total lags used: 24
autoplot(fit_autoarima)
Model Goodness of Fit : Arima Model 2 : (1,0,2)(1,1,1)[12]
fit_arima_2_fc <- forecast(fit_autoarima ,h = 12)
autoplot(fit_arima_2_fc)+ xlab("Year") + ylab("Number of Visitors")
accuracy(fit_arima_2_fc, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 5805.302 29971.91 16934.01 -3.817256 15.152935 0.805495
## Test set 8794.693 42860.17 28273.64 0.330978 7.573627 1.344884
## ACF1 Theil's U
## Training set 0.40421686 NA
## Test set -0.08996232 0.1880815
mase(test,fit_arima_2_fc$mean)
## [1] 0.1663446
Find the best Arima model by trying different combinations of autoregressive and moving average order and number of normal and seasonal differencing. As we dicussed during the differencing section above, we are sure that there is clear seasonal compoent within the dataset, I set seasonal differencing order to be 1. For normal differencing, I am open to see if 1-order of normal differencing will help the model predict better than without it, so normal differencing is set to either 0 or 1. The orders for AR and MA (both seasonal and normal component) are between 0, 1, 2. I limit the maximum number for any orders to 2 because I would like to keep the model as simple as possible so that the model is more interpretable.
p <- 0:2; d <- 0:1; q <- 0:2; P <- 0:2; D <- 1; Q <- 0:2
comb <- as.matrix(expand.grid(p,d,q,P,D,Q))
aicc_list <- numeric(nrow(comb))
rmse_list <- numeric(nrow(comb))
model_arima <- rep(0,nrow(comb))
mape_list <- numeric(nrow(comb))
mase_list <- numeric(nrow(comb))
for(k in 1:nrow(comb)){
arima_model <- try(Arima(train,order=c(comb[k,1], comb[k,2], comb[k,3]),
seasonal=list(order=c(comb[k,4], comb[k,5], comb[k,6]),
period=12), lambda = lambda_t), silent = TRUE)
if(!typeof(arima_model) == "character"){
arima_model <- Arima(train,order=c(comb[k,1], comb[k,2], comb[k,3]),
seasonal=list(order=c(comb[k,4], comb[k,5], comb[k,6]),
period=12),lambda = lambda_t)
fc <- forecast(arima_model ,h = 12)
model_arima[k] <- paste("ARIMA", "(", comb[k,1], "," , comb[k,2], ",",
comb[k,3], ")", "," , "(", comb[k,4] ,",", comb[k,5],",", comb[k,6], ")" ,"[12]")
aicc_list[k] <- arima_model$aicc
rmse_list[k] <- round(accuracy(fc, test)["Test set",2])
mape_list[k] <- round(accuracy(fc, test)["Test set",5], 2)
mase_list[k] <- round(mase(test, fc$mean), 4)*100
}
else{
print(paste("Error in Model", "ARIMA", "(", comb[k,1], "," , comb[k,2], ",",
comb[k,3], ")", "," , "(", comb[k,4] ,",", comb[k,5],",", comb[k,6], ")" ,"[12]"))
}
}
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 1 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 1 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 1 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 1 , 1 , 2 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 2 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 2 ) [12]"
arima_table <- as.data.frame(cbind ("ARIMA Model"=model_arima, "AICc" =round(aicc_list,2),
"rmse"=rmse_list, "mape(%)"=mape_list, "mase(%)" = mase_list ) )
arima_table <- arima_table[!(apply(arima_table, 1, function(y) any(y == 0))),]
arima_table$rmse <- as.numeric(as.character(arima_table$rmse))
arima_table$`mape(%)` <- as.numeric(as.character(arima_table$`mape(%)`))
arima_table$`mase(%)` <- as.numeric(as.character(arima_table$`mase(%)`))
arima_table[with(arima_table,order(rmse)),]
arima_table[with(arima_table,order(`mase(%)`)),]
By running a loop to try different combinations of autoregressive and moving average order and number of normal and seasonal differencing, I found that the best ARIMA model is ARIMA ( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12] as it ranks the lowest in test error (using RMSE and MASE metrics).
fit_arima_3 <-Arima(train,order=c(0,1,1), seasonal=list(order=c(1,1,0),period=12), lambda = lambda_t)
summary(fit_arima_3)
## Series: train
## ARIMA(0,1,1)(1,1,0)[12]
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ma1 sar1
## -0.8830 -0.4024
## s.e. 0.0289 0.0434
##
## sigma^2 estimated as 0.667: log likelihood=-554.4
## AIC=1114.79 AICc=1114.85 BIC=1127.16
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1796.977 35338.46 20336.56 -4.092848 16.6471 0.9673431
## ACF1
## Training set 0.5350276
checkresiduals(fit_arima_3)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)(1,1,0)[12]
## Q* = 54, df = 22, p-value = 0.0001645
##
## Model df: 2. Total lags used: 24
Model Goodness of Fit: Arima Model 3 : (0,1,1)(1,1,0)[12]
fit_arima_3_fc <- forecast(fit_arima_3 ,h = 12)
autoplot(fit_arima_3_fc)+ xlab("Year") + ylab("Number of Visitors")
accuracy(fit_arima_3_fc, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 1796.977 35338.46 20336.56 -4.0928478 16.64710 0.9673431
## Test set 2431.471 28162.62 16351.59 -0.9826062 6.28688 0.7777915
## ACF1 Theil's U
## Training set 0.5350276 NA
## Test set -0.3595344 0.08027541
mase(test,fit_arima_3_fc$mean)
## [1] 0.09620266
arima_eva <- as.data.frame(cbind("Model"=c( "Arima(1,0,0)(0,1,1)[12]", "Auto.Airma","Arima(0,1,1)(1,1,0)[12]"),
"RMSE" = round( c(accuracy(fit_arima_1_fc, test)["Test set",2] , accuracy(fit_arima_2_fc, test)["Test set",2],
accuracy(fit_arima_3_fc, test)["Test set", 2]), 2),
"MAPE(%)" = round( c(accuracy(fit_arima_1_fc, test)["Test set",5] , accuracy(fit_arima_2_fc, test)["Test set",5],
accuracy(fit_arima_3_fc, test)["Test set", 5]), 2),
"MASE(%)" = round( c(accuracy(fit_arima_1_fc, test)["Test set",6] , accuracy(fit_arima_2_fc, test)["Test set",6],
accuracy(fit_arima_3_fc, test)["Test set", 6]), 2) ))
arima_eva
Getting temperature and gas price data ready
# weather <- read.csv("/Users/yuanhongzhang/Downloads/2143862.csv")
#
# #extract year and month from DATE
# weather <- weather %>% mutate(cal_TAVG = (TMAX + TMIN)/2)
#
# weather$date <- format(as.Date(weather$DATE, format = "%Y-%m-%d"),"%Y-%m")
#
# #Do a groupby meaning the monthly values
# weather_groupby <- weather %>% group_by(date) %>% summarise(mean=mean(cal_TAVG,na.rm=TRUE))
#
# #create temperature dataframe and time series
# temp_df <- (as.data.frame(weather_groupby))
# temp_ts <- ts(temp_df[['mean']] , start=c(1979, 1), end = c(2018,12),frequency=12)
#
# save(temp_df, file= "/Users/yuanhongzhang/Documents/UChicago/Winter 2020/Time Series/Project/temperature_df.Rdata")
# save(temp_ts, file= "/Users/yuanhongzhang/Documents/UChicago/Winter 2020/Time Series/Project/temperature_ts.Rdata")
Load temperature data
load(file= "temperature_df.Rdata")
load(file= "temperature_ts.Rdata")
#Plot temperature time series
head(temp_df)
autoplot(temp_ts)
#Split temperature into train and test set
temp_train <- temp_df$mean[1:468]
dim(temp_train) <- c(468,1)
temp_test <- temp_df$mean[469:480]
dim(temp_test) <- c(12,1)
Load gas price data
gas_df <- read_excel("gasoline.xlsx")
head(gas_df)
Split gas price dataset into train and test and create dataframe and time series object
gas_train <- gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price`[1:468]
dim(gas_train) <- c(468,1)
gas_test <- gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price`[469:480]
dim(gas_test) <- c(12,1)
gas_ts <- ts(gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price` , start=c(1979, 1), end = c(2018,12), frequency = 12)
autoplot(gas_ts)
gasts_train <- window(gas_ts, start = c(1979, 1), end = c(2017,12), frequency = 12)
gasts_test <- window(gas_ts, start = c(2017, 1), end = c(2018,12), frequency = 12)
plot( temp_train, train,type="p" , main ="Number of Visitors vs Temperature", pch = 16,col = "blue", xlab= "temperature", ylab="Number of Visitors")
plot( gas_train,train, type="p" , main ="Number of Visitors vs Gas Price", pch = 16,col = "blue", xlab= "GasPrice", ylab="Number of Visitors")
print("Correlation between Number of Visitors vs Temperature: " )
## [1] "Correlation between Number of Visitors vs Temperature: "
cor(train, temp_train)
## [,1]
## [1,] 0.8785233
print("Correlation between Number of Visitors vs Gas Price: " )
## [1] "Correlation between Number of Visitors vs Gas Price: "
cor(train, gas_train)
## [,1]
## [1,] 0.1578205
combining visitors, temperature and gas price into one single dataframe
training_df <- as.data.frame(cbind(as.numeric(train), temp_train, gas_train))
test_df <- as.data.frame(cbind(as.numeric(test), temp_test, gas_test))
colnames(training_df) <- c("Visitors","Temperature","GasPrice")
colnames(test_df) <- c("Visitors","Temperature","GasPrice")
Check correlations between variables
training_df %>%
GGally::ggpairs(columns = c("Temperature","GasPrice","Visitors"))
# prepare train time series
training_ts <- ts(data = training_df, start= c(1979,1), end=c(2017,12), frequency = 12)
test_ts <- ts(data = test_df, start= c(2018,1), end=c(2018,12), frequency = 12)
head(training_ts)
## Visitors Temperature GasPrice
## Jan 1979 23605 1.951923 0.716
## Feb 1979 31992 18.875000 0.730
## Mar 1979 17813 24.467742 0.755
## Apr 1979 34095 34.273585 0.802
## May 1979 108952 43.016949 0.844
## Jun 1979 313924 51.210526 0.901
#tslm() model - visitors~ temperature
fit_tslm_temp <- tslm(Visitors~Temperature, data = training_ts)
summary(fit_tslm_temp)
##
## Call:
## tslm(formula = Visitors ~ Temperature, data = training_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -277267 -111458 -3260 99869 390721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -344694.7 16138.0 -21.36 <2e-16 ***
## Temperature 16035.9 403.9 39.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 136100 on 466 degrees of freedom
## Multiple R-squared: 0.7718, Adjusted R-squared: 0.7713
## F-statistic: 1576 on 1 and 466 DF, p-value: < 2.2e-16
plot(temp_train, train, type="p" , main ="Number of Visitors vs Temperature", pch = 16, col = "blue", xlab= "temperature", ylab="Number of Visitors")
abline(fit_tslm_temp$coefficients[1],fit_tslm_temp$coefficients[2], lwd=2)
checkresiduals(fit_tslm_temp, test="LB")
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 2097.4, df = 22, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 24
fc_tslm_temp <- forecast(fit_tslm_temp, newdata=data.frame(Temperature = test_ts[,"Temperature"]), h=12)
plot(fc_tslm_temp)
accuracy(fc_tslm_temp, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -4.271197e-12 135803.0 113844.3 -45.930931 258.9865 5.415197
## Test set 8.998130e+04 175778.3 136693.7 6.943095 148.5744 6.502068
## ACF1 Theil's U
## Training set 0.5109284 NA
## Test set 0.4761760 0.9903687
fit_tslm_gas <- tslm(Visitors~GasPrice, data = training_ts)
summary(fit_tslm_gas)
##
## Call:
## tslm(formula = Visitors ~ GasPrice, data = training_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -333796 -202214 -161080 225827 726749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 154226 29427 5.241 2.43e-07 ***
## GasPrice 51659 14973 3.450 0.000611 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 281300 on 466 degrees of freedom
## Multiple R-squared: 0.02491, Adjusted R-squared: 0.02281
## F-statistic: 11.9 on 1 and 466 DF, p-value: 0.0006113
plot( gas_train, train, type="p" , main ="Number of Visitors vs Gas Price", pch = 16, col = "blue", xlab= "Gas Price", ylab="Number of Visitors")
abline(fit_tslm_gas$coefficients[1],fit_tslm_temp$coefficients[2], lwd=2)
checkresiduals(fit_tslm_gas, test="LB")
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 3489.8, df = 22, p-value < 2.2e-16
##
## Model df: 2. Total lags used: 24
fc_tslm_gas <- forecast(fit_tslm_gas, newdata=data.frame(GasPrice = test_ts[,"GasPrice"]), h=12)
plot(fc_tslm_gas)
accuracy(fc_tslm_gas, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 9.653786e-12 280722.8 242925.6 -555.2211 593.0038 11.55517
## Test set 4.738567e+04 358231.1 322652.0 -531.1502 578.3562 15.34749
## ACF1 Theil's U
## Training set 0.7894059 NA
## Test set 0.7311646 2.278045
fit_tslm_both <- tslm(Visitors~Temperature+GasPrice, data = training_ts)
summary(fit_tslm_both)
##
## Call:
## tslm(formula = Visitors ~ Temperature + GasPrice, data = training_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -307693 -114616 -894 98003 382997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375915.8 19503.3 -19.275 < 2e-16 ***
## Temperature 15911.5 403.4 39.441 < 2e-16 ***
## GasPrice 20305.5 7234.4 2.807 0.00521 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135100 on 465 degrees of freedom
## Multiple R-squared: 0.7756, Adjusted R-squared: 0.7746
## F-statistic: 803.6 on 2 and 465 DF, p-value: < 2.2e-16
checkresiduals(fit_tslm_both, test="LB")
##
## Ljung-Box test
##
## data: Residuals from Linear regression model
## Q* = 2185.1, df = 21, p-value < 2.2e-16
##
## Model df: 3. Total lags used: 24
fc_tslm_both <- forecast(fit_tslm_both, newdata=data.frame(Temperature = test_ts[,"Temperature"],GasPrice= test_ts[,"GasPrice"]), h=12)
plot(fc_tslm_both)
accuracy(fc_tslm_both, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -3.190982e-12 134667.0 112605.6 -45.26767 259.4967 5.356278
## Test set 7.029709e+04 166427.5 130448.4 -33.64160 155.7441 6.205004
## ACF1 Theil's U
## Training set 0.5015074 NA
## Test set 0.4728120 0.984068
Inspecting the residuals of the three LSTM models, we can see that there are still time series components within the residuals of TSLM (time series linear model), we can proceed to ARIMAX model.
tslm_eva <- as.data.frame(cbind("Model"=c( "TSLM (Visitors vs Temperature) ", "TSLM (Visitors vs Gas Price) ","TSLM (Visitors vs Temperature + Gas Price) "),
"RMSE" = round( c(accuracy(fc_tslm_temp, test)["Test set",2] , accuracy(fc_tslm_gas, test)["Test set",2],
accuracy(fc_tslm_both, test)["Test set", 2]), 2),
"MAPE(%)" = round( c(accuracy(fc_tslm_temp, test)["Test set",5] , accuracy(fc_tslm_gas, test)["Test set",5],
accuracy(fc_tslm_both, test)["Test set", 5]), 2),
"MASE(%)" = round( c(accuracy(fc_tslm_temp, test)["Test set",6] , accuracy(fc_tslm_gas, test)["Test set",6],
accuracy(fc_tslm_both, test)["Test set", 6]), 2) ))
tslm_eva
Prepare xreg for Arimax
xreg_train <- cbind(temp_train, gas_train)
xreg_test <- cbind(temp_test, gas_test)
# xreg = Temperature and gas price
fit_autoarimax_gas_temp <- auto.arima(y = train, D=1, stepwise=FALSE , approximation=FALSE,xreg =xreg_train, seasonal = TRUE , lambda = lambda_t)
summary(fit_autoarimax_gas_temp)
## Series: train
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 drift xreg1 xreg2
## 0.9463 -0.6843 -0.1938 0.1757 -0.8036 0.0047 0.0419 -0.2144
## s.e. 0.0397 0.0609 0.0488 0.0645 0.0414 0.0017 0.0091 0.1284
##
## sigma^2 estimated as 0.5156: log likelihood=-496.38
## AIC=1010.76 AICc=1011.16 BIC=1047.86
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5869.082 30516.22 17291.93 -3.726119 14.88046 0.8225203
## ACF1
## Training set 0.4103654
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_gas_temp$coef)/sqrt(diag(fit_autoarimax_gas_temp$var.coef))))*2
## ar1 ma1 ma2 sar1 sma1
## 0.000000e+00 0.000000e+00 7.042706e-05 6.485191e-03 0.000000e+00
## drift xreg1 xreg2
## 6.605605e-03 4.060019e-06 9.483444e-02
checkresiduals(fit_autoarimax_gas_temp)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 23.772, df = 16, p-value = 0.0946
##
## Model df: 8. Total lags used: 24
fit_autoarimax_temp <- auto.arima(y = train, D=1, xreg =temp_train, stepwise=FALSE , approximation=FALSE, seasonal = TRUE , lambda = lambda_t )
summary(fit_autoarimax_temp)
## Series: train
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 drift xreg
## 0.9560 -0.6885 -0.1902 0.1816 -0.8066 0.0039 0.0424
## s.e. 0.0304 0.0554 0.0482 0.0642 0.0413 0.0020 0.0091
##
## sigma^2 estimated as 0.5174: log likelihood=-497.67
## AIC=1011.33 AICc=1011.65 BIC=1044.31
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5852.07 30926.69 17412.76 -3.701379 14.89393 0.828268
## ACF1
## Training set 0.4199537
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_temp$coef)/sqrt(diag(fit_autoarimax_temp$var.coef))))*2
## ar1 ma1 ma2 sar1 sma1
## 0.000000e+00 0.000000e+00 7.887126e-05 4.678755e-03 0.000000e+00
## drift xreg
## 5.119174e-02 3.100831e-06
checkresiduals(fit_autoarimax_temp)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 24.19, df = 17, p-value = 0.1144
##
## Model df: 7. Total lags used: 24
fit_autoarimax_gas <- auto.arima(y = train, D=1, xreg =gas_train, stepwise=FALSE , approximation=FALSE, seasonal = TRUE , lambda = lambda_t )
summary(fit_autoarimax_gas)
## Series: train
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ar1 ma1 ma2 sar1 sma1 drift xreg
## 0.9434 -0.6838 -0.1855 0.1552 -0.8012 0.0051 -0.2410
## s.e. 0.0382 0.0598 0.0481 0.0648 0.0418 0.0018 0.1313
##
## sigma^2 estimated as 0.538: log likelihood=-506.72
## AIC=1029.44 AICc=1029.76 BIC=1062.42
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 5832.571 29639.49 16773.97 -3.844213 15.11283 0.7978825
## ACF1
## Training set 0.3971867
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_gas$coef)/sqrt(diag(fit_autoarimax_gas$var.coef))))*2
## ar1 ma1 ma2 sar1 sma1
## 0.0000000000 0.0000000000 0.0001132871 0.0165884570 0.0000000000
## drift xreg
## 0.0041732300 0.0664623723
checkresiduals(fit_autoarimax_gas)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 25.092, df = 17, p-value = 0.09267
##
## Model df: 7. Total lags used: 24
Evaluate model fitness for ARIMAX auto.arima
fc_autoarimax_gas_temp <- forecast(fit_autoarimax_gas_temp, xreg = xreg_test, h=12)
autoplot(fc_autoarimax_gas_temp ) + xlab("Month") + ylab("Average Visitors")
fc_autoarimax_temp <- forecast(fit_autoarimax_temp, xreg = temp_test, h=12)
autoplot(fc_autoarimax_temp ) + xlab("Month") + ylab("Average Visitors")
fc_autoarimax_gas <- forecast(fit_autoarimax_gas, xreg = gas_test, h=12)
autoplot(fc_autoarimax_gas ) + xlab("Month") + ylab("Average Visitors")
auto_arimax_eva <- as.data.frame(cbind("Model"=c( "auto.airmax (Visitors vs Temperature) ", "auto.airmax (Visitors vs Gas Price) ","auto.airmax (Visitors vs Temperature + Gas Price) "),
"RMSE" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",2] , accuracy(fc_autoarimax_gas, test)["Test set",2],
accuracy(fc_autoarimax_gas_temp, test)["Test set", 2]), 2),
"MAPE(%)" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",5] , accuracy(fc_autoarimax_gas, test)["Test set",5],
accuracy(fc_autoarimax_gas_temp, test)["Test set", 5]), 2),
"MASE(%)" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",6] , accuracy(fc_autoarimax_gas, test)["Test set",6],
accuracy(fc_autoarimax_gas_temp, test)["Test set", 6]), 2) ))
auto_arimax_eva
Second ARIMAX model with the order from the best ARIMA model selected above : ARIMA( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12]
fit_Arimax_gas_temp <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 12), xreg =xreg_train, lambda = lambda_t)
summary(fit_Arimax_gas_temp)
## Series: train
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ma1 sar1 xreg1 xreg2
## -0.8998 -0.3970 0.0407 -0.1889
## s.e. 0.0307 0.0437 0.0088 0.1228
##
## sigma^2 estimated as 0.6355: log likelihood=-542.42
## AIC=1094.84 AICc=1094.97 BIC=1115.44
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1973.735 36102.21 20595.89 -3.932287 16.32299 0.9796788
## ACF1
## Training set 0.5575194
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_gas_temp$coef)/sqrt(diag(fit_Arimax_gas_temp$var.coef))))*2
## ma1 sar1 xreg1 xreg2
## 0.000000e+00 0.000000e+00 3.423757e-06 1.241409e-01
checkresiduals(fit_Arimax_gas_temp)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 53.685, df = 20, p-value = 6.442e-05
##
## Model df: 4. Total lags used: 24
# xreg = Temperature only
fit_Arimax_temp <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 12), xreg =temp_train, lambda = lambda_t)
summary(fit_Arimax_temp)
## Series: train
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ma1 sar1 xreg
## -0.8907 -0.3912 0.0413
## s.e. 0.0306 0.0437 0.0088
##
## sigma^2 estimated as 0.6375: log likelihood=-543.56
## AIC=1095.13 AICc=1095.21 BIC=1111.61
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 2022.477 36742.88 20985.84 -3.863416 16.35694 0.9982274
## ACF1
## Training set 0.5634653
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_temp$coef)/sqrt(diag(fit_Arimax_temp$var.coef))))*2
## ma1 sar1 xreg
## 0.000000e+00 0.000000e+00 2.488569e-06
checkresiduals(fit_Arimax_temp)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 54.66, df = 21, p-value = 7.923e-05
##
## Model df: 3. Total lags used: 24
fit_Arimax_gas <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0), period = 12), xreg =gas_train, lambda = lambda_t )
summary(fit_Arimax_gas)
## Series: train
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Box Cox transformation: lambda= 0.1140685
##
## Coefficients:
## ma1 sar1 xreg
## -0.8918 -0.4077 -0.2189
## s.e. 0.0287 0.0434 0.1274
##
## sigma^2 estimated as 0.6641: log likelihood=-552.97
## AIC=1113.93 AICc=1114.02 BIC=1130.41
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 1736.129 34721.27 19901.44 -4.163424 16.57841 0.9466463
## ACF1
## Training set 0.5283516
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_gas$coef)/sqrt(diag(fit_Arimax_gas$var.coef))))*2
## ma1 sar1 xreg
## 0.00000000 0.00000000 0.08571859
checkresiduals(fit_Arimax_gas)
##
## Ljung-Box test
##
## data: Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 52.941, df = 21, p-value = 0.0001402
##
## Model df: 3. Total lags used: 24
Evaluate ARIMAX modesls(( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12]) on test dataset
fc_Arimax_gas_temp <- forecast(fit_Arimax_gas_temp, xreg = xreg_test, h=12)
autoplot(fc_Arimax_gas_temp ) + xlab("Month") + ylab("Average Visitors")
fc_Arimax_temp <- forecast(fit_Arimax_temp, xreg = temp_test, h=12)
autoplot(fc_Arimax_temp ) + xlab("Month") + ylab("Average Visitors")
fc_Arimax_gas <- forecast(fit_Arimax_gas, xreg = gas_test, h=12)
autoplot(fc_Arimax_gas ) + xlab("Month") + ylab("Average Visitors")
Model fitness evaluation summary
Arimax_eva <- as.data.frame(cbind("Model"=c( "ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Temperature) ", "ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Gas Price) ","ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Temperature + Gas Price) "),
"RMSE" = round( c(accuracy(fc_Arimax_temp, test)["Test set",2] , accuracy(fc_Arimax_gas, test)["Test set",2],
accuracy(fc_Arimax_gas_temp, test)["Test set", 2]), 2),
"MAPE(%)" = round( c(accuracy(fc_Arimax_temp, test)["Test set",5] , accuracy(fc_Arimax_gas, test)["Test set",5],
accuracy(fc_Arimax_gas_temp, test)["Test set", 5]), 2),
"MASE(%)" = round( c(accuracy(fc_Arimax_temp, test)["Test set",6] , accuracy(fc_Arimax_gas, test)["Test set",6],
accuracy(fc_Arimax_gas_temp, test)["Test set", 6]), 2) ))
Arimax_eva
arimx_table <- rbind(Arimax_eva, auto_arimax_eva)
arimx_table
From the summary of prediction errors of 8 models below, noted the ARIMAX(0,1,1)(1,1,0)[12] model with gas and temperature as predicting variables has the lowest RMSE, and therefore its the best model
modelname <- c("snaive",
"Holt-Winters' Multi Damped",
"ETS(A,N,A) Box-Cox",
"Arima(0,1,1)(1,1,0)[12]",
"auto.airmax(Temperature)",
"ARIMAX(0,1,1)(1,1,0)[12](Temperature)",
"ARIMAX(0,1,1)(1,1,0)[12](Gas Price)",
"ARIMAX(0,1,1)(1,1,0)[12](Temperature + Gas Price)")
eva_final <- as.data.frame( cbind (modelname ,
"RMSE"= round(c(accuracy(fc.snaive, test)["Test set",2],
accuracy(fit_hwmd, test)["Test set",2],
accuracy(fc_ets_t, test)["Test set", 2],
accuracy(fit_arima_3_fc, test)["Test set", 2],
accuracy(fc_autoarimax_temp, test)["Test set",2],
accuracy(fc_Arimax_temp, test)["Test set",2] ,
accuracy(fc_Arimax_gas, test)["Test set",2],
accuracy(fc_Arimax_gas_temp, test)["Test set", 2]),2),
"MAPE(%)"= round(c(accuracy(fc.snaive, test)["Test set",5],
accuracy(fit_hwmd, test)["Test set",5],
accuracy(fc_ets_t, test)["Test set", 5],
accuracy(fit_arima_3_fc, test)["Test set", 5],
accuracy(fc_autoarimax_temp, test)["Test set",5],
accuracy(fc_Arimax_temp, test)["Test set",5] ,
accuracy(fc_Arimax_gas, test)["Test set",5],
accuracy(fc_Arimax_gas_temp, test)["Test set", 5]),2),
"MASE(%)"= round(c(accuracy(fc.snaive, test)["Test set",6],
accuracy(fit_hwmd, test)["Test set",6],
accuracy(fc_ets_t, test)["Test set", 6],
accuracy(fit_arima_3_fc, test)["Test set", 6],
accuracy(fc_autoarimax_temp, test)["Test set",6],
accuracy(fc_Arimax_temp, test)["Test set",6] ,
accuracy(fc_Arimax_gas, test)["Test set",6],
accuracy(fc_Arimax_gas_temp, test)["Test set", 6]),2)
))
eva_final