Problem Statement

To identify patterns and trends of monthly visitors to Yellowstone National Park and forecast next 12-month visitors

I was very lucky to have a chance to visit the Yellowstone National Park with a friend during the week of July 4th, 2019. I had such a wonderful experience visiting Park. I remember how much I enjoyed walking around the park and seeing those beautiful and colorful hot springs and geysers basins, and how excited I was when we run into our natural friends, bisons and deers, multiple times.

The park is open to the public 24/7 all year round. Every year there are more than 4 millions visitors to the Park, and number is growing year by year(might not held true for 2020 due to Covid). It would be great if we can make up-to-date forecast of visitor volumn to help the park service team to better plan and allocate resources. Being able to predict visitor volumn will help us not only plan for construction and/or maintaining of roads, lodges and campgrouds, but also reserve time for recovery of the Park. Ultmiately we hope to preserve the ecosystem (the natures and wildlife) and sustain the beauty of the Park to all visitors for many generations to come. I had a chance to work with a team to perform time series analysis on number of monthly visitors to Yellowstone National Park and build machine learning models to forecast next 12 month’s number of visitors.

Data Source: * Monthly Visitors to Yellowstone Nation Park: Integrated Resource Management Applications (IRMA) * Month Temperature: National Center for Environmental Information * Gas Price: U.S. Energy Information Administration

Load Necessary Libraries

library("fma")
library("fpp")
library("forecast")
library("TSA")
library("zoo")
library(lubridate)
library("ggplot2")
library("MLmetrics")
library("imputeTS")
library("gdata")
library("readxl")
library("GGally")
#library("MLmetrics")
library(dplyr)
library(Metrics)
library(dygraphs)
library("tseries")
accuracy <- forecast::accuracy
library("urca")
library(tidyverse)
library(tsibble)
library("vars")

Load Data and Preparation

Read the raw dataset and convert it to time series data

rawdata <- read.csv("FINAL_CLEAN_DATASET.csv")
head(rawdata)
dfts <-  ts(rawdata$Visitors ,  start=c(1979, 1), end = c(2018,12),frequency = 12)
head(dfts,24)
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1979  23605  31992  17813  34095 108952 313924 493902 493915 262786  76542
## 1980  24757  36373  28112  32466 101988 325211 557810 523549 229396  93240
##         Nov    Dec
## 1979  12698  22684
## 1980  25773  21594
save(dfts, file= "ts_yellow_stone_park.Rdata")
#load(file = "ts_yellow_stone_park.Rdata")

Train Test Split

Split data into train and test set * Train Set : 1979-2017 (39 years : 468 months ) * Test Set : 2018 (1 year : 12 months)

train <- window(dfts, start = c(1979, 1), end = c(2017, 12), frequency = 12 )
test <- window(dfts, start = c(2018, 1), end = c(2018, 12), frequency = 12 )

Data Exploratory Analysis

Data visualization & basic stats

First Impression of the data : * Compute mean and variance * Create time series plot for number of visitors (monthly)

We can see there is strong seasonality, the variance of data increases over time, and there is tiny upward trend in the data.

mean(train)
## [1] 245301.8
var(train)
## [1] 80991288221
autoplot(train, main = "Monthly Number of Visitors to Yellow Stone National Park",
         xlab = "Year", ylab="Visitors")

Data Decomposition

Let’s take a closer look by breaking down the data and see what are the components. I created Additive and Multiplicative Decomposition plots. Both Decomposition plots show there is a upward linear trend(second component), strong seasonality (third component) and some randomnese (4th component) in the data. Compare between additive and multiplicative decomposition, I think the multiplicative one is a better representation of the data as the last component “random” looks more like a white nosie than that of the additive decomposition.

decom <- decompose(train)
plot(decom)
decom_add <- decompose(train, type = c("additive"))
plot(decom_add)

decom_mul <- decompose(train, type = c("multiplicative"))
plot(decom_mul)   

Seasonality component

Let’s take a look at the seasonality component of the data by creating seasonal plot Here is what we see: - Repeated seasonality within each year - November to April : low season - May to October : busy season with peak in July

seasonplot(train,  type = "o",
            main="Seasonal Plot : Visitors to Yellow Stone Park", ylab = "Visitors",
           xlab = "Month", col = 4)

Box-cox transformation

We noticed that the variance of the data increases over years. Let’s address this with Box-cox transformation.

lambda_t <- BoxCox.lambda(train)
lambda_t
## [1] 0.1140685
cbind("Before Transformation" = train,
      "Box-cox transformed" = BoxCox(train, lambda = "auto")) %>%
  autoplot(facets=TRUE) +
  xlab("Year") + ylab("") +
  ggtitle("Number of Visitors to Yellow Stone Park")

Is the data staionary ?

Check ACF and PACF plots

Here is the ADF and PACF plot of the data. We can see that the data is not stationary as there is sinusoid pattern(sine wave) in ACF plot and the values at lag-1 of lag-2 of PACF are significant. We will come back to this with differencing to remove/reduce non-stationarity.

# Display the data in time-series plot and ACF and PACF plot
ggtsdisplay(train,main='No.Of Visitors')

# Another way to display the data in time-series plot and ACF and PACF plot
tsdisplay(train)

# ACF plot
acf(train,  48)

# PACF plot
pacf(train,  48)

Test for stationary

Augmented Dickey-Fuller Test

  • H0 (Null Hypothesis) : time series is NOT a level or trend stationary univariate time series
  • H1 (Alternate Hypothesis) :time series is a level or trend stationary univariate time series

As the p-value of ADF test is smaller than 0.01 (<0.05), we reject the null hypothesis and accept the alternative hypothesis, meaning that the ADF test tells us that the data is level or trend stationary.

adf.test(train)
## Warning in adf.test(train): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  train
## Dickey-Fuller = -22.189, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

KPSS test

KPSS - short version of the truncation lag
  • H0 (Null Hypothesis) : time series is a level or trend stationary univariate time series
  • H1 (Alternate Hypothesis) : time series is NOT a level or trend stationary univariate time series

The p-value of the KPSS test is slightly greater than 0.05, which suggest that there is no enough evidence to reject the null hypothesis if we are following the p-value as of 0.05 threshold. The KPSS test suggests that data is level or trend stationary.

kpss.test(train,null = c("Level", "Trend"))
## 
##  KPSS Test for Level Stationarity
## 
## data:  train
## KPSS Level = 0.42703, Truncation lag parameter = 5, p-value =
## 0.06551
KPSS Test - long version of the truncation lag

The p-value is smaller than 0.01 once we specify the long version of the truncation lag is used by changing lshort parameter from TRUE(default) to FALSE. Therefore, KPSS test with longer lag suggests the time series is not stationary.

kpss.test(train,null = c("Level", "Trend"), lshort='FALSE')
## Warning in kpss.test(train, null = c("Level", "Trend"), lshort = "FALSE"):
## p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  train
## KPSS Level = 0.84359, Truncation lag parameter = 17, p-value =
## 0.01

We see conflicting results from the three tests above(ADF test, KPSS test and KPSS test with longer lag). While the ACF and PACF plots shows there are signs of auto-coorelation with the data. There is a seasonal pattern which needs to be extracted from the data to improve modelling. We will proceed with differencing to remove the seasonal pattern.

What ACF and PACF plots tell us about stationarity?

Here we can see a sine-curve pattern at the ACF plot with significant spikes at lag-1, lag-12, 24, 36, 48 which indicates at least the time series has seasonal component and is Not stationary.

tsdisplay(train, lag.max = 48)

Differencing to acheive stationarity

Differencing-1: nonseasonal order = 1 , seasonal order = 0

Here I only applied normal differencing with order equal to 1 on the data, however seasonality remains in the data after applying it.

dff.1 <- diff(train, lag = 1, differences =1 )
tsdisplay(dff.1, main="Differencing : nonseasonal order = 1 , seasonal order = 0")

Since seasonality is strong in this time series, it is recommended that seasonal differencing be done first then normal differencing, because sometimes the resulting series will be stationary and there will be no need for further first difference. Let’s try this.

Differencing-2 : nonseasonal order = 0, seasonal order = 1

Applying 1st order seasonal differencing removes the seasonality from the time series. We see there is a significant spikes at lag 1 in the PACF plot , none beyond lag 1 and the ACF is exponentially decayig after we applied 1st order seasonal differencing on the time series. This signals the data may contains an autoregressive component with order of 1.

dff.2 <- diff(train, lag = 12, differences =1 )
tsdisplay(dff.2, main="Differencing nonseasonal order = 0, seasonal order =  1")

Let’s apply another differencing, 1st order normal differencing on the top of 1st-order-seasonal-differenced data.

Differencing-3 : seasonal order = 1, nonseasonal order = 1

We see that the spike at lag-1 on the ACF plot actually points downward, autocorrelation of lag-1 changed from positive to negative, after applying the second differencing (1st order normal differencing), indicating that we might be over-differencing the time series. I decided to pursue only one differencing, first order seasonal differencing, as the differencing of choice.

# 1st order Seasonal Differencing followed by 1st order Normal Differencing
dff.3 <- diff(dff.2, lag = 1, differences = 1)
tsdisplay(dff.3 , main="Differencing : nonseasonal order = 1 , seasonal order =  1")

Final Decided Differencing : nonseasonal order = 0 , seasonal order = 1

dff <- diff(train, lag = 12, differences = 1)
tsdisplay(dff, main="1st order Seasonal Differenced", lag.max = 48)

Modeling

Metrics for Model Evaluation

For this project, I choose the following metrics: * RMSE as the primary metrics * MAPE and AICc(if available) as complementary

Base model- Seasonal naive modele

Train Seasonal naive model

fc.snaive <- snaive(train, h=12)
fc.snaive$model
## Call: snaive(y = train, h = 12) 
## 
## Residual sd: 36371.7189

Forecast by the seasonal naive method

fc.snaive$mean
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 2018  29518  32275  23897  45160 419635 803652 962404 916166 640068 211987
##         Nov    Dec
## 2018  10468  21294
autoplot(fc.snaive)

check model fitness We forecast the number of visitors for next 12 month and compare the forecasted value with our test set.

#check forecast errors
Box.test(fc.snaive, lag = 12,type = c("Ljung-Box"))
## 
##  Box-Ljung test
## 
## data:  fc.snaive
## X-squared = Inf, df = 12, p-value < 2.2e-16
checkresiduals(fc.snaive, lag=12)

## 
##  Ljung-Box test
## 
## data:  Residuals from Seasonal naive method
## Q* = 210.29, df = 12, p-value < 2.2e-16
## 
## Model df: 0.   Total lags used: 12
accuracy(fc.snaive, test)
##                    ME     RMSE      MAE       MPE      MAPE     MASE
## Training set 4876.351 36657.60 21023.11 -2.711473 17.332222 1.000000
## Test set     -127.000 39734.67 22027.33  1.316248  8.099741 1.047768
##                    ACF1  Theil's U
## Training set  0.5400440         NA
## Test set     -0.3172638 0.07413897

How was our base model(seasonal naive) performing ? As the p-value of Ljung_box test is smaller than 0.05, we rejects the null hypothesis that the time series isn’t autocorrelated, and therefore the residuals of snaive mode doesn’t qualified as white noise. Therefore we note that there is room for improvement because there are still patterns within the data that’s not captured by the seasonal naive model.

Exponential Smoothing: ETS & Holt-Winters Method

Simple exponential smoothing

fit_ses <- ses(train, h = 12)
fit_ses$model
## Simple exponential smoothing 
## 
## Call:
##  ses(y = train, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 185751.3911 
## 
##   sigma:  182850.1
## 
##      AIC     AICc      BIC 
## 14222.45 14222.50 14234.90
#check model fitness - ses
checkresiduals(fit_ses, lag=12)

## 
##  Ljung-Box test
## 
## data:  Residuals from Simple exponential smoothing
## Q* = 1230.7, df = 10, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 12
accuracy(fit_ses, test)
##                       ME   RMSE      MAE        MPE      MAPE      MASE
## Training set   -351.4421 182459 130902.1 -106.32939 154.67501  6.226582
## Test set     321623.7473 484053 322864.4   53.41672  62.37515 15.357598
##                   ACF1 Theil's U
## Training set 0.6423947        NA
## Test set     0.7323945  1.053652

Holt’s linear trend method

fit_holt <- holt(train, h=12)
fit_holt$model
## Holt's method 
## 
## Call:
##  holt(y = train, h = 12) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.0067 
## 
##   Initial states:
##     l = -12474.4804 
##     b = 36708.3201 
## 
##   sigma:  184245.2
## 
##      AIC     AICc      BIC 
## 14231.55 14231.68 14252.29
checkresiduals(fit_holt, lag=12)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt's method
## Q* = 1229.6, df = 8, p-value < 2.2e-16
## 
## Model df: 4.   Total lags used: 12
accuracy(fit_holt, test)
##                     ME     RMSE      MAE        MPE      MAPE      MASE
## Training set -11777.05 183456.1 133244.9 -135.69544 171.11398  6.338022
## Test set     321904.37 484261.0 323065.9   54.04817  62.43508 15.367182
##                   ACF1 Theil's U
## Training set 0.6441791        NA
## Test set     0.7324382  1.054261

Holt-Winters seasonal method - additive

fit_hwa <- hw(train, seasonal ="additive", h=12)
fit_hwa$model
## Holt-Winters' additive method 
## 
## Call:
##  hw(y = train, h = 12, seasonal = "additive") 
## 
##   Smoothing parameters:
##     alpha = 0.6892 
##     beta  = 1e-04 
##     gamma = 0.2491 
## 
##   Initial states:
##     l = 185466.1856 
##     b = 395.5252 
##     s = -223719.6 -229754.7 -108230.4 174130.3 444516.3 532535.6
##            304503.9 -23480.65 -217854.5 -226139.9 -209993 -216513.3
## 
##   sigma:  42674.06
## 
##      AIC     AICc      BIC 
## 12874.22 12875.58 12944.75
checkresiduals(fit_hwa)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' additive method
## Q* = 797.06, df = 8, p-value < 2.2e-16
## 
## Model df: 16.   Total lags used: 24
accuracy(fit_hwa, test)
##                      ME     RMSE      MAE       MPE     MAPE     MASE
## Training set  -95.87312 41938.24 30673.87 -19.71565 70.10240 1.459055
## Test set     3763.37805 54314.22 43441.55 -44.00666 82.91316 2.066372
##                   ACF1 Theil's U
## Training set 0.4106732        NA
## Test set     0.4411813 0.3726546

Holt-Winters seasonal method - multiplicative

fit_hwm <- hw(train, seasonal ="multiplicative", h=12)
fit_hwm$model
## Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = train, h = 12, seasonal = "multiplicative") 
## 
##   Smoothing parameters:
##     alpha = 0.0259 
##     beta  = 8e-04 
##     gamma = 0.4204 
## 
##   Initial states:
##     l = 178505.3378 
##     b = 2471.0002 
##     s = 0.211 0.1744 0.4579 1.3074 2.8496 3.1763
##            2.0406 0.7902 0.2277 0.2081 0.3821 0.1747
## 
##   sigma:  0.2239
## 
##      AIC     AICc      BIC 
## 12240.96 12242.32 12311.48
checkresiduals(fit_hwm)

## 
##  Ljung-Box test
## 
## data:  Residuals from Holt-Winters' multiplicative method
## Q* = 149.52, df = 8, p-value < 2.2e-16
## 
## Model df: 16.   Total lags used: 24
accuracy(fit_hwm, test)
##                       ME     RMSE      MAE       MPE      MAPE      MASE
## Training set   -822.4685 34492.88 19821.01 -9.562935 18.933591 0.9428201
## Test set     -17003.4617 43541.78 25905.37 -6.710215  8.773936 1.2322334
##                   ACF1  Theil's U
## Training set 0.5876556         NA
## Test set     0.2041875 0.07354712

Holt-Winters seasonal method - additive - damped

fit_hwad <- hw(train, seasonal ="additive", h=12, damped =TRUE)
fit_hwad$model
## Damped Holt-Winters' additive method 
## 
## Call:
##  hw(y = train, h = 12, seasonal = "additive", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.0881 
##     beta  = 1e-04 
##     gamma = 0.5801 
##     phi   = 0.98 
## 
##   Initial states:
##     l = 178757.1427 
##     b = 813.1658 
##     s = -223719.5 -229754.7 -108230.5 174130.7 444515.8 532535.5
##            304504.5 -23480.72 -218189.7 -226169.8 -209993.7 -216147.9
## 
##   sigma:  38113.43
## 
##      AIC     AICc      BIC 
## 12769.40 12770.92 12844.07
checkresiduals(fit_hwad)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' additive method
## Q* = 359.87, df = 7, p-value < 2.2e-16
## 
## Model df: 17.   Total lags used: 24
accuracy(fit_hwad, test)
##                     ME     RMSE      MAE       MPE     MAPE      MASE
## Training set  1920.405 37414.79 24882.90 -9.814040 40.94888 1.1835977
## Test set     -1648.958 31452.33 19129.21 -5.523559 10.32900 0.9099136
##                    ACF1  Theil's U
## Training set  0.5697702         NA
## Test set     -0.2911233 0.09409559

Holt-Winters seasonal method - multiplicative - damped

fit_hwmd <- hw(train, seasonal ="multiplicative", h=12,  damped =TRUE)
fit_hwmd$model
## Damped Holt-Winters' multiplicative method 
## 
## Call:
##  hw(y = train, h = 12, seasonal = "multiplicative", damped = TRUE) 
## 
##   Smoothing parameters:
##     alpha = 0.031 
##     beta  = 1e-04 
##     gamma = 0.4954 
##     phi   = 0.9722 
## 
##   Initial states:
##     l = 178505.7964 
##     b = 2470.5481 
##     s = 0.2111 0.0251 0.3669 1.4643 2.7496 3.1089
##            2.1438 0.8027 0.2369 0.1557 0.3581 0.3769
## 
##   sigma:  0.2607
## 
##      AIC     AICc      BIC 
## 12360.60 12362.12 12435.27
checkresiduals(fit_hwmd)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' multiplicative method
## Q* = 51.941, df = 7, p-value = 5.997e-09
## 
## Model df: 17.   Total lags used: 24
accuracy(fit_hwmd, test)
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set 2395.3967 33439.33 19288.06 -6.714892 18.197791 0.9174695
## Test set     -972.6206 31175.87 19004.29 -2.013852  6.355149 0.9039715
##                    ACF1  Theil's U
## Training set  0.5620355         NA
## Test set     -0.2160774 0.09439248

Compare base-line model with 6 exponential smoothing models

Compare Test RMSE of holt’s linear , holt-winters’ seasonal models with the seasonal naive model(baseline) * Noticed that the simple exponential smooothing(ses) model and holt’s linear trend model actually are not better than the base model as the RMSEs of ses and holt models are greater than the base model.
* Tow models outperforms the baseline model:Holt-Winters’ additive damped method and Holt-Winters’ multiplicative damped method

modelname <- c("snaive","ses","holt", "hw-add","hw-add-damp", "hw-mul","hw-mul-damp")

as.data.frame( cbind (modelname ,
              "RMSE"= round(c(accuracy(fc.snaive, test)["Test set",2], 
                              accuracy(fit_ses, test)["Test set",2], 
                              accuracy(fit_holt, test)["Test set",2],
                              accuracy(fit_hwa, test)["Test set",2],
                              accuracy(fit_hwad, test)["Test set",2],
                              accuracy(fit_hwm, test)["Test set",2],
                              accuracy(fit_hwmd, test)["Test set",2]),2),
               "MAPE(%)"= round(c(accuracy(fc.snaive, test)["Test set",5], 
                              accuracy(fit_ses, test)["Test set",5], 
                              accuracy(fit_holt, test)["Test set",5],
                              accuracy(fit_hwa, test)["Test set",5],
                              accuracy(fit_hwad, test)["Test set",5],
                              accuracy(fit_hwm, test)["Test set",5],
                              accuracy(fit_hwmd, test)["Test set",5]),2) ))

The best model from the 7 models above is Holt-Winters’ multiplicative damped method. However the residuals of this model doesn’t pass the Ljung-Box test, indicating the residuals is not white noise and there are still patterns remains in the residuals.

checkresiduals(fit_hwmd)

## 
##  Ljung-Box test
## 
## data:  Residuals from Damped Holt-Winters' multiplicative method
## Q* = 51.941, df = 7, p-value = 5.997e-09
## 
## Model df: 17.   Total lags used: 24
plot(fit_hwmd)

State space models for exponential smoothing with ets()

Let’s see what model will be selected by R ets() function.

ETS() method without Box-Cox transformation

The model selected by ETS() is multiplicative Error with multiplicative seasonality.

fit_ets <- ets(train)
summary(fit_ets)
## ETS(M,N,M) 
## 
## Call:
##  ets(y = train) 
## 
##   Smoothing parameters:
##     alpha = 0.1055 
##     gamma = 0.2055 
## 
##   Initial states:
##     l = 221478.739 
##     s = 0.1026 0.109 0.4589 1.5244 3.0229 3.5129
##            1.9837 0.7561 0.1416 0.1114 0.1602 0.1165
## 
##   sigma:  0.2033
## 
##      AIC     AICc      BIC 
## 12117.06 12118.12 12179.28 
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 4920.755 38631.02 22640.88 -5.253336 17.35563 1.076952
##                   ACF1
## Training set 0.5785562
checkresiduals(fit_ets)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(M,N,M)
## Q* = 61.677, df = 10, p-value = 1.743e-09
## 
## Model df: 14.   Total lags used: 24

ETS() method with Box-Cox transformation

Calculate train set lambda value

lambda_t <- BoxCox.lambda(train)
lambda_t
## [1] 0.1140685

The model selected by ETS() with box-cox tranformed training data is Additive Error with additive seasonality.

fit_ets_t <- ets(train, lambda = lambda_t)
summary(fit_ets_t)
## ETS(A,N,A) 
## 
## Call:
##  ets(y = train, lambda = lambda_t) 
## 
##   Box-Cox transformation: lambda= 0.1141 
## 
##   Smoothing parameters:
##     alpha = 0.0983 
##     gamma = 0.257 
## 
##   Initial states:
##     l = 23.157 
##     s = -5.1802 -6.0519 -0.2159 4.7843 7.6906 8.1356
##            6.0903 1.7034 -3.9509 -5.1951 -3.2089 -4.6013
## 
##   sigma:  0.7546
## 
##      AIC     AICc      BIC 
## 2629.784 2630.846 2692.011 
## 
## Training set error measures:
##                    ME     RMSE      MAE      MPE     MAPE      MASE
## Training set 7037.772 34816.71 20324.62 -2.86916 16.02903 0.9667752
##                   ACF1
## Training set 0.5384073
checkresiduals(fit_ets_t)

## 
##  Ljung-Box test
## 
## data:  Residuals from ETS(A,N,A)
## Q* = 49.684, df = 10, p-value = 3.051e-07
## 
## Model df: 14.   Total lags used: 24

Compare ETS models with and withoouot box-cox transformation: * the AICc value is much lower on the ets model when the data is Box-Cox transformed.

as.data.frame(cbind("ETS() model"=c("Not-transformed", "Box-Cox-transformed") , "AICc"= c(fit_ets$aicc, fit_ets_t$aicc)))
  • the prediction error on test set(RMSE and MAPE) is lower for model with box-cox transformation
fc_ets<-forecast(fit_ets ,h=12)

fc_ets_t<-forecast(fit_ets_t ,h=12)

accuracy(fc_ets, test)
##                    ME     RMSE      MAE        MPE      MAPE     MASE
## Training set 4920.755 38631.02 22640.88 -5.2533363 17.355628 1.076952
## Test set     4366.083 42872.09 28546.23 -0.8386341  8.274478 1.357850
##                    ACF1 Theil's U
## Training set  0.5785562        NA
## Test set     -0.0326593 0.1846774
accuracy(fc_ets_t, test)
##                    ME     RMSE      MAE        MPE      MAPE      MASE
## Training set 7037.772 34816.71 20324.62 -2.8691600 16.029027 0.9667752
## Test set     6829.283 38256.02 25839.84 -0.5771483  8.057693 1.2291163
##                     ACF1 Theil's U
## Training set  0.53840729        NA
## Test set     -0.07292969 0.1667021

Based on the Test RMSE, ETS model with Box-Cox transformation is better in forecasting next 12-month visitors than the ETS model without box-cox transformation. However the residual of both models doesn’t pass the white noise test.

** Between Seasonal Naive(base model), Holt-Winters’ Multiplicative Damped method and Exponential Smoothing State Space Model (Additive seasonal component and additive error), the best model is Holt-Winters’ Multiplicative Damped method so far as it has the lowest test error (RMSE and MAPE).**

as.data.frame(cbind("Model"=c( "Seasonal Naive (base model) ", "Holt-Winters' Multi Damped", "ETS(A,N,A) Box-Cox" ) ,
                    "AICc"= round( c( 0/0,  fit_hwmd$model$aicc,  fit_ets_t$aicc) , 2) ,
                    "RMSE" = round( c(accuracy(fc.snaive, test)["Test set",2] , accuracy(fit_hwmd, test)["Test set",2], 
                                      accuracy(fc_ets_t, test)["Test set", 2]), 2),
                    "MAPE(%)" = round( c(accuracy(fc.snaive, test)["Test set",5] , accuracy(fit_hwmd, test)["Test set",5], accuracy(fc_ets_t, test)["Test set", 5]), 2) ))

Arima/sArima model

ACF and PACF plots can help us in choosing an appropriate Arima model

tsdisplay(dfts)

Confirm applying frist order of seasonal differencing gives us a staionary data

dff <- diff(train, lag = 12, differences = 1)
tsdisplay(dff, main="1st order Normal Differenced", lag.max = 48)

Interpreting the ACF and PACF plots above: * Applying 1st order seasonal differencing with lag=12, which means we have seasonal Arima model * Exponential series decaying at ACF plot and sharp cut off at lag=1 on PACF plot implying nonseasonal Auto Regressive AR(1). * Exponential decaying at lag=12, 24, 36 in PACF plot and sharp cut off at lag=12 on ACF plot implying seasonal MA(1) ACF and PACF plots imply a SARIMA (1,0,0)(0,1,1)[12] model

Arima Model 1 : (1,0,0)(0,1,1)[12]

fit_arima_1 <-Arima(train,order=c(1,0,0),seasonal=list(order=c(0,1,1),period=12),lambda = lambda_t)
summary(fit_arima_1)
## Series: train 
## ARIMA(1,0,0)(0,1,1)[12] 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##          ar1     sma1
##       0.3246  -0.6644
## s.e.  0.0468   0.0441
## 
## sigma^2 estimated as 0.5716:  log likelihood=-522.07
## AIC=1050.14   AICc=1050.2   BIC=1062.51
## 
## Training set error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE
## Training set 9346.253 31917.8 18485.82 -1.209553 15.24865 0.8793096
##                   ACF1
## Training set 0.4394491
checkresiduals(fit_arima_1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0)(0,1,1)[12]
## Q* = 43.177, df = 22, p-value = 0.004487
## 
## Model df: 2.   Total lags used: 24

Model Goodness of Fit: Arima Model 1 : (1,0,0)(0,1,1)[12]

fit_arima_1_fc <- forecast(fit_arima_1 ,h = 12)
autoplot(fit_arima_1_fc)+ xlab("Year") + ylab("Number of Visitors")

accuracy(fit_arima_1_fc, test)
##                     ME     RMSE      MAE       MPE      MAPE      MASE
## Training set  9346.253 31917.80 18485.82 -1.209553 15.248653 0.8793096
## Test set     17951.883 39817.83 23685.51  4.281207  7.001718 1.1266418
##                     ACF1 Theil's U
## Training set  0.43944909        NA
## Test set     -0.09280763  0.183352
mase(test,fit_arima_1_fc$mean)
## [1] 0.1393509

Auto.Arima

Then, let’s use auto.arima function

fit_autoarima <- auto.arima(train, stepwise = FALSE, approximation = FALSE, lambda = lambda_t) 
summary(fit_autoarima)
## Series: train 
## ARIMA(1,0,2)(1,1,1)[12] with drift 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##          ar1      ma1      ma2    sar1     sma1   drift
##       0.9540  -0.6878  -0.1822  0.1603  -0.8040  0.0041
## s.e.  0.0295   0.0548   0.0476  0.0646   0.0419  0.0020
## 
## sigma^2 estimated as 0.5405:  log likelihood=-508.26
## AIC=1030.52   AICc=1030.77   BIC=1059.38
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 5805.302 29971.91 16934.01 -3.817256 15.15294 0.805495
##                   ACF1
## Training set 0.4042169

auto.arima: (1,0,2)(1,1,1)[12]

checkresiduals(fit_autoarima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,2)(1,1,1)[12] with drift
## Q* = 25.475, df = 18, p-value = 0.1124
## 
## Model df: 6.   Total lags used: 24
autoplot(fit_autoarima)

Model Goodness of Fit : Arima Model 2 : (1,0,2)(1,1,1)[12]

fit_arima_2_fc <- forecast(fit_autoarima ,h = 12)
autoplot(fit_arima_2_fc)+ xlab("Year") + ylab("Number of Visitors")

accuracy(fit_arima_2_fc, test)
##                    ME     RMSE      MAE       MPE      MAPE     MASE
## Training set 5805.302 29971.91 16934.01 -3.817256 15.152935 0.805495
## Test set     8794.693 42860.17 28273.64  0.330978  7.573627 1.344884
##                     ACF1 Theil's U
## Training set  0.40421686        NA
## Test set     -0.08996232 0.1880815
mase(test,fit_arima_2_fc$mean)
## [1] 0.1663446

Loop for the best ARIMA model

Find the best Arima model by trying different combinations of autoregressive and moving average order and number of normal and seasonal differencing. As we dicussed during the differencing section above, we are sure that there is clear seasonal compoent within the dataset, I set seasonal differencing order to be 1. For normal differencing, I am open to see if 1-order of normal differencing will help the model predict better than without it, so normal differencing is set to either 0 or 1. The orders for AR and MA (both seasonal and normal component) are between 0, 1, 2. I limit the maximum number for any orders to 2 because I would like to keep the model as simple as possible so that the model is more interpretable.

p <- 0:2;  d <- 0:1;  q <- 0:2;  P <- 0:2;  D <- 1;  Q <- 0:2
comb <- as.matrix(expand.grid(p,d,q,P,D,Q))

aicc_list <- numeric(nrow(comb))
rmse_list <- numeric(nrow(comb))
model_arima <- rep(0,nrow(comb))
mape_list <- numeric(nrow(comb))
mase_list <- numeric(nrow(comb))


for(k in 1:nrow(comb)){
  
  arima_model <- try(Arima(train,order=c(comb[k,1], comb[k,2], comb[k,3]),
                          seasonal=list(order=c(comb[k,4], comb[k,5], comb[k,6]),
                          period=12), lambda = lambda_t), silent = TRUE)
  if(!typeof(arima_model) == "character"){
    arima_model <- Arima(train,order=c(comb[k,1], comb[k,2], comb[k,3]),
                          seasonal=list(order=c(comb[k,4], comb[k,5], comb[k,6]),
                          period=12),lambda = lambda_t)
    fc <- forecast(arima_model ,h = 12)
    model_arima[k] <- paste("ARIMA", "(", comb[k,1], "," ,  comb[k,2], ",",
           comb[k,3], ")",  "," , "(",  comb[k,4] ,",", comb[k,5],",", comb[k,6], ")" ,"[12]")
    
    aicc_list[k] <- arima_model$aicc
    rmse_list[k] <- round(accuracy(fc, test)["Test set",2])
    mape_list[k] <- round(accuracy(fc, test)["Test set",5], 2)
    mase_list[k] <- round(mase(test, fc$mean), 4)*100
  }
  else{
      print(paste("Error in Model", "ARIMA", "(", comb[k,1], "," ,  comb[k,2], ",",
           comb[k,3], ")",  "," , "(",  comb[k,4] ,",", comb[k,5],",", comb[k,6], ")" ,"[12]"))
  }
}
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 1 , 2 ) , ( 2 , 1 , 0 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 1 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 1 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 1 , 1 , 2 ) [12]"
## [1] "Error in Model ARIMA ( 2 , 0 , 2 ) , ( 2 , 1 , 2 ) [12]"
## [1] "Error in Model ARIMA ( 1 , 1 , 2 ) , ( 2 , 1 , 2 ) [12]"
arima_table <- as.data.frame(cbind ("ARIMA Model"=model_arima, "AICc" =round(aicc_list,2),
                                    "rmse"=rmse_list, "mape(%)"=mape_list, "mase(%)" = mase_list ) )

arima_table <- arima_table[!(apply(arima_table, 1, function(y) any(y == 0))),]

arima_table$rmse <- as.numeric(as.character(arima_table$rmse))
arima_table$`mape(%)` <- as.numeric(as.character(arima_table$`mape(%)`))
arima_table$`mase(%)` <- as.numeric(as.character(arima_table$`mase(%)`))

arima_table[with(arima_table,order(rmse)),]
arima_table[with(arima_table,order(`mase(%)`)),]

By running a loop to try different combinations of autoregressive and moving average order and number of normal and seasonal differencing, I found that the best ARIMA model is ARIMA ( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12] as it ranks the lowest in test error (using RMSE and MASE metrics).

Arima Model 3 : (0,1,1)(1,1,0)[12]

fit_arima_3 <-Arima(train,order=c(0,1,1), seasonal=list(order=c(1,1,0),period=12), lambda = lambda_t)
summary(fit_arima_3)
## Series: train 
## ARIMA(0,1,1)(1,1,0)[12] 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##           ma1     sar1
##       -0.8830  -0.4024
## s.e.   0.0289   0.0434
## 
## sigma^2 estimated as 0.667:  log likelihood=-554.4
## AIC=1114.79   AICc=1114.85   BIC=1127.16
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 1796.977 35338.46 20336.56 -4.092848 16.6471 0.9673431
##                   ACF1
## Training set 0.5350276
checkresiduals(fit_arima_3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(1,1,0)[12]
## Q* = 54, df = 22, p-value = 0.0001645
## 
## Model df: 2.   Total lags used: 24

Model Goodness of Fit: Arima Model 3 : (0,1,1)(1,1,0)[12]

fit_arima_3_fc <- forecast(fit_arima_3 ,h = 12)
autoplot(fit_arima_3_fc)+ xlab("Year") + ylab("Number of Visitors")

accuracy(fit_arima_3_fc, test)
##                    ME     RMSE      MAE        MPE     MAPE      MASE
## Training set 1796.977 35338.46 20336.56 -4.0928478 16.64710 0.9673431
## Test set     2431.471 28162.62 16351.59 -0.9826062  6.28688 0.7777915
##                    ACF1  Theil's U
## Training set  0.5350276         NA
## Test set     -0.3595344 0.08027541
mase(test,fit_arima_3_fc$mean)
## [1] 0.09620266
arima_eva <- as.data.frame(cbind("Model"=c( "Arima(1,0,0)(0,1,1)[12]", "Auto.Airma","Arima(0,1,1)(1,1,0)[12]"),
                   
                    "RMSE" = round( c(accuracy(fit_arima_1_fc, test)["Test set",2] , accuracy(fit_arima_2_fc, test)["Test set",2], 
                                      accuracy(fit_arima_3_fc, test)["Test set", 2]), 2),
                    "MAPE(%)" = round( c(accuracy(fit_arima_1_fc, test)["Test set",5] , accuracy(fit_arima_2_fc, test)["Test set",5], 
                                      accuracy(fit_arima_3_fc, test)["Test set", 5]), 2),
                    "MASE(%)" = round( c(accuracy(fit_arima_1_fc, test)["Test set",6] , accuracy(fit_arima_2_fc, test)["Test set",6], 
                                      accuracy(fit_arima_3_fc, test)["Test set", 6]), 2) ))

arima_eva

Regression model

Getting temperature and gas price data ready

# weather <- read.csv("/Users/yuanhongzhang/Downloads/2143862.csv")
# 
# #extract year and month from DATE
# weather <- weather %>% mutate(cal_TAVG = (TMAX + TMIN)/2)
# 
# weather$date <- format(as.Date(weather$DATE, format = "%Y-%m-%d"),"%Y-%m")
# 
# #Do a groupby meaning the monthly values
# weather_groupby <- weather %>% group_by(date) %>% summarise(mean=mean(cal_TAVG,na.rm=TRUE))
# 
# #create temperature dataframe and time series
# temp_df <- (as.data.frame(weather_groupby))
# temp_ts <-  ts(temp_df[['mean']] , start=c(1979, 1), end = c(2018,12),frequency=12)
# 
# save(temp_df, file= "/Users/yuanhongzhang/Documents/UChicago/Winter 2020/Time Series/Project/temperature_df.Rdata")
# save(temp_ts, file= "/Users/yuanhongzhang/Documents/UChicago/Winter 2020/Time Series/Project/temperature_ts.Rdata")

Introducing predicting variables - temerature and gas price

Load temperature data

load(file= "temperature_df.Rdata")
load(file= "temperature_ts.Rdata")
#Plot temperature time series
head(temp_df)
autoplot(temp_ts)

#Split temperature into train and test set
temp_train <- temp_df$mean[1:468]
dim(temp_train) <- c(468,1)

temp_test <- temp_df$mean[469:480]
dim(temp_test) <- c(12,1)

Load gas price data

gas_df <- read_excel("gasoline.xlsx")
head(gas_df)

Split gas price dataset into train and test and create dataframe and time series object

gas_train <- gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price`[1:468]
dim(gas_train) <- c(468,1)

gas_test <- gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price`[469:480]
dim(gas_test) <- c(12,1)

gas_ts <- ts(gas_df$`Unleaded Regular Gasoline - U.S. City Average Retail Price` ,  start=c(1979, 1), end = c(2018,12), frequency = 12)
autoplot(gas_ts)

gasts_train <- window(gas_ts, start = c(1979, 1), end = c(2017,12), frequency = 12)
gasts_test <- window(gas_ts, start = c(2017, 1), end = c(2018,12), frequency = 12)
plot( temp_train, train,type="p" , main ="Number of Visitors vs Temperature", pch = 16,col = "blue", xlab= "temperature", ylab="Number of Visitors")

plot( gas_train,train, type="p" , main ="Number of Visitors vs Gas Price", pch = 16,col = "blue", xlab= "GasPrice", ylab="Number of Visitors")

print("Correlation between Number of Visitors vs Temperature: " )
## [1] "Correlation between Number of Visitors vs Temperature: "
cor(train, temp_train)
##           [,1]
## [1,] 0.8785233
print("Correlation between Number of Visitors vs Gas Price: " )
## [1] "Correlation between Number of Visitors vs Gas Price: "
cor(train, gas_train) 
##           [,1]
## [1,] 0.1578205

combining visitors, temperature and gas price into one single dataframe

training_df <- as.data.frame(cbind(as.numeric(train), temp_train, gas_train))
test_df <- as.data.frame(cbind(as.numeric(test), temp_test, gas_test))

colnames(training_df) <- c("Visitors","Temperature","GasPrice")
colnames(test_df) <- c("Visitors","Temperature","GasPrice")

Check correlations between variables

training_df  %>%
  GGally::ggpairs(columns = c("Temperature","GasPrice","Visitors"))

# prepare train time series
training_ts <- ts(data = training_df, start= c(1979,1), end=c(2017,12), frequency = 12)
test_ts <- ts(data = test_df, start= c(2018,1), end=c(2018,12), frequency = 12)
head(training_ts)
##          Visitors Temperature GasPrice
## Jan 1979    23605    1.951923    0.716
## Feb 1979    31992   18.875000    0.730
## Mar 1979    17813   24.467742    0.755
## Apr 1979    34095   34.273585    0.802
## May 1979   108952   43.016949    0.844
## Jun 1979   313924   51.210526    0.901

Model Time Series Linear Model

TSLM (Visitors vs Temperature)

#tslm() model - visitors~ temperature
fit_tslm_temp <- tslm(Visitors~Temperature, data = training_ts)
summary(fit_tslm_temp)
## 
## Call:
## tslm(formula = Visitors ~ Temperature, data = training_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -277267 -111458   -3260   99869  390721 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -344694.7    16138.0  -21.36   <2e-16 ***
## Temperature   16035.9      403.9   39.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 136100 on 466 degrees of freedom
## Multiple R-squared:  0.7718, Adjusted R-squared:  0.7713 
## F-statistic:  1576 on 1 and 466 DF,  p-value: < 2.2e-16
plot(temp_train, train, type="p" , main ="Number of Visitors vs Temperature", pch = 16, col = "blue", xlab= "temperature", ylab="Number of Visitors")
abline(fit_tslm_temp$coefficients[1],fit_tslm_temp$coefficients[2], lwd=2)

checkresiduals(fit_tslm_temp, test="LB")

## 
##  Ljung-Box test
## 
## data:  Residuals from Linear regression model
## Q* = 2097.4, df = 22, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 24
fc_tslm_temp <- forecast(fit_tslm_temp, newdata=data.frame(Temperature = test_ts[,"Temperature"]), h=12)
plot(fc_tslm_temp)

accuracy(fc_tslm_temp, test)
##                         ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -4.271197e-12 135803.0 113844.3 -45.930931 258.9865 5.415197
## Test set      8.998130e+04 175778.3 136693.7   6.943095 148.5744 6.502068
##                   ACF1 Theil's U
## Training set 0.5109284        NA
## Test set     0.4761760 0.9903687

TSLM (Visitors vs Gas Price)

fit_tslm_gas <- tslm(Visitors~GasPrice, data = training_ts)

summary(fit_tslm_gas)
## 
## Call:
## tslm(formula = Visitors ~ GasPrice, data = training_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -333796 -202214 -161080  225827  726749 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   154226      29427   5.241 2.43e-07 ***
## GasPrice       51659      14973   3.450 0.000611 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281300 on 466 degrees of freedom
## Multiple R-squared:  0.02491,    Adjusted R-squared:  0.02281 
## F-statistic:  11.9 on 1 and 466 DF,  p-value: 0.0006113
plot( gas_train, train, type="p" , main ="Number of Visitors vs Gas Price", pch = 16, col = "blue", xlab= "Gas Price", ylab="Number of Visitors")
abline(fit_tslm_gas$coefficients[1],fit_tslm_temp$coefficients[2], lwd=2)

checkresiduals(fit_tslm_gas, test="LB")

## 
##  Ljung-Box test
## 
## data:  Residuals from Linear regression model
## Q* = 3489.8, df = 22, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 24
fc_tslm_gas <- forecast(fit_tslm_gas, newdata=data.frame(GasPrice = test_ts[,"GasPrice"]), h=12)
plot(fc_tslm_gas)

accuracy(fc_tslm_gas, test)
##                        ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 9.653786e-12 280722.8 242925.6 -555.2211 593.0038 11.55517
## Test set     4.738567e+04 358231.1 322652.0 -531.1502 578.3562 15.34749
##                   ACF1 Theil's U
## Training set 0.7894059        NA
## Test set     0.7311646  2.278045

TSLM (Visitors vs Temperature + Gas Price)

fit_tslm_both <- tslm(Visitors~Temperature+GasPrice, data = training_ts)
summary(fit_tslm_both)
## 
## Call:
## tslm(formula = Visitors ~ Temperature + GasPrice, data = training_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -307693 -114616    -894   98003  382997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375915.8    19503.3 -19.275  < 2e-16 ***
## Temperature   15911.5      403.4  39.441  < 2e-16 ***
## GasPrice      20305.5     7234.4   2.807  0.00521 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135100 on 465 degrees of freedom
## Multiple R-squared:  0.7756, Adjusted R-squared:  0.7746 
## F-statistic: 803.6 on 2 and 465 DF,  p-value: < 2.2e-16
checkresiduals(fit_tslm_both, test="LB")

## 
##  Ljung-Box test
## 
## data:  Residuals from Linear regression model
## Q* = 2185.1, df = 21, p-value < 2.2e-16
## 
## Model df: 3.   Total lags used: 24
fc_tslm_both <- forecast(fit_tslm_both, newdata=data.frame(Temperature = test_ts[,"Temperature"],GasPrice= test_ts[,"GasPrice"]), h=12)
plot(fc_tslm_both)

accuracy(fc_tslm_both, test)
##                         ME     RMSE      MAE       MPE     MAPE     MASE
## Training set -3.190982e-12 134667.0 112605.6 -45.26767 259.4967 5.356278
## Test set      7.029709e+04 166427.5 130448.4 -33.64160 155.7441 6.205004
##                   ACF1 Theil's U
## Training set 0.5015074        NA
## Test set     0.4728120  0.984068

Inspecting the residuals of the three LSTM models, we can see that there are still time series components within the residuals of TSLM (time series linear model), we can proceed to ARIMAX model.

tslm_eva <- as.data.frame(cbind("Model"=c( "TSLM (Visitors vs Temperature) ", "TSLM (Visitors vs Gas Price) ","TSLM (Visitors vs Temperature + Gas Price) "),
                   
                    "RMSE" = round( c(accuracy(fc_tslm_temp, test)["Test set",2] , accuracy(fc_tslm_gas, test)["Test set",2], 
                                      accuracy(fc_tslm_both, test)["Test set", 2]), 2),
                    "MAPE(%)" = round( c(accuracy(fc_tslm_temp, test)["Test set",5] , accuracy(fc_tslm_gas, test)["Test set",5], 
                                      accuracy(fc_tslm_both, test)["Test set", 5]), 2),
                    "MASE(%)" = round( c(accuracy(fc_tslm_temp, test)["Test set",6] , accuracy(fc_tslm_gas, test)["Test set",6], 
                                      accuracy(fc_tslm_both, test)["Test set", 6]), 2) ))

tslm_eva

ArimaX (regression)

Prepare xreg for Arimax

xreg_train <- cbind(temp_train, gas_train)
xreg_test <- cbind(temp_test, gas_test)

Build ARIMAX with auto.arima

ARIMAX with auto.arima model (xreg = Temperature and Gas Price)

# xreg = Temperature and gas price
fit_autoarimax_gas_temp <- auto.arima(y = train, D=1,  stepwise=FALSE , approximation=FALSE,xreg =xreg_train,  seasonal = TRUE , lambda = lambda_t)
summary(fit_autoarimax_gas_temp)
## Series: train 
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##          ar1      ma1      ma2    sar1     sma1   drift   xreg1    xreg2
##       0.9463  -0.6843  -0.1938  0.1757  -0.8036  0.0047  0.0419  -0.2144
## s.e.  0.0397   0.0609   0.0488  0.0645   0.0414  0.0017  0.0091   0.1284
## 
## sigma^2 estimated as 0.5156:  log likelihood=-496.38
## AIC=1010.76   AICc=1011.16   BIC=1047.86
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 5869.082 30516.22 17291.93 -3.726119 14.88046 0.8225203
##                   ACF1
## Training set 0.4103654
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_gas_temp$coef)/sqrt(diag(fit_autoarimax_gas_temp$var.coef))))*2
##          ar1          ma1          ma2         sar1         sma1 
## 0.000000e+00 0.000000e+00 7.042706e-05 6.485191e-03 0.000000e+00 
##        drift        xreg1        xreg2 
## 6.605605e-03 4.060019e-06 9.483444e-02
checkresiduals(fit_autoarimax_gas_temp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 23.772, df = 16, p-value = 0.0946
## 
## Model df: 8.   Total lags used: 24

ARIMAX (xreg = Temperature)

fit_autoarimax_temp <- auto.arima(y = train, D=1,  xreg =temp_train, stepwise=FALSE , approximation=FALSE,  seasonal = TRUE , lambda = lambda_t )
summary(fit_autoarimax_temp)
## Series: train 
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##          ar1      ma1      ma2    sar1     sma1   drift    xreg
##       0.9560  -0.6885  -0.1902  0.1816  -0.8066  0.0039  0.0424
## s.e.  0.0304   0.0554   0.0482  0.0642   0.0413  0.0020  0.0091
## 
## sigma^2 estimated as 0.5174:  log likelihood=-497.67
## AIC=1011.33   AICc=1011.65   BIC=1044.31
## 
## Training set error measures:
##                   ME     RMSE      MAE       MPE     MAPE     MASE
## Training set 5852.07 30926.69 17412.76 -3.701379 14.89393 0.828268
##                   ACF1
## Training set 0.4199537
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_temp$coef)/sqrt(diag(fit_autoarimax_temp$var.coef))))*2
##          ar1          ma1          ma2         sar1         sma1 
## 0.000000e+00 0.000000e+00 7.887126e-05 4.678755e-03 0.000000e+00 
##        drift         xreg 
## 5.119174e-02 3.100831e-06
checkresiduals(fit_autoarimax_temp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 24.19, df = 17, p-value = 0.1144
## 
## Model df: 7.   Total lags used: 24

ARIMAX (xreg = Gas)

fit_autoarimax_gas <- auto.arima(y = train, D=1,  xreg =gas_train, stepwise=FALSE , approximation=FALSE,  seasonal = TRUE , lambda = lambda_t )
summary(fit_autoarimax_gas)
## Series: train 
## Regression with ARIMA(1,0,2)(1,1,1)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##          ar1      ma1      ma2    sar1     sma1   drift     xreg
##       0.9434  -0.6838  -0.1855  0.1552  -0.8012  0.0051  -0.2410
## s.e.  0.0382   0.0598   0.0481  0.0648   0.0418  0.0018   0.1313
## 
## sigma^2 estimated as 0.538:  log likelihood=-506.72
## AIC=1029.44   AICc=1029.76   BIC=1062.42
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 5832.571 29639.49 16773.97 -3.844213 15.11283 0.7978825
##                   ACF1
## Training set 0.3971867
print( "p-values of model parameters" )
## [1] "p-values of model parameters"
(1-pnorm(abs(fit_autoarimax_gas$coef)/sqrt(diag(fit_autoarimax_gas$var.coef))))*2
##          ar1          ma1          ma2         sar1         sma1 
## 0.0000000000 0.0000000000 0.0001132871 0.0165884570 0.0000000000 
##        drift         xreg 
## 0.0041732300 0.0664623723
checkresiduals(fit_autoarimax_gas)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,0,2)(1,1,1)[12] errors
## Q* = 25.092, df = 17, p-value = 0.09267
## 
## Model df: 7.   Total lags used: 24

Evaluate model fitness for ARIMAX auto.arima

fc_autoarimax_gas_temp <- forecast(fit_autoarimax_gas_temp, xreg = xreg_test, h=12)
autoplot(fc_autoarimax_gas_temp ) + xlab("Month") + ylab("Average Visitors")

fc_autoarimax_temp <- forecast(fit_autoarimax_temp, xreg = temp_test, h=12)
autoplot(fc_autoarimax_temp ) + xlab("Month") + ylab("Average Visitors")

fc_autoarimax_gas <- forecast(fit_autoarimax_gas, xreg = gas_test, h=12)
autoplot(fc_autoarimax_gas ) + xlab("Month") + ylab("Average Visitors")

auto_arimax_eva <- as.data.frame(cbind("Model"=c( "auto.airmax (Visitors vs Temperature) ", "auto.airmax (Visitors vs Gas Price) ","auto.airmax (Visitors vs Temperature + Gas Price) "),
                   
                    "RMSE" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",2] , accuracy(fc_autoarimax_gas, test)["Test set",2], 
                                      accuracy(fc_autoarimax_gas_temp, test)["Test set", 2]), 2),
                    "MAPE(%)" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",5] , accuracy(fc_autoarimax_gas, test)["Test set",5], 
                                      accuracy(fc_autoarimax_gas_temp, test)["Test set", 5]), 2),
                    "MASE(%)" = round( c(accuracy(fc_autoarimax_temp, test)["Test set",6] , accuracy(fc_autoarimax_gas, test)["Test set",6], 
                                      accuracy(fc_autoarimax_gas_temp, test)["Test set", 6]), 2) ))

auto_arimax_eva

ARIMAX( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12]

Second ARIMAX model with the order from the best ARIMA model selected above : ARIMA( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12]

ARIMAX( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12] xreg = Temperature and gas price

fit_Arimax_gas_temp <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0),  period = 12), xreg =xreg_train,   lambda = lambda_t)
summary(fit_Arimax_gas_temp)
## Series: train 
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##           ma1     sar1   xreg1    xreg2
##       -0.8998  -0.3970  0.0407  -0.1889
## s.e.   0.0307   0.0437  0.0088   0.1228
## 
## sigma^2 estimated as 0.6355:  log likelihood=-542.42
## AIC=1094.84   AICc=1094.97   BIC=1115.44
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 1973.735 36102.21 20595.89 -3.932287 16.32299 0.9796788
##                   ACF1
## Training set 0.5575194
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_gas_temp$coef)/sqrt(diag(fit_Arimax_gas_temp$var.coef))))*2
##          ma1         sar1        xreg1        xreg2 
## 0.000000e+00 0.000000e+00 3.423757e-06 1.241409e-01
checkresiduals(fit_Arimax_gas_temp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 53.685, df = 20, p-value = 6.442e-05
## 
## Model df: 4.   Total lags used: 24

ARIMAX( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12] xreg = Temperature

# xreg = Temperature only
fit_Arimax_temp <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0),  period = 12),  xreg =temp_train,  lambda = lambda_t)
summary(fit_Arimax_temp)
## Series: train 
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##           ma1     sar1    xreg
##       -0.8907  -0.3912  0.0413
## s.e.   0.0306   0.0437  0.0088
## 
## sigma^2 estimated as 0.6375:  log likelihood=-543.56
## AIC=1095.13   AICc=1095.21   BIC=1111.61
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2022.477 36742.88 20985.84 -3.863416 16.35694 0.9982274
##                   ACF1
## Training set 0.5634653
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_temp$coef)/sqrt(diag(fit_Arimax_temp$var.coef))))*2
##          ma1         sar1         xreg 
## 0.000000e+00 0.000000e+00 2.488569e-06
checkresiduals(fit_Arimax_temp)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 54.66, df = 21, p-value = 7.923e-05
## 
## Model df: 3.   Total lags used: 24

ARIMAX( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12] xreg = gas price

fit_Arimax_gas <- Arima(y = train, order = c(0,1,1), seasonal = list(order = c(1,1,0),  period = 12),  xreg =gas_train,  lambda = lambda_t )
summary(fit_Arimax_gas)
## Series: train 
## Regression with ARIMA(0,1,1)(1,1,0)[12] errors 
## Box Cox transformation: lambda= 0.1140685 
## 
## Coefficients:
##           ma1     sar1     xreg
##       -0.8918  -0.4077  -0.2189
## s.e.   0.0287   0.0434   0.1274
## 
## sigma^2 estimated as 0.6641:  log likelihood=-552.97
## AIC=1113.93   AICc=1114.02   BIC=1130.41
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 1736.129 34721.27 19901.44 -4.163424 16.57841 0.9466463
##                   ACF1
## Training set 0.5283516
# check p-values of model parameters
(1-pnorm(abs(fit_Arimax_gas$coef)/sqrt(diag(fit_Arimax_gas$var.coef))))*2
##        ma1       sar1       xreg 
## 0.00000000 0.00000000 0.08571859
checkresiduals(fit_Arimax_gas)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(0,1,1)(1,1,0)[12] errors
## Q* = 52.941, df = 21, p-value = 0.0001402
## 
## Model df: 3.   Total lags used: 24

Evaluate ARIMAX modesls(( 0 , 1 , 1 ) , ( 1 , 1 , 0 ) [12]) on test dataset

fc_Arimax_gas_temp <- forecast(fit_Arimax_gas_temp, xreg = xreg_test, h=12)
autoplot(fc_Arimax_gas_temp ) + xlab("Month") + ylab("Average Visitors")

fc_Arimax_temp <- forecast(fit_Arimax_temp, xreg = temp_test, h=12)
autoplot(fc_Arimax_temp ) + xlab("Month") + ylab("Average Visitors")

fc_Arimax_gas <- forecast(fit_Arimax_gas, xreg = gas_test, h=12)
autoplot(fc_Arimax_gas ) + xlab("Month") + ylab("Average Visitors")

Model fitness evaluation summary

Arimax_eva <- as.data.frame(cbind("Model"=c( "ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Temperature) ", "ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Gas Price) ","ARIMAX(0,1,1)(1,1,0)[12] (Visitors vs Temperature + Gas Price) "),
                   
                    "RMSE" = round( c(accuracy(fc_Arimax_temp, test)["Test set",2] , accuracy(fc_Arimax_gas, test)["Test set",2], 
                                      accuracy(fc_Arimax_gas_temp, test)["Test set", 2]), 2),
                    "MAPE(%)" = round( c(accuracy(fc_Arimax_temp, test)["Test set",5] , accuracy(fc_Arimax_gas, test)["Test set",5], 
                                      accuracy(fc_Arimax_gas_temp, test)["Test set", 5]), 2),
                    "MASE(%)" = round( c(accuracy(fc_Arimax_temp, test)["Test set",6] , accuracy(fc_Arimax_gas, test)["Test set",6], 
                                      accuracy(fc_Arimax_gas_temp, test)["Test set", 6]), 2) ))

Arimax_eva
arimx_table <- rbind(Arimax_eva, auto_arimax_eva)
arimx_table

Conclusion

From the summary of prediction errors of 8 models below, noted the ARIMAX(0,1,1)(1,1,0)[12] model with gas and temperature as predicting variables has the lowest RMSE, and therefore its the best model

modelname <- c("snaive",
               "Holt-Winters' Multi Damped", 
               "ETS(A,N,A) Box-Cox",
               "Arima(0,1,1)(1,1,0)[12]",
               "auto.airmax(Temperature)",
               "ARIMAX(0,1,1)(1,1,0)[12](Temperature)", 
               "ARIMAX(0,1,1)(1,1,0)[12](Gas Price)",
               "ARIMAX(0,1,1)(1,1,0)[12](Temperature + Gas Price)")

eva_final <- as.data.frame( cbind (modelname ,
              "RMSE"= round(c(accuracy(fc.snaive, test)["Test set",2], 
                              accuracy(fit_hwmd, test)["Test set",2],
                              accuracy(fc_ets_t, test)["Test set", 2],
                              accuracy(fit_arima_3_fc, test)["Test set", 2],
                              accuracy(fc_autoarimax_temp, test)["Test set",2],
                              accuracy(fc_Arimax_temp, test)["Test set",2] ,
                              accuracy(fc_Arimax_gas, test)["Test set",2], 
                              accuracy(fc_Arimax_gas_temp, test)["Test set", 2]),2),
              
               "MAPE(%)"= round(c(accuracy(fc.snaive, test)["Test set",5], 
                              accuracy(fit_hwmd, test)["Test set",5],
                              accuracy(fc_ets_t, test)["Test set", 5],
                              accuracy(fit_arima_3_fc, test)["Test set", 5],
                              accuracy(fc_autoarimax_temp, test)["Test set",5],
                              accuracy(fc_Arimax_temp, test)["Test set",5] , 
                              accuracy(fc_Arimax_gas, test)["Test set",5], 
                              accuracy(fc_Arimax_gas_temp, test)["Test set", 5]),2),
              
               "MASE(%)"= round(c(accuracy(fc.snaive, test)["Test set",6], 
                              accuracy(fit_hwmd, test)["Test set",6],
                              accuracy(fc_ets_t, test)["Test set", 6],
                              accuracy(fit_arima_3_fc, test)["Test set", 6],
                              accuracy(fc_autoarimax_temp, test)["Test set",6],
                              accuracy(fc_Arimax_temp, test)["Test set",6] , 
                              accuracy(fc_Arimax_gas, test)["Test set",6], 
                              accuracy(fc_Arimax_gas_temp, test)["Test set", 6]),2)
              ))



eva_final