ARIMAX/SARIMAX/VAR

Literature Review

Transportation demand forecasting has been thoroughly covered in the literature. In (Banerjee, Morton, and Akartunalı 2020) the authors perform a systematic review of forecasting techniques used in scheduled passenger transportation. They found papers forecasting passenger demand for transportation primarily used the following transportation methods:

Time Series

Arima, GARCH, etc

Artificial Intelligence

Neural Networks, SVM etc

Causal Methods

Regression, Logit, etc

Mixed Models

Stated Preference, Analytical Hierarchy Process, etc

Papers most commonly focused on Airport demand, followed by Railway and Bus demand. The authors found that for short term forecasting: Pickup Models, ETS, and ARIMA models were most commonly used. In long term forecasting ARIMA and Neural networks were commonly successful in modeling.

Many of the time series forecasting models use historical demand and economic data to forecast future demand. In AI models, much more data is needed, however results can often be better. Finally, in mixed models sometimes papers use expert judgement to make adjustments to forecasts by hand that would not be done by only the model.

SARIMAX Model with Capital Bikeshare and Weather Data

In this section daily capital bikeshare ridership will be modeled using both daily high temperature data and daily precipitation data. Daily bike ridership are likely highly dependent on these two variables because people are more likely to bike when the weather is dry and warm. Therefore, it makes sense to use these two time series as predictors of daily bike ridership.

Code

# combine time series into 1
df = data.frame(date = cabi$date, cabi = cabi_time, temp = noaa_temp, prcp = noaa_prcp )
df_ts = ts(df, start = 2015, frequency = 365.25)
# plot time series
autoplot(df_ts[,c(2:4)], facets = TRUE)+xlab("Date")+ylab("")+ggtitle("Weather Variables influencing Bikeshare Ridership")

Model Fitting with auto.arima

Returns and optimal model of (5,1,1)

Code

xreg = cbind(temp = df_ts[,"temp"],
             prcp = df_ts[,"prcp"])
fit = auto.arima(df_ts[,"cabi"], xreg = xreg)
summary(fit)

Series: df_ts[, "cabi"] 
Regression with ARIMA(5,1,1) errors 

Coefficients:
         ar1      ar2      ar3      ar4      ar5      ma1     temp       prcp
      0.1707  -0.1245  -0.0812  -0.0630  -0.0848  -0.7869  30.5393  -238.6874
s.e.  0.0421   0.0277   0.0271   0.0248   0.0258   0.0371   5.6064   117.7454

sigma^2 = 4299360:  log likelihood = -26168.21
AIC=52354.42   AICc=52354.48   BIC=52408.14

Training set error measures:
                   ME    RMSE      MAE       MPE     MAPE     MASE        ACF1
Training set 5.612454 2070.26 1508.775 -11.25083 26.63336 0.520278 0.002366138

Code

checkresiduals(fit)


    Ljung-Box test

data:  Residuals from Regression with ARIMA(5,1,1) errors
Q* = 1765.9, df = 572, p-value < 2.2e-16

Model df: 6.   Total lags used: 578

Fitting Model Manually

Optimal model is (0,1,2)(0,1,0) based on both AIC and BIC

Code

# Fit linear regression
fit_reg = lm(cabi~prcp+temp, data = df)
summary(fit_reg)


Call:
lm(formula = cabi ~ prcp + temp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9385.0 -1821.3   244.8  1946.1 11958.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -209.739    204.429  -1.026    0.305    
prcp        -1077.186    148.298  -7.264 4.82e-13 ***
temp          133.023      2.874  46.293  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2784 on 2888 degrees of freedom
Multiple R-squared:  0.4279,    Adjusted R-squared:  0.4275 
F-statistic:  1080 on 2 and 2888 DF,  p-value: < 2.2e-16

Code

# residuals
res = ts(residuals(fit_reg), start = 2015)
acf(res, main = "")
title("ACF of Residuals")

Code

pacf(res, main = "")
title("PACF of Residuals")

Code

# Checking seasonal and normal differencing
res %>% diff() %>% diff(365) %>% ggtsdisplay()

Code

SARIMA.c=function(p1,p2,q1,q2,P1,P2,Q1,Q2,d1,d2,data){
  
  temp=c()
  d=1
  D=1
  s=12
  
  i=1
  temp= data.frame()
  ls=matrix(rep(NA,9*56),nrow=56)
  
  
  for (p in p1:p2)
  {
    for(q in q1:q2)
    {
      for(P in P1:P2)
      {
        for(Q in Q1:Q2)
        {
          for(d in d1:d2)
       
        {
          if(p+d+q+P+D+Q<=8)
          {
            
            model<- Arima(data,order=c(p-1,d,q-1),seasonal=c(P-1,D,Q-1))
            ls[i,]= c(p-1,d,q-1,P-1,D,Q-1,model$aic,model$bic,model$aicc)
            i=i+1
            #print(i)
            
          }
          
        }
      }
    }
    
  }
  
  }
  temp= as.data.frame(ls)
  names(temp)= c("p","d","q","P","D","Q","AIC","BIC","AICc")
  
  temp
  
}

##q=1,3 Q=1 , p=1,2, P=1,2 d=0,1 

output=SARIMA.c(p1=1,p2=5,q1=1,q2=5,P1=1,P2=5,Q1=1,Q2=5,d1=0,d2=2,data=res)
# Optimal AIC
output[which.min(output$AIC),]

   p d q P D Q      AIC      BIC     AICc
31 0 1 2 0 1 0 52736.64 52754.55 52736.65

Code

# Optimal BIC
output[which.min(output$BIC),]

   p d q P D Q      AIC      BIC     AICc
31 0 1 2 0 1 0 52736.64 52754.55 52736.65

Code

# model diagnostics
fit_opt = Arima(res,  order=c(0,1,2),seasonal=c(0,1,0))
checkresiduals(fit_opt)


    Ljung-Box test

data:  Residuals from ARIMA(0,1,2)
Q* = 67.952, df = 8, p-value = 1.255e-11

Model df: 2.   Total lags used: 10

Cross Validation

Optimal model is (0,1,2,0,1,0) after comparing two best models using cross validation

Code

cv_1 <- function(x, h){forecast(Arima(x, order=c(0,1,2), seasonal = c(0,1,0)), h=h)}
e1 = tsCV(res,cv_1, h = 1)
paste("RMSE 1 Step Ahead (0,1,2,0,1,0):", round(sqrt(mean(e1^2, na.rm =T)),1))

[1] "RMSE 1 Step Ahead (0,1,2,0,1,0): 2225.2"

Code

cv_2 <- function(x, h){forecast(Arima(x, order=c(0,1,1), seasonal = c(0,1,0)), h=h)}
e2 = tsCV(res,cv_2, h = 1)
paste("RMSE 1 Step Ahead Manual Fit (0,1,1,0,1,0):", round(sqrt(mean(e2^2, na.rm =T)),1))

[1] "RMSE 1 Step Ahead Manual Fit (0,1,1,0,1,0): 2244.6"

Fitting With Arima Model

Code

fit = Arima(df_ts[,"cabi"], order = c(0,1,2), seasonal = c(0,1,0), xreg = xreg)
summary(fit)

Series: df_ts[, "cabi"] 
Regression with ARIMA(0,1,2)(0,1,0)[365] errors 

Coefficients:
          ma1      ma2     temp       prcp
      -0.7112  -0.1869  39.8078  -359.8156
s.e.   0.0230   0.0224   5.7249   123.3157

sigma^2 = 8161143:  log likelihood = -23677.58
AIC=47365.17   AICc=47365.19   BIC=47394.34

Training set error measures:
                    ME     RMSE      MAE       MPE     MAPE      MASE
Training set -2.253237 2667.705 1861.564 -9.341714 30.24737 0.6419317
                      ACF1
Training set -0.0005653654

Forecasting

This is the 90 day forecast for capital bikeshare usage based on weather data. The trend is definitely what we would expect to see as it continues the seasonality of previous years and lines up with ridership we would expect to see between December and February. The confidence bands are still quite large indicating that there is still significant variation in the data.

Code

# Forecast temp
temp_fit = auto.arima(noaa_temp)

Warning: The time series frequency has been rounded to support seasonal
differencing.

Code

temp_fcast = forecast(temp_fit, 90)
# forecast prcp
prcp_fit = auto.arima(noaa_prcp)
prcp_fcast = forecast(prcp_fit, 90)

# combine input variable forecasts
fxreg = cbind(temp = temp_fcast$mean,
              prcp = prcp_fcast$mean)

fcast_cabi = forecast(fit, xreg = fxreg, h = 90)
autoplot(fcast_cabi) + xlab("Date")+ylab("Ridership")+ggtitle("90 Day Capital Bikeshare Usage Forecast")

VAR Model with Capital Bikeshare, Metro Rail, and Metro Bus

Here we will model a multivariate time series consisting of Capital Bikeshare monthly ridership, metro rail monthly time series, and metro bus monthly time series. The time period is limited between January 2015 and November 2022 to ensure there is no missing data for any of the time series. It makes sense to model these time series together because each one measures ridership of a mode of public transportation in Washington DC. It is reasonable to assume that the ridership of one mode of transportation affects ridership of other modes of transportation and that changes in trend of ridership for each mode likely change together as well.

Exploratory Analysis

Code

# Create var data by combining three timeseries
var_data = window(ts.union(cabi_month_ts, metro_ts, bus_ts), start = 2015, end = c(2022, 11))
# Plot Timeseries of all 3 variables
autoplot(var_data, facets = T) + labs(title = "Ridership Time Series Plots")+theme_bw()

Code

# Plot Pairs
pairs(cbind(Bikeshare = cabi_month$month_count, Metro = metro$Riders_Total, Bus = bus$Riders_Total))

Optimal Lag (p)

Using the VARselect function we find that the choices for p are 1,4,14.

Code

VARselect(var_data, lag.max = 14, type = "const")

$selection
AIC(n)  HQ(n)  SC(n) FPE(n) 
    14      4      1     14 

$criteria
                  1            2            3            4            5
AIC(n) 7.665936e+01 7.648081e+01 7.619044e+01 7.588474e+01 7.598135e+01
HQ(n)  7.680168e+01 7.672987e+01 7.654625e+01 7.634729e+01 7.655064e+01
SC(n)  7.701409e+01 7.710159e+01 7.707727e+01 7.703762e+01 7.740028e+01
FPE(n) 1.962643e+33 1.643449e+33 1.232355e+33 9.119470e+32 1.011897e+33
                  6            7            8            9           10
AIC(n) 7.594599e+01 7.602083e+01 7.590571e+01 7.588557e+01 7.561236e+01
HQ(n)  7.662203e+01 7.680361e+01 7.679523e+01 7.688183e+01 7.671537e+01
SC(n)  7.763097e+01 7.797186e+01 7.812279e+01 7.836870e+01 7.836154e+01
FPE(n) 9.874599e+32 1.080546e+33 9.829609e+32 9.894104e+32 7.789432e+32
                 11           12           13           14
AIC(n) 7.539374e+01 7.531991e+01 7.517538e+01 7.506961e+01
HQ(n)  7.660349e+01 7.663641e+01 7.659862e+01 7.659959e+01
SC(n)  7.840897e+01 7.860119e+01 7.872271e+01 7.888300e+01
FPE(n) 6.532922e+32 6.397669e+32 5.908258e+32 5.753409e+32

Fit all potential models

There is still residual serial correlation in all the models, however p=4 is the best.

Code

var1 <- VAR(var_data, p=1, type="const")
var4 <- VAR(var_data, p=4, type="const")
var14 <- VAR(var_data, p=14, type="const")
#summary(var1)
serial.test(var1, lags.pt=10, type="PT.asymptotic")


    Portmanteau Test (asymptotic)

data:  Residuals of VAR object var1
Chi-squared = 180.13, df = 81, p-value = 1.679e-09

Code

#summary(var4)
serial.test(var4, lags.pt=10, type="PT.asymptotic")


    Portmanteau Test (asymptotic)

data:  Residuals of VAR object var4
Chi-squared = 84.296, df = 54, p-value = 0.005218

Code

#summary(var14)
serial.test(var14, lags.pt=14, type="PT.asymptotic")


    Portmanteau Test (asymptotic)

data:  Residuals of VAR object var14
Chi-squared = 91.547, df = 0, p-value < 2.2e-16

Code

ggAcf(residuals(var4)) +  ggtitle("ACF of Var(4) Residuals") + theme_bw()

Cross Validation

By Performing cross validation we find that models with p = 1 and p = 4 have signficantly smaller error than p = 14. Model 1 outperforms Model 2 slightly, however due to the serial correlation of model 1, we will choose model 2 (p=4) as the optimal model.

Code

errors1 = matrix(NA, 58, 3)
errors2 = matrix(NA, 58,3)
errors3 = matrix(NA, 58,3)
st = tsp(var_data)[1] + 3
for (i in 1:58) { 
  xtrain <- window(var_data, end=st + i/12-1)
  xtest <- window(var_data, start=st + (i/12-1/12) + 1/12, end=st + i/12)
  fit <- VAR(xtrain, p=1, type='const')
  fcast <- predict(fit, n.ahead = 1)
  cabi_fcast = fcast$fcst$cabi_month_ts[,1]
  metro_fcast = fcast$fcst$metro_ts[,1]
  bus_fcast = fcast$fcst$bus_ts[,1]
  errors1[i,1] = sqrt((cabi_fcast-xtest[,1])^2)
  errors1[i,2] = sqrt((metro_fcast-xtest[,2])^2)
  errors1[i,3] = sqrt((bus_fcast-xtest[,3])^2)
  fit2 <- VAR(xtrain, p=4, type='const')
  fcast <- predict(fit2, n.ahead = 1)
  cabi_fcast = fcast$fcst$cabi_month_ts[,1]
  metro_fcast = fcast$fcst$metro_ts[,1]
  bus_fcast = fcast$fcst$bus_ts[,1]
  errors2[i,1] = sqrt((cabi_fcast-xtest[,1])^2)
  errors2[i,2] = sqrt((metro_fcast-xtest[,2])^2)
  errors2[i,3] = sqrt((bus_fcast-xtest[,3])^2)
  fit3 <- VAR(xtrain, p=14, type='const')
  fcast <- predict(fit3, n.ahead = 1)
  cabi_fcast = fcast$fcst$cabi_month_ts[,1]
  metro_fcast = fcast$fcst$metro_ts[,1]
  bus_fcast = fcast$fcst$bus_ts[,1]
  errors3[i,1] = sqrt((cabi_fcast-xtest[,1])^2)
  errors3[i,2] = sqrt((metro_fcast-xtest[,2])^2)
  errors3[i,3] = sqrt((bus_fcast-xtest[,3])^2)
}

date = as.Date(time(ts(errors1, start = 2018, frequency =12)))
errors1 = data.frame(date, errors1)
date = as.Date(time(ts(errors2, start = 2018, frequency =12)))
errors2 = data.frame(date, errors2)
date = as.Date(time(ts(errors3, start = 2018, frequency =12)))
errors3 = data.frame(date, errors3)

Code

plot(errors1$date, errors1$X1, type = "l", col = "red", main = "Errors for Bikeshare Usage", xlab = "Date", ylab = "Error")
lines(errors2$date, errors2$X1, type = "l", col = "blue")
lines(errors3$date, errors3$X1, type = "l", col = "green")
legend("topleft",legend = c("Fit1", "Fit2", "Fit3"), col = c("red", "blue", "green"), lty =1)

Code

plot(errors1$date, errors1$X2, type = "l", col = "red", main = "Errors for Metro Usage", xlab = "Date", ylab = "Error")
lines(errors2$date, errors2$X2, type = "l", col = "blue")
lines(errors3$date, errors3$X2, type = "l", col = "green")
legend("topleft",legend = c("Fit1", "Fit2", "Fit3"), col = c("red", "blue", "green"), lty =1)

Code

plot(errors1$date, errors1$X3, type = "l", col = "red", main = "Errors for Bus Usage", xlab = "Date", ylab = "Error")
lines(errors2$date, errors2$X3, type = "l", col = "blue")
lines(errors3$date, errors3$X3, type = "l", col = "green")
legend("topleft",legend = c("Fit1", "Fit2", "Fit3"), col = c("red", "blue", "green"), lty =1)

Code

x1 = errors1 %>% 
  summarise(Bikeshare_mean_error = mean(X1),Metro_mean_error = mean(X2), Bus_mean_error = mean(X3))
x2 = errors2 %>% 
  summarise(Bikeshare_mean_error = mean(X1),Metro_mean_error = mean(X2), Bus_mean_error = mean(X3))
x3 =errors3 %>% 
  summarise(Bikeshare_mean_error = mean(X1,na.rm = T),Metro_mean_error = mean(X2,na.rm=T), Bus_mean_error = mean(X3,na.rm =T))

errors = data.frame(rbind(x1[1,],x2[1,],x3[1,]))
errors$fit = c("Fit1", "Fit2","Fit3")
errors

  Bikeshare_mean_error Metro_mean_error Bus_mean_error  fit
1             77263.56          4948463        3288663 Fit1
2             78347.48          5014544        3193301 Fit2
3            119260.26          9007298        4711002 Fit3

Forecasting

Here we plot the 1 year forecasts for each time series using the optimal var model (p=4). Both Metro Rail and Metro Bus ridership are expected to increase over the next year which makes sense because they are still recovering from the decrease in demand during the pandemic. The capital bikeshare forecast captures the seasonality of that time series and is expected to remain similar in ridership to the previous years. Seasonality is not seen in the metro rail or metro bus forecast. The confidence bands are smaller than in the previous univariate time series model as we can now explain more of the variation in the data.

Code

fit= VAR(var_data, p =4, type = "const")
autoplot(forecast(fit, h = 12))+theme_bw()

References

Banerjee, Nilabhra, Alec Morton, and Kerem Akartunalı. 2020. “Passenger Demand Forecasting in Scheduled Transportation.” European Journal of Operational Research 286 (3): 797–810. https://doi.org/https://doi.org/10.1016/j.ejor.2019.10.032.