2.3 Trend Extraction

2.3.1 Linear trend model

The most obvious thing to do is to see if a linear trend might fit the data. In this case it’s likely to be a poor approximation, as the previous plot should make obvious. However, it will be an instructive benchmark for the finer modeling choices we’ll subsequently make.

We’ll use the built-in function lm() What is happening under the hood is a simple linear regression, which you don’t need to know much about at this stage. Specifically, we will use the OLS method. which relates a predictand y to a matrix of predictors X, called the design matrix.

First, we’ll omit the rows with NA values

df1 <- na.omit(df1)

CO2model <- lm(CO2~Date, data = df1)
summary(CO2model)
## 
## Call:
## lm(formula = CO2 ~ Date, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2678 -3.6110 -0.9021  3.1466 12.6589 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.887e+03  1.826e+01  -158.1   <2e-16 ***
## Date         1.630e+00  9.171e-03   177.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.88 on 785 degrees of freedom
## Multiple R-squared:  0.9758, Adjusted R-squared:  0.9757 
## F-statistic: 3.159e+04 on 1 and 785 DF,  p-value: < 2.2e-16

For more information about what each term means, see this post

What information can you glean from this summary? What features would you check for in a visual inspection?

Let us look at the plot, shall we? How well does the time variable predict CO2?

df1 <-df1 %>% 
  mutate(linear_model = CO2model$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=CO2)) + 
  geom_line() +
  geom_line(data=df1, mapping=aes(x=Date, y=linear_model), inherit.aes = F, color="red") +
  labs(title = "Mauna Loa Dataset (linear model)",
       y="CO2 (ppm)") +
  theme_light()

We see that the line captures the first order behavior of the series, but that leaves a lot to be desired. To zoom in, we can look at the residuals (original values - fitted values) as a function of fitted values:

df1 <-df1 %>% 
  mutate(linear_model_resids = CO2 - CO2model$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=linear_model_resids)) + 
  geom_line() +
  labs(title = "Linear Model Residuals Fit",
       y="Residuals") +
  theme_light()

To be quantitative, we can look at the Root Mean Squared Error of these residuals:

sqrt(mean((df1$CO2 - CO2model$fitted.values)^2))
## [1] 4.874121

This means that this linear trend is, on average, about 5ppm away from the measurements. Can we do better?

2.3.2 Quadratic fit

The second thing one might do is to add a quadratic term.

CO2_quadratic_model <- lm(CO2 ~ Date + I(Date^2), data = df1)

summary(CO2_quadratic_model)
## 
## Call:
## lm(formula = CO2 ~ Date + I(Date^2), data = df1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.929 -1.781  0.099  1.814  5.013 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.035e+04  9.756e+02   51.61   <2e-16 ***
## Date        -5.184e+01  9.800e-01  -52.90   <2e-16 ***
## I(Date^2)    1.343e-02  2.461e-04   54.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.229 on 784 degrees of freedom
## Multiple R-squared:  0.9949, Adjusted R-squared:  0.9949 
## F-statistic: 7.718e+04 on 2 and 784 DF,  p-value: < 2.2e-16
df1 <-df1 %>% 
  mutate(quadratic_model = CO2_quadratic_model$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=CO2)) + 
  geom_line() +
  geom_line(data=df1, mapping=aes(x=Date, y=quadratic_model), inherit.aes = F, color="red") +
  labs(title = "Mauna Loa Dataset (quadratic model)",
       y="CO2 (ppm)") +
  theme_light()

Let’s see the residuals and the RMSE for this model

sqrt(mean((df1$CO2 - CO2_quadratic_model$fitted.values)^2))
## [1] 2.225216
df1 <-df1 %>% 
  mutate(quadratic_model_resids = CO2 - CO2_quadratic_model$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=quadratic_model_resids)) + 
  geom_line() +
  labs(title = "Quadratic Model Residuals Fit",
       y="Residuals") +
  theme_light()

Does this improve the fit? By how much have we shrunk the RMSE of residuals?

It should be obvious from the simple visual inspection of the original series that there is a strong periodic component we are not currently capturing. It turns out to the a simple seasonal cycle: in spring/summer, the growth of terrestrial biomass in the Northern hemisphere sequesters CO2 away from the atmosphere, so the atmospheric CO2 concentrations goes down; the reverse happens in the Fall/Winter, where the fallen leaves are degraded and their organic carbon returned to mineral (CO2) form via respiration.

2.3.3 Harmonic waves

Let us thus define harmonic waves (sines and cosines) with a period of 1 year and add them to the design matrix.

xc<-cos(2*pi*df1$Date)
xs<-sin(2*pi*df1$Date)
fit.lm <- lm(CO2~Date + I(Date^2) + xc + xs, data = df1)

summary(fit.lm)
## 
## Call:
## lm(formula = CO2 ~ Date + I(Date^2) + xc + xs, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.61346 -0.66340 -0.00194  0.61067  2.55221 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.042e+04  4.108e+02  122.72   <2e-16 ***
## Date        -5.191e+01  4.127e-01 -125.80   <2e-16 ***
## I(Date^2)    1.344e-02  1.036e-04  129.75   <2e-16 ***
## xc          -9.994e-01  4.737e-02  -21.10   <2e-16 ***
## xs           2.675e+00  4.729e-02   56.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9388 on 782 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 2.185e+05 on 4 and 782 DF,  p-value: < 2.2e-16
sqrt(mean((df1$CO2 - fit.lm$fitted.values)^2))
## [1] 0.9358151
df1 <-df1 %>% 
  mutate(harmonic_model = fit.lm$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=CO2)) + 
  geom_line() +
  geom_line(data=df1, mapping=aes(x=Date, y=harmonic_model), inherit.aes = F, color="red") +
  labs(title = "Mauna Loa Dataset (quadratic + harmonic model)",
       y="CO2 (ppm)") +
  theme_light()

And the residuals…

df1 <-df1 %>% 
  mutate(harmonic_model_resids = CO2 - fit.lm$fitted.values)

ggplot(data=df1, mapping=aes(x=Date, y=harmonic_model_resids)) + 
  geom_line() +
  labs(title = "Quadratic+Harmonic Model Residuals Fit",
       y="Residuals") +
  theme_light()

This should look better. Does it? By how much have we shrunk the RMSE of residuals (compared to the linear model)?

100*(1-sqrt(mean((df1$CO2 - fit.lm$fitted.values)^2))/sqrt(mean((df1$CO2 - CO2model$fitted.values)^2)))
## [1] 80.80033

80% improvement in the RMSE, that’s a big improvement!

2.3.4 Automated model

The forecast package can figure this out for us.

arima_model <- forecast::auto.arima(df1$CO2, stepwise = F, approximation = F)

summary(arima_model)
## Series: df1$CO2 
## ARIMA(4,1,1) with drift 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4      ma1   drift
##       1.4940  -0.8538  0.0971  -0.0886  -0.8898  0.1369
## s.e.  0.0375   0.0645  0.0643   0.0369   0.0126  0.0077
## 
## sigma^2 = 0.4665:  log likelihood = -814.05
## AIC=1642.1   AICc=1642.24   BIC=1674.77
## 
## Training set error measures:
##                         ME      RMSE       MAE          MPE      MAPE      MASE
## Training set -0.0001132448 0.6799978 0.5402048 -0.003388041 0.1510635 0.4803613
##                     ACF1
## Training set -0.02630124
df1 <-df1 %>% 
  mutate(arima_model = arima_model$fitted)

ggplot(data=df1, mapping=aes(x=Date, y=CO2)) + 
  geom_line() +
  geom_line(data=df1, mapping=aes(x=Date, y=arima_model), inherit.aes = F, color="red") +
  labs(title = "Mauna Loa Dataset (arima_model)",
       y="CO2 (ppm)") +
  theme_light()

This is an improvement on ou best RMSE with little work, but it’s a black box. This model is probably overfit, but we could turn some knobs to help prevent that if we spent some time with it.

Let’s make some use of our newfound time series modeling skills.