Chapter 1: Introduction to Timeseries#

by Julien Emile-Geay, University of Southern California

Background#

Recall that a timeseries \(X(t)\) is an ordered data sequence, either continuous or discrete. On a computer, everything is discretized, so regardless of the original process, what we have is a discrete sequence \(X_1, \cdots, X_n\), with each \(X_i\) corresponding to a time \(t_i\) in \(t_1, \cdots, t_n\).

In timeseries analysis, it is common to consider \(X(t)\) as the sum of one or more oscillatory, stationary components (or “signals”), a trend \(q(t)\), and noise \(\varepsilon_t\). The oscillatory components are often the reason we’re doing the analysis in the first place: the analysis will tell us how big they are and what their frequencies are.
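For concreteness, this additive decomposition can be written as

\[ X(t) = \sum_k S_k(t) + q(t) + \varepsilon_t, \]

where the \(S_k\) denote the oscillatory signals.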

Whether or not those oscillatory components are present, a trend is often present as well. In today’s world the word “trend” is so overused that it has lost all meaning. Here, it will refer to a slowly-evolving, non-stationary component that dominates the behavior of the timeseries. For instance: a linear increase or decrease; a sinusoidal component so slow that its period is not resolved by the dataset; a nonlinear trend like the exponential increase of the Keeling curve (see below).

As for noise, its definition is clearly subjective: under this name we usually subsume any variable component in which we are not interested. Some of it may be actual measurement error (noise in the strict sense), but some of it could be what another analyst would call “signal”. If you study climate, daily fluctuations are noise; if you study weather, they are your signal. As the saying goes, one analyst’s signal is another analyst’s noise. This noise is often modeled as a Normal random process with zero mean, \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\) (aka “Gaussian white noise”).

To see these concepts in action, let us consider the iconic Keeling curve from the Mauna Loa Observatory \(CO_2\) measurements. The data are available here. In addition to the usual scientific Python stack, we will be using pandas and statsmodels. Let us load these packages:

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
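While we have these tools out, here is a minimal sketch of what the Gaussian white noise described above looks like (the seed and sample size are arbitrary choices):

rng = np.random.default_rng(42) # seeded generator, for reproducibility
eps = rng.normal(loc=0, scale=1, size=500) # 500 draws from N(0,1)
plt.plot(eps, linewidth=0.8)
plt.title('Gaussian white noise')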

Data Wrangling#

We use pandas to load the data, which are in csv. Pandas can load directly over the network, so all we need to give it is a URL of the file (here, https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv). Let’s try it:

url = 'https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv'
df = pd.read_csv(url)
df.head() # this will show the first few rows of the dataset
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[2], line 2
      1 url = 'https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv'
----> 2 df = pd.read_csv(url)
      3 df.head()

[... long traceback through pandas internals elided ...]

ParserError: Error tokenizing data. C error: Expected 1 fields in line 35, saw 4

Question#

Q: What went wrong? To debug, you’ll have to download the file and look at it (do it). What do you see that a machine might have problems with?

Hint: https://www.askpython.com/python-modules/pandas/pandas-read-csv-with-headers

A: This csv file has a long header. pandas needs to be told how many lines to skip.

Another issue is that the column names span two rows in the file, which pandas won’t be able to handle on its own. So we need to give it two pieces of information:

  1. How long the header is

  2. What the column names are (which we can lift from the header manually)

We do this by passing two extra arguments to read_csv():

cols = ['Yr','Mn','XLDate','Date','CO2','seasonally adjusted','fit','seasonally adjusted fit','CO2 filled','seasonally adjusted filled','Sta']
df = pd.read_csv(url, header=64, names=cols) # skip the 64-line header and supply the column names ourselves
df.head() # display the top
     Yr  Mn  XLDate       Date     CO2  seasonally adjusted     fit  seasonally adjusted fit  CO2 filled  seasonally adjusted filled  Sta
0  1958   2   21231  1958.1260  -99.99               -99.99  -99.99                   -99.99      -99.99                      -99.99  MLO
1  1958   3   21259  1958.2027  315.71               314.44  316.20                   314.91      315.71                      314.44  MLO
2  1958   4   21290  1958.2877  317.45               315.16  317.30                   314.99      317.45                      315.16  MLO
3  1958   5   21320  1958.3699  317.51               314.69  317.89                   315.07      317.51                      314.69  MLO
4  1958   6   21351  1958.4548  -99.99               -99.99  317.27                   315.15      317.27                      315.15  MLO

By default, pandas indexed the rows by their number (starting at 0, as Python does). Let’s extract the actual dates, convert them to a usable format, and re-index the dataframe:

date = pd.to_datetime(df['XLDate'], unit='D', origin='1899-12-30') # convert from Excel's weird standard to something human-readable
date.name = None
df.index=date
df.head()
              Yr  Mn  XLDate       Date     CO2  seasonally adjusted     fit  seasonally adjusted fit  CO2 filled  seasonally adjusted filled  Sta
1958-02-15  1958   2   21231  1958.1260  -99.99               -99.99  -99.99                   -99.99      -99.99                      -99.99  MLO
1958-03-15  1958   3   21259  1958.2027  315.71               314.44  316.20                   314.91      315.71                      314.44  MLO
1958-04-15  1958   4   21290  1958.2877  317.45               315.16  317.30                   314.99      317.45                      315.16  MLO
1958-05-15  1958   5   21320  1958.3699  317.51               314.69  317.89                   315.07      317.51                      314.69  MLO
1958-06-15  1958   6   21351  1958.4548  -99.99               -99.99  317.27                   315.15      317.27                      315.15  MLO
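If the origin='1899-12-30' argument looks mysterious: Excel stores dates as day counts in its own serial system, and this particular origin accounts for Excel’s quirks (including its spurious leap day in 1900). A quick sanity check on the first serial date in the table:

pd.to_datetime(21231, unit='D', origin='1899-12-30') # Timestamp('1958-02-15'), matching the first row above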

Next, let’s extract the column that we want to play with and make a new data frame out of it:

co2 = df[['Date','CO2 filled']].copy() # .copy() avoids pandas' SettingWithCopyWarning on the next line
co2 = co2.rename(columns={'Date': 'Year', 'CO2 filled': 'CO2'}) # rename columns for clarity
co2.head()
                 Year     CO2
1958-02-15  1958.1260  -99.99
1958-03-15  1958.2027  315.71
1958-04-15  1958.2877  317.45
1958-05-15  1958.3699  317.51
1958-06-15  1958.4548  317.27

One last thing: the first and last values are missing, and they are encoded as -99.99, which is rather annoying (but unfortunately common; NaN would be a much less disruptive choice), so let’s restrict our analysis to a safe, 66-year interval:

co2 = co2.loc['1958-03-15':'2024-02-15']

Now we can sit back and admire our work.

Question#

Q: plot CO2 as a function of time

Hint: use pandas’s plot function.

A:

co2['CO2'].plot(linewidth=1,ylabel='CO2 [ppm]',title="Mauna Loa Dataset (latest)")
[Figure: Mauna Loa Dataset (latest); CO2 [ppm] as a function of time]


2. Trend Extraction#

2.1 Linear trend model#

The most obvious thing to do is to see if a linear trend might fit the data. In this case it’s likely to be a poor approximation, as the previous plot should make obvious. However, it will be an instructive benchmark for the finer modeling choices we’ll subsequently make.

Let’s first load the statsmodels API (API = Application Programming Interface) to use its packages. What is happening under the hood is a simple linear regression, which you don’t need to know much about at this stage. Specifically, we will use the OLS method, which relates a predictand \(y\) to a matrix of predictors \(X\), called the design matrix.
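For reference (you may take this on faith for now), ordinary least squares posits \(y = X\beta + \varepsilon\) and estimates the coefficients as

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y, \]

the choice of \(\beta\) that minimizes the sum of squared residuals.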

t = co2['Year']
y = co2['CO2']
X = np.vander(t, 2) # set up design matrix with 1, t as predictors

import statsmodels.api as sm
lin_mdl = sm.OLS(y, X).fit() # fit the OLS model to the observations

Let’s pause and contemplate this for a moment.

First, we set up the design matrix X as a Vandermonde matrix; there is nothing particularly profound here; it is just a nice way to set this up in one function call. Second, we fit the model, and that just involved importing a module and calling a fit() function. You have NO IDEA how lucky you are to have grown up in an age where fitting statistical models is that easy. Just like democracy, this is something that has cost the blood, sweat and tears of countless generations before you, so you had better be grateful!
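If you are curious what np.vander actually produced, here is a tiny illustration (the sample values are arbitrary):

np.vander([1., 2., 3.], 3) # columns are t^2, t, 1 (highest power first)
# array([[1., 1., 1.],
#        [4., 2., 1.],
#        [9., 3., 1.]])

So in our two-column case, the first column is \(t\) (the slope term, reported as x1 below) and the second is the constant.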

But there’s more. statsmodels is an example of object-oriented programming: every model object created by this code enjoys a wide array of properties and utilities.

The first is a stunningly gorgeous summary of the model fit:

print(lin_mdl.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CO2   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.976
Method:                 Least Squares   F-statistic:                 3.162e+04
Date:                Tue, 21 May 2024   Prob (F-statistic):               0.00
Time:                        15:37:39   Log-Likelihood:                -2382.0
No. Observations:                 792   AIC:                             4768.
Df Residuals:                     790   BIC:                             4777.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.6262      0.009    177.832      0.000       1.608       1.644
const      -2879.4377     18.210   -158.127      0.000   -2915.183   -2843.693
==============================================================================
Omnibus:                       46.565   Durbin-Watson:                   0.066
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.317
Skew:                           0.535   Prob(JB):                     1.44e-10
Kurtosis:                       2.524   Cond. No.                     2.08e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.08e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

For more information about what each term means, see this post.

What information can you glean from this summary? What features would you check for in a visual inspection?

[ANSWER HERE]

Let us look at the plot, shall we? How well does the time variable predict \(CO_2\)?

Next, we can use the predict() method to predict \(CO_2\) given the design matrix (whose columns are \(t\) and \(1\)):

yhat = lin_mdl.predict(X) 
plt.plot(t, y, label="MLO data",linewidth=1)
plt.plot(t, yhat, label=r"$CO_2 = \beta_0 + \beta_1 t$", color='black', linewidth=2)
plt.ylabel(r"$CO_2$")
plt.xlabel("Date")
plt.legend()
[Figure: MLO data with the fitted linear trend in black]

We see that the line captures the first-order behavior of the series, but it leaves a lot to be desired. To zoom in, we can look at the residuals (observed values minus fitted values) as a function of fitted values:

fig, ax = plt.subplots()
ax.plot(lin_mdl.fittedvalues,lin_mdl.resid,linewidth=1)
ax.set_title('Residuals vs Fitted Values')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
[Figure: Residuals vs Fitted Values, linear model]

To be quantitative, we can look at the Root Mean Squared Error of these residuals:

print('RMSE = ' + '{:4.4f}'.format(np.sqrt(lin_mdl.mse_resid)) + ' ppm')
RMSE = 4.9033 ppm

This means that this linear trend is, on average, about 5 ppm away from the measurements. Can we do better?
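As a side note, mse_resid is the sum of squared residuals divided by the residual degrees of freedom, so the same number can be recovered by hand:

print(np.sqrt(np.sum(lin_mdl.resid**2) / lin_mdl.df_resid)) # matches np.sqrt(lin_mdl.mse_resid)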

2.2 Quadratic fit#

The second thing one might do is to add a quadratic term. Again, the Vandermonde matrix makes this very easy:

X = np.vander(t, 3) # set up design matrix with 1, t, and t^2 as predictors
quad_mdl = sm.OLS(y, X).fit()
print(quad_mdl.summary())
# plot predicted values
yhat = quad_mdl.predict(X) 
plt.plot(t, y, label="Raw Data",linewidth=1)
plt.plot(t, yhat, label=r"$CO_2 = \beta_0 + \beta_1 t + \beta_2 t^2$", color='black', linewidth=2)
plt.ylabel(r"$CO_2$")
plt.xlabel("Date")
plt.legend()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CO2   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                 7.812e+04
Date:                Tue, 21 May 2024   Prob (F-statistic):               0.00
Time:                        15:38:01   Log-Likelihood:                -1756.7
No. Observations:                 792   AIC:                             3519.
Df Residuals:                     789   BIC:                             3533.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0134      0.000     55.121      0.000       0.013       0.014
x2           -51.8912      0.971    -53.446      0.000     -53.797     -49.985
const        5.04e+04    966.564     52.140      0.000    4.85e+04    5.23e+04
==============================================================================
Omnibus:                      118.499   Durbin-Watson:                   0.317
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.044
Skew:                          -0.108   Prob(JB):                     2.99e-07
Kurtosis:                       2.071   Cond. No.                     4.84e+10
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
[Figure: Raw data with the fitted quadratic trend in black]
fig, ax = plt.subplots()
ax.plot(quad_mdl.fittedvalues,quad_mdl.resid,linewidth=1)
ax.set_title('Residuals vs Fitted Values')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
print('RMSE = ' + '{:4.4f}'.format(np.sqrt(quad_mdl.mse_resid)) + ' ppm')
RMSE = 2.2277 ppm
[Figure: Residuals vs Fitted Values, quadratic model]

Question 3: Does this improve the fit? By how much have we shrunk the RMSE of the residuals?

Comment in detail on how the addition of a quadratic term has changed the fit and residual plots above. What issues have been solved? What issues remain?

[ANSWERS HERE]

3. Adding Periodic components#

It should be obvious from a simple visual inspection of the original series that there is a strong periodic component we are not currently capturing. It turns out to be a simple seasonal cycle: in spring/summer, the growth of terrestrial biomass sequesters CO2 away from the atmosphere, so the atmospheric CO2 concentration goes down; the reverse happens in fall/winter, when fallen leaves are degraded and their organic carbon is returned to mineral (CO2) form via respiration.

Let us thus define harmonic waves (sines and cosines) with a period of 1 year and add them to the design matrix.
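The regression model thus becomes

\[ CO_2(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \cos(2\pi t) + \beta_4 \sin(2\pi t) + \varepsilon_t, \]

with \(t\) in fractional years. Note that a cosine/sine pair with free coefficients is equivalent to a single annual sinusoid of unknown amplitude \(\sqrt{\beta_3^2+\beta_4^2}\) and phase, which is what lets ordinary least squares fit a wave whose phase we don't know in advance.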

H = np.column_stack((np.cos(2*np.pi*t),np.sin(2*np.pi*t)))
Xh = np.concatenate((np.vander(t, 3),H),axis=1)
seas_mdl = sm.OLS(y, Xh).fit()
print(seas_mdl.summary())
# plot predicted values
yhat = seas_mdl.predict(Xh) 
plt.plot(t, y, label="Raw Data",linewidth=1)
plt.plot(t, yhat, label=r"$CO_2 = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \cos(2\pi t) + \beta_4 \sin(2\pi t)$", color='black', linewidth=2)
plt.ylabel(r"$CO_2$")
plt.xlabel("Date")
plt.legend()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CO2   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 2.213e+05
Date:                Tue, 21 May 2024   Prob (F-statistic):               0.00
Time:                        15:38:15   Log-Likelihood:                -1070.5
No. Observations:                 792   AIC:                             2151.
Df Residuals:                     787   BIC:                             2174.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0134      0.000    130.916      0.000       0.013       0.014
x2           -51.8860      0.409   -126.934      0.000     -52.688     -51.084
const       5.039e+04    406.930    123.831      0.000    4.96e+04    5.12e+04
x3            -1.0027      0.047    -21.233      0.000      -1.095      -0.910
x4             2.6689      0.047     56.733      0.000       2.577       2.761
==============================================================================
Omnibus:                        8.817   Durbin-Watson:                   0.529
Prob(Omnibus):                  0.012   Jarque-Bera (JB):                6.328
Skew:                           0.090   Prob(JB):                       0.0423
Kurtosis:                       2.600   Cond. No.                     4.84e+10
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+10. This might indicate that there are
strong multicollinearity or other numerical problems.
[Figure: Raw data with the fitted quadratic + seasonal model in black]
fig, ax = plt.subplots()
ax.plot(seas_mdl.fittedvalues,seas_mdl.resid,linewidth=1, alpha = 0.6)
ax.set_title('Residuals vs Fitted Values')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
print('RMSE = ' + '{:4.4f}'.format(np.sqrt(seas_mdl.mse_resid)) + ' ppm')
RMSE = 0.9379 ppm
[Figure: Residuals vs Fitted Values, quadratic + seasonal model]

Question 4

This should look better. Does it? By how much have we shrunk the RMSE of residuals?

print('{:4.4f}'.format(100*(1-np.sqrt(seas_mdl.mse_resid/lin_mdl.mse_resid))))
80.8727

Answer: over 80% improvement!
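As an aside, the fitted harmonic pair implies a seasonal amplitude we can read off directly from the coefficients; here is a minimal sketch (relying on the coefficient order shown in the summary above):

b = np.asarray(seas_mdl.params) # coefficient order: x1, x2, const, x3, x4, as in the summary
amp = np.hypot(b[3], b[4]) # amplitude of the annual sinusoid, in ppm
print('Seasonal amplitude = ' + '{:4.2f}'.format(amp) + ' ppm')

With the coefficients reported above, this comes out to roughly 2.9 ppm, i.e. a peak-to-trough seasonal swing of nearly 6 ppm.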

Seasonal-Trend decomposition using LOESS (STL)#

As it turns out, statsmodels already contains much more sophisticated tools for this sort of work. Witness the STL decomposition in action:

from statsmodels.tsa.seasonal import STL
co2_s = pd.Series(
    co2['CO2'].values, index=pd.date_range("3-15-1958", periods=len(co2), freq="M"), name="CO2"
)
stl = STL(co2_s, seasonal=13)
stl_res = stl.fit()
fig = stl_res.plot()
[Figure: STL decomposition of the CO2 series: observed, trend, seasonal, and residual panels]
rmse = np.sqrt(np.square(stl_res.resid.values).mean()) # use the public .resid attribute
print('RMSE = ' + '{:4.4f}'.format(rmse) + ' ppm')
RMSE = 0.2215 ppm

As we can see, this decomposition gets us much, much closer to the observations, with an RMSE of just 0.22 ppm. This is because the seasonal cycle is allowed to fluctuate in amplitude, and the trend is more general than a quadratic. However, it is also more of a black box.
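The individual components are exposed as attributes of the result object, so you can work with them directly; for instance, a quick sketch comparing the STL trend to the raw series:

fig, ax = plt.subplots()
stl_res.observed.plot(ax=ax, linewidth=1, label='observed')
stl_res.trend.plot(ax=ax, linewidth=2, label='STL trend')
ax.set_ylabel('CO2 [ppm]')
ax.legend()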

4. Forecasting#

Let us return to our parametric model of atmospheric \(CO_2\) (quadratic trend + seasonal harmonics), which we may use to forecast future observations. Let’s extend our time axis to 2050 and see what \(CO_2\) may look like then if we neither curb emissions nor remove \(CO_2\) from the atmosphere:

tn = t.to_numpy()
dt = np.diff(tn)[0] # keep the same (monthly) time step
tf = np.arange(max(tn)+dt, 2050, dt) # future time axis, out to 2050
tp = np.concatenate((tn,tf)) # past + future
nf = len(tf) # number of future values to forecast
Hp = np.column_stack((np.cos(2*np.pi*tp),np.sin(2*np.pi*tp)))
Xp = np.concatenate((np.vander(tp, 3),Hp),axis=1) # same design matrix as before, extended in time

yh = seas_mdl.predict(Xp) 
plt.plot(t, y,'k.',alpha=0.5, label="Raw Data")
plt.plot(tp, yh, label=r"$CO_2 = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 \cos(2\pi t) + \beta_4 \sin(2\pi t)$", linewidth=2)
plt.ylabel(r"$CO_2$")
plt.xlabel("Date")
plt.legend()
[Figure: MLO data with the quadratic + seasonal model extrapolated to 2050]

From this, one can estimate, for instance, the time of first crossing 450 ppm.
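A minimal sketch of that estimate, using the forecast yh from the cell above (and assuming it does cross 450 ppm before 2050, as the plot suggests):

idx = np.argmax(yh >= 450) # index of the first forecast value at or above 450 ppm
print('First crossing of 450 ppm: ' + '{:4.1f}'.format(tp[idx]))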

Of course, those predictions are only as good as the data that went into them, and as the assumption that the same processes will continue to operate at the same rate. Note that there are much more sophisticated methods for timeseries forecasting based on machine learning, e.g. tsfresh. Another option that would likely work here is a seasonal ARIMA model, as done here.