Chapter 4: Spectral Analysis

Chapter 4: Spectral Analysis#

by Julien Emile-Geay, University of Southern California

Recall that in spectral analysis we step into a “bizarro world”, switching from the time domain to the frequency domain. There, two essential features are of interest: peaks, and background. In this tutorial, we will perform spectral analysis on Rio Grande streamflow, and see if we can learn anything interesting about it in this way.

Packages#

In addition to the Scientific Python stack (numpy, scipy, matplotlib), this tutorial showcases:

pandas
Pyleoclim was designed for the analysis of paleoclimate timeseries, but we show here that is perfectly able to deal with any timeseries.

Dataset#

We will be looking at the daily discharge of the Rio Grande, which has been measured at Embudo, NM since 1889. The data and their associated metadata may be retrieved from the USGS website. Let’s load them and have a look:

Data Wrangling#

Once again we use pandas to load the data from the web. We need to give it a little help by specifying the column names:

import pandas as pd
url = 'https://waterdata.usgs.gov/nwis/dv?cb_00060=on&format=rdb&site_no=08279500&legacy=&referred_module=sw&period=&begin_date=1889-01-01&end_date=2024-05-20'

df = pd.read_csv(url, header=30, sep='\t', names=['agency', 'site_no', 'datetime','discharge (cf/s)','code'])
df.head()

	agency	site_no	datetime	discharge (cf/s)	code
0	USGS	8279500	1889-01-01	398.0	A
1	USGS	8279500	1889-01-02	385.0	A
2	USGS	8279500	1889-01-03	395.0	A
3	USGS	8279500	1889-01-04	400.0	A
4	USGS	8279500	1889-01-05	413.0	A

the fourth column is the one we want, and we’ll want to use the datetime column as a index for pandas. Let us create a new pandas Series from this dataframe, with this new and improved index:

date = pd.to_datetime(df['datetime'])
dis = df['discharge (cf/s)']
dis.index = date
dis

datetime
1889-01-01     398.0
1889-01-02     385.0
1889-01-03     395.0
1889-01-04     400.0
1889-01-05     413.0
               ...  
2024-05-16    1000.0
2024-05-17    1020.0
2024-05-18    1030.0
2024-05-19    1150.0
2024-05-20    1280.0
Name: discharge (cf/s), Length: 49448, dtype: float64

First, let us inspect the distribution of values:

dis.hist()

<Axes: >

../_images/38552209e66f76bc9fc0157881524bfebae32c1d4aacb452d4a42a8b23478e20.png

Like many other variables, streamflow is positively skewed: the distribution is asymmetric, with most values near zero and few instances of very high streamflow. Now let’s inspect the temporal behavior:

dis.plot()

<Axes: xlabel='datetime'>

../_images/8fc8320b95021491a6ce568af3f36e521eb3db99597863d2a2bb796b52b9ac2c.png

We see that there is a large gap in the dataset, which could complicate analysis. Overall, a significant percentage of values are missing (coded as NaNs). We can estimate that percentage as follows, using pandas’s isna():

dis.isna().sum()/len(dis)*100

6.651431807150947

This is no problem for Pyleoclim, which is built to handle missing values, even large gaps like the one in the 1910s. However, daily data are a lot! Let’s resample to a monthly interval to lighten up calculations a bit:

dis_c = dis.resample('M').mean()
dis_c.plot()

<Axes: xlabel='datetime'>

../_images/b6fe8e2c2541e5b8aefa95de030f47d1613dabb250859e98d601906cfdedc3e9.png

Conversion to Pyleoclim#

Having loaded and inspected the data, we are ready to start analyzing it. We do so via the Pyleoclim package. Here we show one way to ingest the pandas Series object into a Pyleoclim Series object. In addition to dis_c, we need a little bit of metadata:

metadata = {
        'time_unit': 'years CE',
        'time_name': 'Time',
        'value_unit': 'cf/s',
        'value_name': 'discharge',
        'label': 'Rio Grande at Embudo, NM (monthly)',
        'archiveType': 'Instrumental',
        'importedFrom': 'USGS',
    }

import pyleoclim as pyleo
ts = pyleo.Series.from_pandas(ser=dis_c,metadata=metadata)
ts

{'archiveType': 'Instrumental',
 'importedFrom': 'USGS',
 'label': 'Rio Grande at Embudo, NM (monthly)'}

None
Time [years CE]
086207     432.548387
162869     472.607143
247744     781.483871
329881    2259.666667
414756    3430.645161
                  ...     
084299     413.967742
163699     494.068966
248574     573.451613
330711     875.900000
415586    1190.000000
Name: discharge [cf/s], Length: 1625, dtype: float64

ts.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [years CE]', ylabel='discharge [cf/s]'>)

../_images/84e211613e6d3a39a100b1b278d89cb2ddbd80d3390309a547ed787c02a81dcc.png

There seem to be two segments. Let’s see what Pyleoclim makes of them:

ms = ts.segment()
for s in ms.series_list:
    s.plot()

../_images/f2228b76eb8207b09a9ecbbbf33dc5ad928d9cee66f35ae192429fabf4706083.png

../_images/a7b5aefa84ed43de3626fb38ee3d8c21f493f1c6a9ec2865509805b3f043642e.png

Note that despite our monthly resampling, the series are not actually evenly spaced:

ts_late.resolution().dashboard()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 ts_late.resolution().dashboard()

NameError: name 'ts_late' is not defined

Recall that most timeseries methods assume even spacing (no gaps). We have two options: apply dedicated gap-tolerant spectral methods, or find away to fill the gaps so that we can apply classical techniques that assume even spacing.

Gap-free analysis#

Let us focus on the last segment for our analysis. Let’s use the spectral() method and select the multi-taper method (MTM)

ts_late = ms.series_list[1]
psd = ts_late.standardize().spectral(method='mtm')
psd.plot()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[213], line 2
      1 ts_late = ms.series_list[1]
----> 2 psd = ts_late.standardize().spectral(method='mtm')
      3 psd.plot()

File ~/Documents/GitHub/Pyleoclim_util/pyleoclim/core/series.py:2998, in Series.spectral(self, method, freq, freq_kwargs, settings, label, scalogram, verbose)
   2990     res = Results(amplitude=scalogram.amplitude, coi=scalogram.coi,
   2991                   freq=scalogram.frequency, time=scalogram.time,
   2992                   scale=scalogram.wave_args['scale'],
   2993                   mother=scalogram.wave_args['mother'],
   2994                   param=scalogram.wave_args['param'])
   2995     args['cwt'].update({'cwt_res':res})
-> 2998 spec_res = spec_func[method](self.value, self.time, **args[method])
   2999 if type(spec_res) is dict:
   3000     spec_res = dict2namedtuple(spec_res)

File ~/Documents/GitHub/Pyleoclim_util/pyleoclim/utils/spectral.py:315, in mtm(ys, ts, NW, BW, detrend, sg_kwargs, gaussianize, standardize, adaptive, jackknife, low_bias, sides, nfft)
    313 check = is_evenly_spaced(ts)
    314 if check == False:
--> 315     raise ValueError('For the MTM method, data should be evenly spaced')
    316 # preprocessing
    317 ys = preprocess(ys, ts, detrend=detrend, sg_kwargs=sg_kwargs,
    318            gaussianize=gaussianize, standardize=standardize)

ValueError: For the MTM method, data should be evenly spaced

Here we got reprimanded for passing an unevenly-spaced series to a method that expects even spacing. Never mind; we can interpolate to a monthly sampling:

tsi = ts_late.interp(step=1/12) # note the step of 1/12, as the time has units of years
fig, ax = ts_late.plot()
tsi.plot(ax=ax, label = 'monthly interpolation')

<Axes: xlabel='Time [years CE]', ylabel='discharge [cf/s]'>

../_images/c44523977687c784c14f56cf9d107529a9599865905ffd82d1ca962383f6880d.png

We can see that the results are very close. Now let us apply MTM to the interpolated series:

psd_mtm = tsi.spectral(method='mtm')
psd_mtm.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/4e411c52093cb465a115f972063872a100cb896cf8700c143208d19c0347fdde.png

The prominent feature is a strong annual cycle and higher-order harmonics, super-imposed on a “warm colored” background (i.e. variations at long timescales are larger than variations at short timescales). There are hints of long-term scaling as well, one in the subannual range, and one from period of 1 to 50y. We may be interested in the shape of this background, and whether it can be fit by one or more power laws.

In addition, we may be interested in interannual (year-to-year) to interdecadal (decade-to-decade) variations in streamflow. We would like to know whether the peak observed around 3-4 years in the plot above is significant with respect to a reasonable null.

To do both of these things it would make sense to remove the annual cycle. There are several options to do so:

STL decomposition (see Intro notebook)
calculating and removing the monthly-mean seasonal cycle

To apply STL, we need to get back into pandas land. Thankfully, Pyleoclim has utilities that make interconversion with pandas a breeze. Let’s return to our monthly-sampled dataframe, but only select the indices that correspond to the late segment identified by Pyleoclim:

dis_late = dis_c.loc[dis_c.index > ts_late.to_pandas().index.min()] # select the last segment of the dataframe
dis_late.plot()

<Axes: xlabel='datetime'>

../_images/b00f24aa6292140e43b0dd6e5e06f4225d18e0f4dc038d32faccd7cd735f1e48.png

Now, because of the positive skew mentioned above, it turns out that the STL decomposition has trouble with this dataset. Instead, we can use the gaussianize() to map this dataset to a standard normal. There are applications for which this would be undesirable, but for this purpose it is good:

ts_lg = tsi.gaussianize()
ts_lg.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [years CE]', ylabel='discharge [cf/s]'>)

../_images/86d8ee2414057bac70f9b205bbb1df5be1bdbcdc65051a3919e78f486e395707.png

ts_lg.spectral(method='mtm').plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/b660ed18f6b13695ed4e6588ccbff81096f3c65906258f32ca49d01990bec12e.png

We see that the spectrum is nearly unchanged, but now we can apply STL:

from statsmodels.tsa.seasonal import STL
dis_m = tsi.to_pandas().resample('M').nearest()
stl = STL(dis_m, seasonal=13)
stl_res = stl.fit()
fig = stl_res.plot()

../_images/226a3246060830e49d63bc2044a5e3df181b3e91cfe451f35b1b459ed458c602.png

Let’s turn this into a Pyleoclim series and apply spectral analysis

anom = pyleo.Series.from_pandas(stl_res.trend,metadata)
anom.label = tsi.label + ' (anom)'
anom.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [years CE]', ylabel='discharge [cf/s]'>)

../_images/8f7574c938a022484b2610cbf76b27b3f79345ea6cd8ee3121750cef19628f8d.png

psd_a = anom.interp().spectral(method='mtm')
psd_a.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/93557deef44ee8e962a549e572d3071a9f59a4cca703faa7ab72bf0e0e01f209.png

Indeed this removed the annual cycle and its harmonics (for the most part). It’s a fairly aggressive treatment, but we can now test the significance of the interannual peaks w.r.t to an AR(1) benchmark.

psd_ar1 = psd_a.signif_test(method='ar1sim',number=1000)
psd_ar1.plot()

Performing spectral analysis on individual series: 100%|██████████| 1000/1000 [00:42<00:00, 23.41it/s]

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/2e183a67dda22dc15e7915377f2a1f8acf4f2a168db15022e7719402f48812f0.png

Note that we specified 1,000 surrogates to be used for this assessment of significance. The default is 200, but that is too few for research-grade applications, and certainly for publication.

If you wanted a colored-noise (power-law) benchmark instead, you would invoke it via the method argument:

psd_cn = psd_a.signif_test(method='CN',number=1000)
psd_cn.plot()

Performing spectral analysis on individual series: 100%|██████████| 1000/1000 [00:43<00:00, 23.22it/s]

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/6ba22ea867881b8c1ec577bbd9f70e0dbbbf603a11b23b24ad079b329ebdda31.png

We see that in both types of surrogates, the broadband peak in the 2-10y range is above the 95% threshold, so these frequencies could be considered significant.

Estimation of scaling behavior#

In this last example, Pyleoclim fit a single scaling exponent to the timeseries, spanning the entire range of frequencies(periods). Under the hood, all we are doing is fitting a straight line to the spectrum:

psd_a.beta_est().plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/9573970090ac3ff6b87464ed6b07c9e9b841678facfbc647d43e7329bdbc01c9.png

This results in a fairly steep exponent near 2.4. If we were specifically interested in the scaling exponent (spectral slope) between periods of 2-100y, you would do it like so:

psd_a.beta_est(fmin=1/100, fmax=1/2).plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/3b7fc0635bfb5a553355167312ca336d14c6d2ed59f316cde1228079fd136f08.png

We see that this results in a much flatter line, i.e. a smaller scaling exponent around 0.75

Gap-tolerant spectral analysis#

We return to the original series and apply 2 techniques to obtain the spectrum, keeping gaps in th series.

By default, spectral() invokes the Lomb-Scargle periodogram
method = wwz invokes the Weighted-Wavelet Z-transform

psd_ls = ts.standardize().spectral()
psd_ls.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/6138025e2f81a3968d95cdea3e801c7db10798d8ff958cbee3dc77bb824ecd9c.png

psd_wwz = ts.standardize().spectral(method='wwz')
psd_wwz.plot()

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Period [years]', ylabel='PSD'>)

../_images/a4fdd12955cf3c473307819b7cbccf9d3a285bdbb4329181aa6576bbaf2e2bf3.png

Let’s plot all those spectra on the same figure:

fig, ax = psd_mtm.plot(label='segment 2, MTM')
psd_ls.plot(ax=ax,label='whole series, Lomb-Scargle',alpha=0.6)
psd_wwz.plot(ax=ax,label='whole series, WWZ',alpha=0.6)

<Axes: xlabel='Period [years]', ylabel='PSD'>

../_images/de16d107c4a30e2dbeaaff7a7b799c3e004d2b1cdad56d80da0732380ce6649c.png

It is often useful to be able to compare methods and/or parameter choices, to see if the results are robust. We see that some choices lump peaks into broad bands, others tend to slice them up.

FAILED SSA attempt#

ssa = 
ssa.screeplot()

(<Figure size 600x400 with 1 Axes>,
 <Axes: title={'center': 'SSA scree plot'}, xlabel='Mode index $i$', ylabel='$\\lambda_i$'>)

../_images/6dfe833f26b5de7a0b4210ec61fbce2b7fba1456333ed9248a79401a1017c370.png

for i in range(10):
    ssa.modeplot(index=i)

../_images/da33a12095cb3fad297e85c0ceda696acc27c27ee72c756d6dae8aaf431ff338.png

../_images/612f1baf9a35b07ed140b3689dbb4997f102d8197249be3541cc1313bbe7978b.png

../_images/40527a54e910e27d7627d82ffc009f114c3ac47a34bb547bfe0cda7f3489ea36.png

../_images/a87a952f3fc0d5da87291dce46b911b781fa1ececc0cb6f652a7fd89ca6ad95e.png

../_images/3cf33691e78f088a11ffa7fb5bb19b69cfec3ffbffc0b9faac294ee53ba4e467.png

../_images/2c0a1242907bbb95f1f6a864557f1b24900c7cb409ba98ef6fa77fde2797cfc3.png

../_images/8c62110275cc73989d9292e4dc63831685f8fa369e8065d6cb285aaf136adc91.png

../_images/b06ddbdbeac63ed531f059e91af56c9c7f8361fc25c7cde2c02db6de16f6747b.png

../_images/88403a3ea14aebdd042c9c449496640500396fed896b8fe5cfe9f477dee0ed76.png

../_images/0e70fa5089e2160f4f65b7ac92ae3e816ec45fd5e7cd102b090e9bb108b81798.png

Subannual modes identified by SSA:

subannual_modes = [0, 1 ,2, 3, 5, 6]
M = ssa.RCmat.shape[1] # number of modes
modes2keep = [i for i in range(M) if i not in subannual_modes] # generate list of modes by list comprehension and range constructor
modes2keep

ts_rc = ts_lwc.copy()
ts_rc.value = ssa.RCmat[:,modes2keep].sum(axis=1)

ts_rc.plot(label='SSA reconstruction')

(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Time [years CE]', ylabel='discharge [cf/s]'>)

../_images/8f27041554e082f8a3c569c06ac380c6118a9f3bc5aa2f6e4080b64059fae04e.png