Outlier Detection with Pyleoclim#

by Deborah Khider & Pin-Tzu Lee

Preamble#

Goals#

  • Understand outlier detection through clustering

  • Apply detection algorithm and understand the difference between the two algorithms used in Pyleoclim, namely k-means and DBSCAN.

  • Understand the diagnostic metrics returned by each algorithm

  • Understand the parameters and when to apply them

Reading time: 10 minutes

Keywords#

Outlier detection; anomaly detection; clustering; unsupervised machine learning.

Pre-requisites#

An understanding of clustering is useful

Relevant Packages#

scikit-learn

Data Description#

This tutorial makes use of the following dataset, stored in LiPD format:

  • Khider, D., Jackson, C. S., and Stott, L. D. (2014), Assessing millennial-scale variability during the Holocene: A perspective from the western tropical Pacific, Paleoceanography, 29, 143– 159, doi:10.1002/2013PA002534.

Demonstration#

Let’s import the necessary packages:

%load_ext watermark

import pyleoclim as pyleo
import pandas as pd
from pylipd.lipd import LiPD

# For clustering
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot  as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import silhouette_score
import math


import warnings
warnings.filterwarnings('ignore')
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
D = LiPD()

D.load('../data/MD982181.Khider.2014.lpd')

timeseries,df = D.get_timeseries(D.get_all_dataset_names(),to_dataframe=True)

df
Loading 1 LiPD files
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  1.57it/s]
Loaded..
Extracting timeseries from dataset: MD982181.Khider.2014 ...
mode time_id funding1_agency funding1_grant funding1_principalInvestigator funding1_country funding2_agency funding2_principalInvestigator funding2_country funding2_grant ... paleoData_interpretation paleoData_takenAtDepth paleoData_inferredVariableType paleoData_notes paleoData_physicalSample_collectionmethod paleoData_physicalSample_name paleoData_physicalSample_housedat paleoData_sensorGenus paleoData_sensorSpecies paleoData_calibration
0 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'variableDetail': 'calendar', 'rank': 1.0, '... Depth Age NaN NaN NaN NaN NaN NaN NaN
2 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'hasVariable': {'label': 'temperature'}, 'sc... Depth NaN Average duplicates from paleo1measurementTable1 Calypso Giant Corer MD98-2181 Oregon State University Cibicidoides mundulus NaN
3 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'scope': 'Isotope', 'interpdirection': 'posi... depth_cm d18O NaN NaN NaN NaN NaN NaN [{'equation': 'T = 14.9 - 4.81(dc-dsw+0.27)', ...
4 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'variableDetail': 'sea surface', 'local': 'l... depth_cm NaN Average duplicates from paleo1measurementTable1 Calypso Giant Corer MD98-2181 Oregon State University Globigerinoides ruber NaN
5 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'scope': 'Climate', 'interpdirection': 'posi... depth_cm Sea Surface Temperature NaN NaN NaN NaN NaN NaN [{'equation': 'Mg/Ca = 0.5exp(0.08T)', 'uncert...
6 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'local': 'local', 'interpdirection': 'positi... depth_cm NaN Average duplicates from paleo1measurementTable1 Calypso Giant Corer MD98-2181 Oregon State University Globigerinoides ruber NaN
8 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'scope': 'Age', 'rank': 1.0, 'hasVariable': ... depth_cm Age NaN NaN NaN NaN NaN NaN NaN
9 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN depth_cm NaN NaN NaN NaN NaN NaN NaN NaN
10 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN depth_cm NaN NaN Calypso Giant Corer MD98-2181 Oregon State University Globigerinoides ruber NaN
11 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'local': 'local', 'hasVariable': {'label': '... depth_cm NaN NaN Calypso Giant Corer MD98-2181 Oregon State University Globigerinoides ruber NaN
12 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'rank': 1.0, 'hasVariable': {'label': 'tempe... depth_cm NaN NaN Calypso Giant Corer MD98-2181 Oregon State University Globigerinoides ruber NaN
13 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN depth_cm NaN NaN Calypso Giant Corer MD98-2181 Oregon State University Cibicidoides mundulus NaN
14 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... [{'hasVariable': {'label': 'd18Osw'}, 'local':... depth_cm NaN NaN Calypso Giant Corer MD98-2181 Oregon State University Cibicidoides mundulus NaN
15 paleoData age National Science Foundation [AGS#1049238] L. Stott United States National Science Foundation L. Stott United States [AGS#1344514] ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

16 rows × 87 columns

Let’s use the sea surface temperatures (SST) record:

row = df[df['paleoData_variableName']=='temperature']

ts = pyleo.Series(
    time=row['age'].values[0],
    value=row['paleoData_values'].values[0],
    time_name = 'Age',
    time_unit = 'yr BP',
    value_name = 'SST',
    value_unit = 'deg C',
    label='MD982181.Khider.2014'
)
NaNs have been detected and dropped.
Time axis values sorted in ascending order

Let’s plot the timeseries:

ts.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [yr BP]', ylabel='SST [deg C]'>)
../_images/3f6623f82f813965185425037e74119340db06bbc1ef77e3edf3b070edfff22a.png

This record contains a long-term trend through the Holocence. In addition, it is fairly noisy, with some spurious spikes. Tempted to remove them by hand? Let’s use the outlier detection algorithm on this dataset first.

Unsupervised machine learning - clustering#

Unsupervised learning learns patterns from unlabeled data. One example of unsupervised learning is clustering, which groups a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). Outliers can then be identified depending on how far they fall from the cluster. They are several methods for clustering. In Pyleoclim, we make use of two algorithms:

  • k-means, which is centroid based. The principle is simple: find the \(k\) cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster are minimized. This algorithm requires the number of clusters to be known in advance. Alternatively, one can sweep over a number of clusters and use an optimization metric to decide on the number of clusters present in the dataset. Outliers are then indentified as data falling from a certain distance away from the centroid or its local density compared to the local densities of its neighbors (LOF). We’ll elaborate more on outlier detection later.

  • DBSCAN is an example of density-based clustering, in which clusters are defined as areas of higher density than the remainder of the data set. Data that sit in sparse areas are usually considered to be outliers.

Let’s have a look on how these two methods work for clustering.

DBSCAN#

Pyleoclim makes use of the method in scikit-learn. Let’s import the package and run through an example:

First, let’s make some random blobs of 300 data points for illustrative purposes. Let’s create four clusters, with data clustered within a 0.60 standard deviation.

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:,0], X[:,1], alpha=0.6)
<matplotlib.collections.PathCollection at 0x33f5998d0>
../_images/3c277bd4c40c69e5dbd8c18b9007374e020811ba41da9bf38a6f4c5c3f8de628.png

Have a look at the parameters for the DBSCAN algorithm. The most relevant ones are:

  • eps: Threshold \(\epsilon\). Two points are considered neighbors if the distance between the two points is below the threshold \(\epsilon\).

  • min_samples: The minimum number of neighbors a given point should have in order to be classified as a core point. It’s important to note that the point itself is included in the minimum number of samples. If the value of this parameter is set to 8, then there must be at least 8 points to form a cluster.

  • metric: The metric to use when calculating distance between instances in a feature array (i.e. euclidean distance).

Let’s have a look using semi-random values for eps and min_samples:

m = DBSCAN(eps=0.3, min_samples=5)
m.fit(X)
DBSCAN(eps=0.3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

The labels_ property contains the list of clusters and their respective points:

clusters = m.labels_
np.unique(clusters)
array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Using these particular values, DBSCAN identified 11 clusters (the points contained in the ‘-1’ label are considered outliers by the algorithm), which is more than the 4 blobs they were originally created. Let’s plot them in various colors. The outliers will be plotted in navy (dark blue):

colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

plt.scatter(X[:,0], X[:,1], c=vectorizer(clusters), alpha=0.6)
<matplotlib.collections.PathCollection at 0x33efa2c10>
../_images/eb2482a12d9a70e38a94a7b4ad4fbd46d43e94b4ed43bccfd564a2866638e500.png

There seems to be a lot of outliers and clusters that seem to be very close to each other. Remember that we chose a distance threshold of 0.3. Let’s increase this value to 0.7:

m2 = DBSCAN(eps=0.7, min_samples=5)
m2.fit(X)
DBSCAN(eps=0.7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
clusters = m2.labels_
np.unique(clusters)
array([-1,  0,  1,  2])

We have identified 3 clusters (and other outliers).

colors = ['royalblue', 'maroon', 'forestgreen', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

plt.scatter(X[:,0], X[:,1], c=vectorizer(clusters), alpha=0.6)
<matplotlib.collections.PathCollection at 0x33f00e750>
../_images/582928cb9eff17aed92cfadb911783d86a26703f19a9e9938bf034b393df2e8d.png

With this value of eps, we now have three clusters and some outliers. You may ask yourself, if this algortihm is so sensitive to the parameter values, how do I know which one to choose? One solution (the one implemented by Pyleoclim) is to sweep throught a range of values for eps and min_sample and evaluate the best combination using a performance metric, in this case, the silhouette score.

The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.

The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Let’s calculate the silhouette score for the two different eps values:

s1 = silhouette_score(y.reshape(-1, 1), m.labels_)
s1
-0.25666666666666665

The silhouette score for eps = 0.3 is negative and close to zero, which indicates that this really wasn’t an appropriate choice for parameter values.

s2 = silhouette_score(y.reshape(-1, 1), m2.labels_)
s2
0.41586666666666666

A silhouette score of 0.42 is definitely better, but let’s see if we can improve upon the score by sweeping across values of the parameters. Now, the question becomes, what are appropriate values for eps and min_samples to sweep over. Well, let’s start with min_samples.

Essentially, this is the number of samples to expect at a minimum in each cluster, which means that we need at least 2 samples. The maximum values will depend on the application but let’s assume that we don’t want more than 1/4 of the data:

min_samples_list = np.linspace(2,len(y)/4,50,dtype='int')

For the eps value, we can decide based on distance to the nearest neighbor:

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
plt.ylabel('distance')
Text(0, 0.5, 'distance')
../_images/f6da033e38d97cb15c88d2342bf3d374d7e1ba843ec516dfa02ad3201a33d1f2.png

From this plot, it would make sense to sweep from 0.05 to 0.8. If you are familiar with DBSCAN, you know that this method can be used to define the “best” eps value as the “knee” of the curve (or elbow depending on your perspective). This would indicate that 0.3 would have been the best value; yet we have shown that it defnitely isn’t. For this reason, Pyleoclim does not emply this method by default and we caution against the use of it.

eps_list = np.linspace(0.05, 0.8, 20)

Now let’s sweep!… and keep track of a few things:

nbr_clusters=[] # the number of clusters
sil_score =[] # the silhouette score
eps_matrix=[] # tracking the value of eps for each min_sample
min_sample_matrix=[] # tracking the value of min_sample for each eps
for eps in eps_list:
    for min_samples in min_samples_list:
        eps_matrix.append(eps)
        min_sample_matrix.append(min_samples)
        m = DBSCAN(eps=eps, min_samples=min_samples)
        m.fit(X)
        nbr_clusters.append(len(np.unique(m.labels_)))
        try:
            sil_score.append(silhouette_score(X, m.labels_))
        except:
            sil_score.append(np.nan)
res = pd.DataFrame({'eps':eps_matrix,'min_samples':min_sample_matrix,'number of clusters':nbr_clusters,'silhouette score':sil_score})
res
eps min_samples number of clusters silhouette score
0 0.05 2 20 -0.624168
1 0.05 3 2 -0.224822
2 0.05 4 1 NaN
3 0.05 6 1 NaN
4 0.05 7 1 NaN
... ... ... ... ...
995 0.80 69 1 NaN
996 0.80 70 1 NaN
997 0.80 72 1 NaN
998 0.80 73 1 NaN
999 0.80 75 1 NaN

1000 rows × 4 columns

And now let’s locate the maximum value of the silhouette score (which hopefully is close to 1):

res.loc[res['silhouette score']==np.max(res['silhouette score'])]
eps min_samples number of clusters silhouette score
905 0.760526 9 5 0.672543
906 0.760526 10 5 0.672543
955 0.800000 9 5 0.672543
956 0.800000 10 5 0.672543

This looks promising: I’m beeing told that I have 5 clusters (the 4 clusters I asked for in the blob + the outliers) and the silhouette score is closer to 1. Let’s plot #906:

m3 = DBSCAN(eps=0.760526, min_samples=10)
m3.fit(X)
colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'deeppink']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
plt.scatter(X[:,0], X[:,1], c=vectorizer(m3.labels_), alpha=0.6)
plt.title('eps: 0.76; min_samples: 10')
plt.show()
../_images/fd8a7a45a49f9e51ab8fff6e8b39f7b5559e19c8ac17a288733f10426ea7049b.png

The clusters look like something I would have chosen by eye, with 2 outliers (deep pink) on the outside.

k-means#

Pyleoclim uses the k-means method from scikit-learn. In this case, the most important parameter is the number of clusters. Let’s use this algorithm on the blobs, with a number set to 4:

kmeans = KMeans(n_clusters=4)
m = kmeans.fit(X)

Let’s have a look at our four clusters:

colors=['royalblue', 'maroon', 'forestgreen', 'mediumorchid']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

plt.scatter(X[:,0], X[:,1], c=vectorizer(m3.labels_), alpha=0.6)
<matplotlib.collections.PathCollection at 0x33f250d10>
../_images/15a99ec57fed2bf225fbde5942ac4a64f8081dc46aa07138410bff67a4c96e6e.png

Now that we have our four clusters, how do we determing the outliers?

Outliers are identified based on their distance from the clusters. This can be done in two ways: (1) by using a threshold that corresponds to the Euclidean distance from the centroid and (2) using the Local Outlier Function

Let’s take a look at the result of these two different approaches:

threshold#
threshold = 1
center=m.cluster_centers_[m.labels_]
distance = []
for idx,item in enumerate(center):
    distance.append(math.dist(item,X[idx,:]))
idx_out = np.argwhere(np.array(distance)>threshold).reshape(1,-1)[0]

Let’s plot our clusters and outliers in red:

plt.scatter(X[:,0], X[:,1], c=vectorizer(m3.labels_), alpha=0.6)
plt.scatter(X[idx_out,0], X[idx_out,1],c='r', alpha=0.6)
<matplotlib.collections.PathCollection at 0x33f260610>
../_images/280967543a133ed77bd0f740d75a66bb5a12810c142afa34d688ae7085c5b528.png

How to choose the right number of clusters? Visual inspection can be useful but a simlar sweep with the sillhouette score as the performance metric can also be used.

LOF#
n_frac = 0.01
contamination = 'auto'
model = LocalOutlierFactor(n_neighbors=int(X.size*n_frac), contamination=contamination)
pred = model.fit_predict(X)
idx_out = np.where(pred==-1)
plt.scatter(X[:,0], X[:,1], c=vectorizer(m3.labels_), alpha=0.6)
plt.scatter(X[idx_out,0], X[idx_out,1],c='r', alpha=0.6)
<matplotlib.collections.PathCollection at 0x33237bbd0>
../_images/56f95180ab1db3eb0651e5b1ee94985663864487d6e1ac4e469dc445f686ab1f.png

Outlier detection in Pyleoclim#

Pyleoclim employs the two clustering algorithms details aboved for its outlier detection in timeseries data. The functions take several arguments, the most relevant being listed here:

  • method – You can select either ‘kmeans’ or ‘DBSCAN’

  • remove (bool, optional) – If True, removes the outliers.

  • settings (dict, optional) – Specific arguments for the clustering functions. As detailed above the argument for each of these methods can be quite different. Let’s have a look:

    • DBSCAN:

      • nbr_clusters – Number of clusters. Remember that the DSBCAN method finds the number of clusters automatically. This is an optimization parameter, allowing you to override the behavior of looking for the highest silhouette score.

      • eps – epsilon. The default is None, which allows the algorithm to optimize for the best value of eps, using the silhouette score as the optimization criterion. The sweep is made over eps values determined from nearest neighbors optimization as described above. You can choose to pass your own list for the sweep or choose one value.

      • min_samples – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. The default is None and optimized using the silhouette score. The sweep occurs from 2 to len(y)/4. You can pass your own list for the sweep.

      • n_neighbors – Number of neighbors to use by default for kneighbors queries, which can be used to calculate a range of plausible eps values. The default is None, which results in using the scikit-learn default.

    • kmeans:

      • nbr_clusters – A user number of clusters to considered. The default is None to use the optimization from the Silhouette Score. If the number of clusters can be inferred from a plot, use this parameter to set it.

      • max_cluster – The maximum number of clusters to consider in the optimization based on the Silhouette Score. The default is 10.

      • threshold – The algorithm uses the suclidean distance for each point in the cluster to identify the outliers. This parameter sets the threshold on the euclidean distance to define an outlier. The default is 3. By default, the algorithm standardizes the timeseries. Therefore, a value of 3 represents three standard deviation away from the mean.

      • LOF – By default, detect_outliers_kmeans uses euclidean distance for outlier detection (LOF = False). Set LOF to True to use LocalOutlierFactor for outlier detection. LOF measures the local deviation of the density of a given sample with respect to its neighbors. By comparing the local density of a sample to the local densities of its neighbors, one can identify samples that have a substantially lower density than their neighbors. These are considered outliers.

      • n_frac – The percentage of the time series length (the length, representing number of points) to be used to set the n_neighbors parameter for the LOF function in scikit-learn. We recommend using at least 50% (n_frac=0.5) of the timeseries. You cannot use 100% (n_frac!=1)

      • contamination – Same as LOF parameter from scikit-learn. We recommend using the default mode of ‘auto’. See scikit-learn document for details.

  • fig_outliers – Whether to display the timeseries showing the outliers. The default is True. This is a diagnostic plot, showing you the outliers in red.

  • fig_clusters – Whether to display the clusters. The default is True. This is a diagnostic plot.

  • keep_log – Keeps a log of the transformation. We recommend setting this value to True since the log will contain several diagnostics that can be very useful.

Let’s give this a try on the MD98-2181 series. As you can imagine, trends can be problematic for clustering so let’s remove it.

ts_p = ts.detrend().standardize()

ts_p.plot()
(<Figure size 1000x400 with 1 Axes>,
 <Axes: xlabel='Age [yr BP]', ylabel='SST [deg C]'>)
../_images/2d38efbbd4a206757182498d53b8c2cee87ca44e6534d4b3c6c522c8d7c09969.png

And let’s apply our outlier algorithm. We will start with kmeans, which is the default in Pyleoclim. Remember that these algorithms are fundamentally different, and may return different clusters.

Outlier dectection using k-means#

Using a threshold value based on the Euclidean Distance from centroid#

For the purpose of demonstration, let’s set the remove parameter to False, keep the log, and use the default removal method, which uses the threshold value.

ts_out = ts_p.outliers(method='kmeans', remove=False, keep_log=True)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/2d38efbbd4a206757182498d53b8c2cee87ca44e6534d4b3c6c522c8d7c09969.png ../_images/8cd79331ac07b3694d76c2266e24da411d5ccc6e3794f2cf157053832a6eb84d.png

Using the default settings, the number of clusters that maximizes the silhouette score is 4, with no outliers detected. Let’s have a look at the diagnostic kept in the log:

ts_out.log[0]['results']
number of clusters silhouette score outlier indices clusters
0 2 0.557350 [] [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, ...
1 3 0.530963 [] [2, 0, 0, 1, 0, 0, 2, 0, 2, 0, 0, 1, 0, 2, 2, ...
2 4 0.565988 [] [2, 0, 2, 0, 0, 0, 2, 2, 2, 2, 0, 3, 0, 2, 2, ...
3 5 0.549426 [] [2, 0, 2, 0, 0, 2, 2, 2, 4, 2, 0, 3, 0, 4, 2, ...
4 6 0.539486 [] [2, 4, 2, 0, 4, 4, 2, 2, 1, 2, 0, 5, 4, 1, 2, ...
5 7 0.542952 [] [5, 6, 3, 0, 6, 3, 5, 3, 5, 3, 0, 2, 6, 5, 5, ...
6 8 0.548035 [] [1, 6, 7, 0, 6, 6, 1, 7, 1, 7, 0, 5, 0, 1, 1, ...
7 9 0.538591 [] [4, 8, 0, 3, 8, 8, 4, 0, 4, 0, 3, 1, 3, 4, 4, ...
8 10 0.540070 [] [8, 2, 4, 5, 2, 4, 8, 4, 0, 4, 5, 1, 5, 0, 8, ...

The dataframe contains several pieces of information:

  • the number of clusters - Remember that this is set as in input to the k-means algorithm and we are sweeping over this range of values.

  • the silhouette score - Notice that it doesn’t vary much in this case for the various number of clusters.

  • the indices of the outliers - In this case, there are none.

  • the cluster assignment for each point - This can be valuable if you are more interested in recovering the clusters than the outliers.

But let’s consider the silhouette score for a minute. There is not much difference between 0.557 (2 clusters) and the optimized 0.565 (4 clusters). However, an argument could be made that there are really two clusters in this series (above and below 0) and the outlying clusters in turquoise and light blue are just representing the noise.

Let’s run this again, but impose 2 clusters:

ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2}, remove=False, keep_log=True)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/2d38efbbd4a206757182498d53b8c2cee87ca44e6534d4b3c6c522c8d7c09969.png ../_images/0d869dff1827b3eaddf77cd4ed106afe42dc32ea23dd1451fcb9ff3544e8758d.png

Results are very much identical, with no outliers being detected.

Let’s lower the threshold:

ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'threshold':2}, remove=False, keep_log=True)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/34898ec35d3cf689388dfd95a0b35d1b8f33d012bfba8d1b5be8e78f34d8cba5.png ../_images/feefe497accf86036466a66554fc2ed5c690a145d6684ea1c58fe938b7fb67d6.png

With a lower threshold, the outmost values are being removed, which is not surprising. This is the most sensitive parameter when using the k-mean algorithm and should be chosen with care.

Using LOF#

Next, let’s try out the other outlier removal method: LOF. For the purpose of demonstration, we start with the default settings of n_frac = 0.9, contamination = 'auto', and also set the remove parameter to False and keep the log.

ts_out = ts_p.outliers(method='kmeans', settings={'LOF':True}, remove=False, keep_log=True)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/f619fe1cbdc74a25df66e25bfe944b25210e9f0341a8a3e4e3cc6448359bc9b8.png ../_images/57a6c2ff74a9dedcca9e09b66e622a63155b5edbb74bc4c84446693fe9b12e74.png
ts_out.log[0]['results']
number of clusters silhouette score outlier indices clusters
0 2 0.557350 ([],) [1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, ...
1 3 0.530963 ([],) [2, 0, 0, 1, 0, 0, 2, 0, 2, 0, 0, 1, 0, 2, 2, ...
2 4 0.565762 ([],) [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 2, 0, 1, 1, ...
3 5 0.511526 ([],) [1, 2, 2, 0, 0, 2, 1, 1, 1, 2, 0, 4, 0, 1, 1, ...
4 6 0.543902 ([],) [4, 1, 4, 5, 1, 4, 4, 4, 0, 4, 5, 2, 1, 0, 0, ...
5 7 0.536126 ([],) [4, 5, 0, 5, 5, 0, 4, 0, 4, 0, 5, 2, 5, 4, 4, ...
6 8 0.548035 ([],) [3, 0, 7, 5, 0, 0, 3, 7, 3, 7, 5, 4, 5, 3, 3, ...
7 9 0.541802 ([],) [0, 7, 7, 8, 1, 7, 0, 4, 0, 4, 8, 5, 1, 0, 0, ...
8 10 0.543038 ([],) [3, 7, 0, 4, 7, 0, 3, 0, 9, 0, 4, 1, 4, 9, 3, ...

As expected, the best silhouette score is obtained with two clusters. Let’s set nbr_cluster to 2 and look at how the two removal method compare.

ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True}, remove=False, keep_log=True)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/f619fe1cbdc74a25df66e25bfe944b25210e9f0341a8a3e4e3cc6448359bc9b8.png ../_images/707fa57faba24bc356204d9fc7226cc2ad68cafbada3f2dd6221724ba07019e0.png

Results are identical with previous one using euclidean distance with no outliers being detected.

LOF has two main parameters to consider: n_frac and contamination. Let’s have a look at these two.

First let’s vary n_frac from 0.5 to 0.9 (default)

ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'n_frac':0.5}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'n_frac=0.5'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'n_frac':0.6}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'n_frac=0.6'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'n_frac':0.7}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'n_frac=0.7'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'n_frac':0.8}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'n_frac=0.8'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'n_frac':0.9}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'n_frac=0.9'})
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/c80c432fc5bc577cfc5ff89243e5e1b6951b6dabe661ae812fea346b1733df43.png ../_images/8fc02affb38f966e8faf42b5e6a59764d2f26453bb6a7014968c3d1af3cebe85.png ../_images/835f26c858ebe21e817b7de16202c531df17e4b61d59bdcb057d0ff14091813b.png ../_images/28c7f7276b64841b41d24495604ab3c95df73889a6dcb2f42552f532b036f29e.png ../_images/38ad3d201d69b3dac477e2dba2ffc3b2023330fb6318118a5b2bf118bcd018a0.png ../_images/8d35f8b4f020fb1d7de876abe74ddfe2f602a9de7ee36fff19623664e9a6b776.png ../_images/356bc9b222573e3e661c46874ef3c34a1965a8ba9a65776bff24da4821569e01.png ../_images/bb7dfc9cdbcdafb3fc7224a8408ded55084dbd7d3c290c97dcf5e10eba209138.png ../_images/ca8ab555d510d526c50319ffddf365b4d240b354a81eaec29b995446f2dd9ae0.png ../_images/099298d489596bafaeed45cf4bc448d809c8b164a6b63f5c9a7e5845f586d042.png

Next let’s vary contamination with 0.02, 0.1, 0.3

ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'contamination':0.02}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'contamination=0.02'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'contamination':0.1}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'contamination=0.1'})
ts_out = ts_p.outliers(method='kmeans', settings={'nbr_clusters':2, 'LOF':True, 'contamination':0.3}, remove=False, keep_log=True, plotoutliers_kwargs={'title':'contamination=0.3'})
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/e42bc5a1758fbc941072e816b9e24ace83857d434bbf0ac8e54b0a4280d2fefc.png ../_images/03748f1887a3c050ba94c8c1018b7789106840f12ce3427d5c72538daedc7ba6.png ../_images/e1a68d9d902faf0a5ad1c20c39279527197c5ffc811e1bbd4ac3cf5d97c794c9.png ../_images/2b61a9a596f60f690793fe604c709d32e03167277d8d0467e2c4f52ff8c1a825.png ../_images/c5cf301e7243e55c2f773eaabc504c09a064160c51f441c6afb28f1c0665b123.png ../_images/05a9f6c65141d0781964eab78249d8387cc914504d2f289845b4dc26d6f77acf.png

Outlier detection with DBSCAN#

Let’s turn our attention to DBSCAN:

ts_out = ts_p.outliers(method='DBSCAN', remove=False, keep_log=True)
Optimizing for the best number of clusters, this may take a few minutes
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/48aa1c8c67d748fda5be7d89a5804b27212ccc625a7ba7b8d79091d8fe01cc32.png ../_images/c7ab72790071ea01ce0f53d341b7edf63f9aaf5231224b3763fbedcc86cf69f2.png

The algorithm optimizes for one cluster and two outliers. Let’s have a look at the log, which will contain the silhouette score:

res = ts_out.log[0]['results']
res
eps min_samples number of clusters silhouette score outlier indices clusters
0 0.010000 2 65 0.346318 [2, 8, 11, 13, 14, 16, 21, 25, 26, 28, 29, 30,... [0, 1, -1, 2, 3, 4, 5, 6, -1, 7, 8, -1, 9, -1,...
1 0.010000 3 44 0.185807 [0, 2, 8, 11, 13, 14, 16, 17, 21, 24, 25, 26, ... [-1, 0, -1, 1, 2, 15, 3, 32, -1, 4, 5, -1, 6, ...
2 0.010000 5 15 -0.272447 [0, 2, 3, 5, 7, 8, 9, 10, 11, 13, 14, 16, 17, ... [-1, 0, -1, -1, 1, -1, 13, -1, -1, -1, -1, -1,...
3 0.010000 7 3 -0.280342 [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... [-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...
4 0.010000 8 2 -0.306025 [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... [-1, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1...
... ... ... ... ... ... ...
2495 5.665417 77 0 NaN [] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2496 5.665417 79 0 NaN [] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2497 5.665417 80 0 NaN [] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2498 5.665417 82 0 NaN [] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2499 5.665417 84 0 NaN [] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

2500 rows × 6 columns

The log contains the values of eps and min_samples that were tried, the number of resulting clusters, the silhouette score, the indices of the outliers and the cluster assignment.

Let’s look for the maximum value of the silhouette score:

res.loc[res['silhouette score']==np.max(res['silhouette score'])]
eps min_samples number of clusters silhouette score outlier indices clusters
202 0.471667 5 1 0.616233 [145, 201] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
203 0.471667 7 1 0.616233 [145, 201] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
204 0.471667 8 1 0.616233 [145, 201] [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Several combinations result in a similar silhouette score (by default, Pyleoclim will work with the first one), with a eps value of 0.47. Let’s make sure that this value is not the maximum that was tried (if so, I would recommend passing your own list of eps value direclty):

res['eps'].max()
5.6654168267442815
res['eps'].min()
0.01

This seems reasonable (and not an artifact of the limits on the parameter sweeps). Note that DBSCAN and k-means both identified a common set of outliers that could be candidates for removal. Generally, if the answers are sensitive to the method and parameters, then it is a good indication that the method is too sensitive and conclusions (i.e., whether the data points are outliers) are not robust. If this is the case, proceed carefully with remove.

%watermark -n -u -v -iv -w
Last updated: Mon Mar 04 2024

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.20.0

pyleoclim: 0.13.1b0

Watermark: 2.4.3