HDBScan Clustering with MPDist

Learn how to use the MPDist metric with the HDBScan clustering algorithm.

Tyler Marrs

July 17, 2020


clustering hdbscan matrixprofile python tutorial

HDBScan is a newer clustering algorithm that combines concepts from hierarchical clustering and DBScan. For more details about the algorithm, see the following paper: https://arxiv.org/abs/1911.02282

This notebook illustrates how to use MPDist with HDBScan. The specific implementation of HDBScan may be found here - https://hdbscan.readthedocs.io/en/latest/

This example is deliberately simple. Two kinds of time series are generated to illustrate the workflow: five copies of a uniformly random series and three copies of a linearly increasing series.

In [1]:

import hdbscan
from matrixprofile.algorithms.hierarchical_clustering import pairwise_dist
from scipy.spatial.distance import squareform

import numpy as np

In [2]:

np.random.seed(9999)

In [3]:

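data = []
size = 100

# Five identical copies of one uniformly random series (first expected cluster).
random_ts = np.random.uniform(size=size)

for _ in range(5):
    data.append(np.copy(random_ts))

# Three identical linearly increasing series (second expected cluster).
for _ in range(3):
    data.append(np.arange(size))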

In [4]:

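window_size = 8  # subsequence length used when comparing series
n_jobs = 4       # number of parallel jobs

# Compute the MPDist between every pair of time series.
distance_matrix = pairwise_dist(data, window_size=window_size, n_jobs=n_jobs)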

In [5]:

square_distance_matrix = squareform(distance_matrix)
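
To make the role of squareform concrete, here is a minimal sketch on a toy input. It assumes, as the cell above implies, that pairwise_dist returns distances in SciPy's condensed form (one value per unordered pair of series). The three distances below are made up purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical condensed distances for 3 series: d(0,1), d(0,2), d(1,2).
condensed = np.array([1.0, 2.0, 3.0])

# squareform expands the n*(n-1)/2 condensed values into a symmetric
# n x n matrix with zeros on the diagonal.
square = squareform(condensed)
print(square)
```

Under the same assumption, the eight series in this notebook would give a condensed vector of 8 * 7 / 2 = 28 distances and an 8 x 8 square matrix, which is the layout that metric='precomputed' expects.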

In [6]:

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit(square_distance_matrix)
clusterer.labels_

Out[6]:

array([0, 0, 0, 0, 0, 1, 1, 1])

As expected, the first five time series (the random copies) fall into cluster 0 and the last three (the increasing series) fall into cluster 1.
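
The label array can also be turned into explicit groups of series indices. The sketch below is plain Python and reuses the labels printed in Out[6]; the helper name group_by_label is hypothetical and not part of hdbscan or matrixprofile. HDBSCAN assigns the label -1 to points it treats as noise, so those are collected separately.

```python
from collections import defaultdict


def group_by_label(labels):
    """Map each cluster label to the list of series indices assigned to it."""
    clusters = defaultdict(list)
    noise = []
    for index, label in enumerate(labels):
        if label == -1:  # -1 marks noise points in HDBSCAN's output
            noise.append(index)
        else:
            clusters[int(label)].append(index)
    return dict(clusters), noise


# Labels copied from Out[6].
labels = [0, 0, 0, 0, 0, 1, 1, 1]
clusters, noise = group_by_label(labels)
print(clusters)  # {0: [0, 1, 2, 3, 4], 1: [5, 6, 7]}
print(noise)     # []
```

In the notebook itself, clusterer.labels_ could be passed to the helper in place of the hard-coded list.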