Clustering: Computing the Pairwise Distance Matrix

Learn how to compute an MPDist-based pairwise distance matrix for clustering.

This short code tutorial demonstrates how to compute the MPDist-based pairwise distance matrix. The result can be passed to any clustering algorithm that accepts a precomputed distance matrix.

In [1]:
from matrixprofile.algorithms.hierarchical_clustering import pairwise_dist
import numpy as np
In [2]:
%pdoc pairwise_dist
Class docstring:
    Utility function to compute all pairwise distances between the timeseries
    using MPDist. 
    
    Note
    ----
    scipy.spatial.distance.pdist cannot be used because they
    do not allow for jagged arrays, however their code was used as a reference
    in creating this function.
    https://github.com/scipy/scipy/blob/master/scipy/spatial/distance.py#L2039
    
    Parameters
    ----------
    X : array_like
        An array_like object containing time series to compute distances for.
    window_size : int
        The window size to use in computing the MPDist.
    threshold : float
        The threshold used to compute MPDist.
    n_jobs : int
        Number of CPU cores to use during computation.
    
    Returns
    -------
    Y : np.ndarray
        Returns a condensed distance matrix Y.  For
        each :math:`i` and :math:`j` (where :math:`i<j<m`),where m is the 
        number of original observations. The metric ``dist(u=X[i], v=X[j])``
        is computed and stored in entry ``ij``.
Call docstring:
    Call self as a function.

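This function computes a condensed distance matrix for all time series of interest: a flat array that stores one distance per unordered pair of series, so n series yield n(n-1)/2 values. Below is an example of computing the distance matrix on five randomly generated time series, which gives 10 distances.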

In [3]:
# generate 5 random time series

data = []
size = 100

for _ in range(5):
    data.append(np.random.uniform(size=size))
In [4]:
window_size = 8
n_jobs = 4

distance_matrix = pairwise_dist(data, window_size=window_size, n_jobs=n_jobs)
In [5]:
distance_matrix
Out[5]:
array([1.2334854 , 1.13236744, 1.124416  , 1.17065294, 1.14144607,
       1.2107359 , 1.08488366, 1.09598017, 0.98853814, 0.98214056])

Converting to Square Form

Some clustering algorithms require the distance matrix to be square. In that case, convert the condensed matrix with SciPy's squareform, which returns a symmetric matrix with zeros on the diagonal.

In [6]:
from scipy.spatial.distance import squareform
In [7]:
square_distance_matrix = squareform(distance_matrix)
In [8]:
square_distance_matrix
Out[8]:
array([[0.        , 1.2334854 , 1.13236744, 1.124416  , 1.17065294],
       [1.2334854 , 0.        , 1.14144607, 1.2107359 , 1.08488366],
       [1.13236744, 1.14144607, 0.        , 1.09598017, 0.98853814],
       [1.124416  , 1.2107359 , 1.09598017, 0.        , 0.98214056],
       [1.17065294, 1.08488366, 0.98853814, 0.98214056, 0.        ]])
