NYC Taxi Analysis - Anomalies (Python)

Practical Discord Discovery with Matrix Profiles

In this quick tutorial, you will learn how to use the Python library "matrixprofile" to detect anomalies in the NYC Taxi dataset. The dataset consists of passenger counts aggregated into 30-minute intervals from 2014-07-01 to 2015-01-31. It contains five known anomalies, corresponding to these events:

  • NYC Marathon - 2014-11-02
  • Thanksgiving - 2014-11-27
  • Christmas - 2014-12-25
  • New Year's Day - 2015-01-01
  • Snow Blizzard - 2015-01-26 and 2015-01-27

Library Imports and Data Loading

This section loads the libraries and data used throughout the tutorial.

In [1]:
import matrixprofile as mp

# suppress all warnings (mostly matplotlib noise) to keep the output clean
import warnings
warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt

%matplotlib inline
In [2]:
dataset = mp.datasets.load('nyc-taxi-anomalies.csv')

Visualize Data

Here we simply visualize the raw data to understand what we are working with.

In [3]:
plt.figure(figsize=(15,3))
plt.plot(dataset['datetime'], dataset['data'])
plt.title('NYC Taxi Passenger Counts')
plt.ylabel('Passenger Count')
plt.xlabel('Datetime')
plt.tight_layout()
plt.show()

Compute Matrix Profile, Discover Discords and Visualize

Since the dataset is in 30-minute intervals and we want to find daily events, we compute the Matrix Profile with a window size of 48 (48 samples x 30 minutes = 24 hours). Once the Matrix Profile is computed, it can be used to find discords. By default, the discord search uses the same exclusion zone that was used to compute the Matrix Profile (0.25 * window size for the default compute algorithm). With a window size of 48, that is 12 samples, so about 6 hours of time are excluded when finding non-trivial discords.
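
The arithmetic behind those numbers can be checked with a short standalone sketch. It uses only the Python standard library; the variable names are illustrative, and the 0.25 factor is simply the default described above rather than a value read from the library.

```python
from datetime import timedelta

# The dataset is sampled every 30 minutes.
sample_interval = timedelta(minutes=30)

# Number of samples that cover one full day.
window_size = timedelta(days=1) // sample_interval

# Assumption taken from the text above: default exclusion zone = 0.25 * window size.
default_exclusion_zone = int(0.25 * window_size)

# Wall-clock time covered by that many samples.
excluded_time = default_exclusion_zone * sample_interval

print(window_size, default_exclusion_zone, excluded_time)
```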

In [4]:
window_size = 48
profile = mp.compute(dataset['data'], windows=window_size)
profile = mp.discover.discords(profile, k=5)
In [5]:
mp.visualize(profile)
plt.show()
In [6]:
for dt in dataset['datetime'][profile['discords']]:
    print(dt)
2015-01-27T09:00:00
2015-01-26T13:00:00
2014-11-02T00:30:00
2014-11-01T04:00:00
2015-01-26T19:30:00

Discord Discovery Tuning

With the default exclusion zone, some discords fall on the same day: two of the five above start on 2015-01-26. However, we want to find unique days. We can do this by widening the exclusion zone to an entire day (48 samples).
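
To make the duplication concrete, the timestamps printed by the first discord search can be tallied by calendar day. This is a small standalone sketch using only the standard library; the list is copied from that output, and the variable names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta

# Discord start times copied from the first discord search output.
first_pass = [
    "2015-01-27T09:00:00",
    "2015-01-26T13:00:00",
    "2014-11-02T00:30:00",
    "2014-11-01T04:00:00",
    "2015-01-26T19:30:00",
]

# Count how many discords start on each calendar day.
per_day = Counter(datetime.fromisoformat(ts).date().isoformat() for ts in first_pass)
repeated_days = [day for day, count in per_day.items() if count > 1]

# Distance between the two discords on the repeated day, in 30-minute samples.
gap = datetime.fromisoformat(first_pass[4]) - datetime.fromisoformat(first_pass[1])
gap_in_samples = gap // timedelta(minutes=30)

for day, count in sorted(per_day.items()):
    print(day, count)
print("repeated:", repeated_days, "gap in samples:", gap_in_samples)
```

The two discords on the repeated day are 13 samples apart, just outside a 12-sample exclusion zone, which is consistent with both being reported.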

In [7]:
profile = mp.discover.discords(profile, exclusion_zone=window_size, k=5)

Instead of generating every available plot, we can import just the function that plots the discords.

In [8]:
from matrixprofile.visualize import plot_discords_mp
In [9]:
plot_discords_mp(profile)
plt.show()
In [10]:
for dt in dataset['datetime'][profile['discords']]:
    print(dt)
2015-01-27T09:00:00
2014-11-02T00:30:00
2015-01-25T20:30:00
2014-12-31T05:30:00
2014-07-03T07:00:00

Wrapping Up

Using the newly defined exclusion zone, the discords fall on distinct days that can be matched against the known anomalous events. Another approach would be to simply request more discords from the search. However, to find the defined anomalies, only unique days are of interest.
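
One way to compare the result with the known events is sketched below. It assumes that each discord timestamp marks the start of a 48-sample window, so a window can run into the following calendar day. The dates and timestamps are copied from this tutorial; the helper name `events_in_window` is hypothetical and not part of matrixprofile.

```python
from datetime import date, datetime, timedelta

# Known anomalies listed at the top of this tutorial.
KNOWN_EVENTS = {
    date(2014, 11, 2): "NYC Marathon",
    date(2014, 11, 27): "Thanksgiving",
    date(2014, 12, 25): "Christmas",
    date(2015, 1, 1): "New Year's Day",
    date(2015, 1, 26): "Snow Blizzard",
    date(2015, 1, 27): "Snow Blizzard",
}

# Discord start times printed after widening the exclusion zone.
DISCORD_STARTS = [
    "2015-01-27T09:00:00",
    "2014-11-02T00:30:00",
    "2015-01-25T20:30:00",
    "2014-12-31T05:30:00",
    "2014-07-03T07:00:00",
]

SAMPLE_INTERVAL = timedelta(minutes=30)
WINDOW_SIZE = 48


def events_in_window(start_text):
    """Return the known events whose date is touched by the window starting at start_text."""
    start = datetime.fromisoformat(start_text)
    last_sample = start + (WINDOW_SIZE - 1) * SAMPLE_INTERVAL
    # A one-day window touches at most two calendar dates.
    touched_days = {start.date(), last_sample.date()}
    return sorted({KNOWN_EVENTS[d] for d in touched_days if d in KNOWN_EVENTS})


matches = {start: events_in_window(start) for start in DISCORD_STARTS}
for start, events in matches.items():
    print(start, events if events else "no listed event")
```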

It is important to note that the window size and exclusion zone must be chosen to match what you are interested in. For example, I could be interested in finding at most one anomalous day within any week-long period. In this case I would increase the exclusion zone to roughly 3 days (about 144 samples). This is because the exclusion zone is applied both before and after the discord, so 3 days on each side spans roughly a week.
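
The effect of an exclusion zone that reaches both before and after a discord can be illustrated with a simplified, pure-Python sketch. This is not the matrixprofile implementation: the function `top_k_discords` and the toy profile are hypothetical. It only shows the general idea of repeatedly picking the largest profile value and masking its neighbours.

```python
def top_k_discords(profile_values, k, exclusion_zone):
    """Return indices of the k largest values, skipping neighbours of earlier picks."""
    values = [float(v) for v in profile_values]
    masked = float("-inf")
    picks = []
    for _ in range(min(k, len(values))):
        idx = max(range(len(values)), key=values.__getitem__)
        if values[idx] == masked:
            break  # every remaining position is already excluded
        picks.append(idx)
        # Mask exclusion_zone positions before AND after the pick.
        start = max(0, idx - exclusion_zone)
        stop = min(len(values), idx + exclusion_zone + 1)
        for i in range(start, stop):
            values[i] = masked
    return picks


toy_profile = [0, 1, 5, 4, 0, 0, 3, 0, 0, 2]
narrow = top_k_discords(toy_profile, k=3, exclusion_zone=0)  # [2, 3, 6]
wide = top_k_discords(toy_profile, k=3, exclusion_zone=1)    # [2, 6, 9]
print(narrow, wide)
```

With no exclusion zone, the second pick sits right next to the first. With a zone of 1, the neighbour at index 3 is skipped and the picks spread out, which is the same effect as moving from a quarter-day to a full-day exclusion zone.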
