Meet Jackson Green

Jackson Green

What is your personal & professional background?

I have been interested in computers and technology from a very young age. I started teaching myself the basic tools for web design and development in high school and took all the classes offered on the subject. I followed this passion to college, where I received a B.S. in Computer Science as well as a B.A. in Theatre. Right after graduating in 2018, I started work as a Software Developer at CRST International, where I continue to create in-house web applications and web page design specifications.

How did you get connected with the Matrix Profile and MPF?

While working at CRST I met Tyler Marrs, who asked me if I would be willing to design a logo for an open source project he was working on at the time. After I created the logo, he asked me to make a website showcasing the people who had been working on that project. After some time, they asked me to join the team and take responsibility for maintaining the design of the website and branding materials. That open source project was MPF.

What do you do at MPF?

I oversee the website design, logo, and brand colors. The title we landed on was Chief Design Officer, which I really like, but it probably means you won’t see my name on any commits to the organization’s main libraries… for now.

What excites you about the future of Matrix Profile and the MPF?

I am new to the matrix profile as a whole, so the thing that excites me most is learning more about it. I have been shown how useful a tool like this can be in so many different areas, and I can’t wait for it to reach its full potential.

ECG Heartbeat Analysis (Python)

Discovering Motifs with Matrix Profiles and Annotation Vectors

ECG Motifs - Annotation Vectors

Annotation vectors (AVs) are series of numbers in the range [0, 1] that indicate how significant a motif is at each index. For example, a 1 in the AV means that any motif starting at that index is highly important and should be conserved, whereas a 0 means that the motif can be discarded or ignored. As a result, annotation vectors allow you to ignore stop words or insignificant patterns in your data. This example shows the basics of using annotation vectors on real-world data to select for specific motifs or patterns.

Data Overview

The data is a snippet from a large collection of ECG heartbeat data from the LTAF-71 database. The first half of this time series contains the calibration signal whereas the second half contains the actual ECG heartbeat.

In [1]:
import matrixprofile as mp
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
ecg = mp.datasets.load('ecg-heartbeat-av')
ts = ecg['data']
window_size = 150

Motifs with Matrix Profile

Let's compute the regular matrix profile and use that to find our top motif.

In [3]:
profile = mp.compute(ts, windows=window_size)
profile = mp.discover.motifs(profile, k=1)
In [4]:
figures = mp.visualize(profile)

As can be seen above, the top motif is the calibration signal. This is because the calibration signal is better conserved than the heartbeat signal, which varies due to the slight irregularity of a heartbeat. However, since we have the domain-specific knowledge that the calibration signal should be ignored, we can use annotation vectors to help us discard the calibration signal motif.

In [5]:
# note the calibration signal starts to fade after the 1200th data point
threshold = 1200

# fill in first 1200 data points with 0s and rest with 1s to indicate
# that the calibration signal is not as important as heartbeat
av = np.append(np.zeros(threshold), np.ones(len(profile['mp']) - threshold))

Motifs with Corrected Matrix Profile

Now we can apply the AV that we created to our original matrix profile and get a "corrected" matrix profile, or CMP. We can then use this CMP to re-discover the top motif.

In [6]:
profile = mp.transform.apply_av(profile, "custom", av)
profile = mp.discover.motifs(profile, k=1, use_cmp=True)
In [7]:
figures = mp.visualize(profile)

Note how the top motif is now the heartbeat motif, which is the one we wanted to select. With domain-specific knowledge about the time series you are analyzing, annotation vectors can be an important tool in selecting for motifs that are important and ignoring motifs that are irrelevant.
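For intuition, the correction step can be sketched in a few lines of NumPy. This sketch assumes the formulation commonly given in the annotation vector literature, CMP = MP + (1 - AV) * max(MP); whether `apply_av` computes exactly this internally is an assumption here, and the function name and toy values below are made up purely for illustration.

```python
import numpy as np

def corrected_matrix_profile(matrix_profile, av):
    """Raise the distance wherever the annotation vector is below 1.

    Assumed formula: CMP = MP + (1 - AV) * max(MP).
    """
    matrix_profile = np.asarray(matrix_profile, dtype=float)
    av = np.asarray(av, dtype=float)
    if matrix_profile.shape != av.shape:
        raise ValueError("matrix profile and annotation vector must have equal length")
    return matrix_profile + (1.0 - av) * np.max(matrix_profile)

# Toy values (illustrative only): the lowest distance sits at index 1,
# inside the region that the annotation vector marks as unimportant.
toy_mp = np.array([0.5, 0.1, 0.4, 0.3, 0.9])
toy_av = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

toy_cmp = corrected_matrix_profile(toy_mp, toy_av)
print("best motif index before correction:", int(np.argmin(toy_mp)))
print("best motif index after correction: ", int(np.argmin(toy_cmp)))
```

Because positions with an AV of 0 are pushed up by the largest distance in the profile, they can no longer win the search for the lowest distance, while positions with an AV of 1 are left untouched.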

Meet Francisco Bischoff

Francisco Bischoff

What is your personal & professional background?

I'm a medical doctor specialized in Immunohemotherapy. I've been a computer enthusiast since childhood and learned to program when I was about 15 years old. I have a Master's in Medical Informatics and am currently enrolled in a PhD in Health Data Science.

How did you get connected with the Matrix Profile and MPF?

After my MSc, I started to study time series in depth and to look for new methods. That's when I came across some of Eamonn's papers and, after some e-mail exchanges, I got hooked by The Matrix.

What do you do at MPF?

I work on the R language branch of MPF with the tsmp package, and I'm also involved in API design and in implementing low-level algorithms.

What excites you about the future of Matrix Profile and the MPF?

I see great potential in time-series data mining through the matrix profile. The MPF was a great achievement and will help to push the matrix profile to new boundaries.

How To Painlessly Analyze Your Time Series

An Introduction to MPA: the Matrix Profile API

We’re surrounded by time series data. From finance to IoT to marketing, many organizations produce thousands of these metrics and mine them to uncover business-critical insights. A Site Reliability Engineer might monitor hundreds of thousands of time series streams from a server farm, in the hopes of detecting anomalous events and preventing catastrophic failure. Alternatively, a brick and mortar retailer might care about identifying patterns of customer foot traffic and leveraging them to guide inventory decisions.

Identifying anomalous events (or “discords”) and repeated patterns (“motifs”) are two fundamental time series tasks. But how does one get started? There are dozens of approaches to both questions, each with unique positives and drawbacks.

Furthermore, time series data is notoriously hard to analyze, and the explosive growth of the data science community has led to a need for more “black-box” automated solutions that can be leveraged by developers with a wide range of technical backgrounds.

Pic # 1

We at the Matrix Profile Foundation believe there’s an easy answer. While it’s true that there’s no such thing as a free lunch, the Matrix Profile (a data structure & set of associated algorithms developed by the Keogh research group at UC-Riverside) is a powerful tool to help solve this dual problem of anomaly detection and motif discovery. Matrix Profile is robust, scalable, and largely parameter-free: we’ve seen it work for a wide range of metrics including website user data, order volume and other business-critical applications.

As we will detail below, the Matrix Profile Foundation has implemented the Matrix Profile across three of the most common data science languages (Python, R and Golang) as an easy-to-use API that’s relevant for time series novices and experts alike.

So What is the Matrix Profile?

The basics of Matrix Profile are simple: If I take a snippet of my data and slide it along the rest of the time series, how well does it overlap at each new position? More specifically, we can evaluate the Euclidean distance between a subsequence and every possible time series segment of the same length, building up what’s known as the snippet’s “Distance Profile.”

If the subsequence repeats itself in the data, there will be at least one perfect match and the minimum Euclidean distance will be zero (or close to zero in the presence of noise). In contrast, if the subsequence is highly unique (say it contains a significant outlier), matches will be poor and all overlap scores will be high. Note that the type of data is irrelevant: We’re only looking at general pattern conservation.

We then slide every possible snippet across the time series, building up a collection of Distance Profiles. By taking the minimum value for each time step across all distance profiles, we can build the final Matrix Profile. Notice that both ends of the Matrix Profile value spectrum are useful. High values indicate uncommon patterns or anomalous events; in contrast, low values highlight repeatable motifs and provide valuable insight into your time series of interest.
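The procedure described above can be sketched as a brute-force NumPy routine. This is only a teaching sketch and not how the MPF libraries compute the Matrix Profile: it assumes z-normalized Euclidean distance (common in the Matrix Profile literature), it assumes an exclusion zone of a quarter window around each snippet so that a snippet does not trivially match itself, and all function names and the toy series are hypothetical.

```python
import numpy as np

def znormalize(x):
    """Scale a subsequence to zero mean and unit variance."""
    std = x.std()
    if std == 0:
        return np.zeros(len(x))
    return (x - x.mean()) / std

def distance_profile(ts, start, window):
    """Distance from the snippet at `start` to every subsequence of equal length."""
    query = znormalize(ts[start:start + window])
    n_subs = len(ts) - window + 1
    return np.array([
        np.linalg.norm(query - znormalize(ts[i:i + window]))
        for i in range(n_subs)
    ])

def naive_matrix_profile(ts, window):
    """Keep the smallest non-trivial distance of each distance profile."""
    ts = np.asarray(ts, dtype=float)
    n_subs = len(ts) - window + 1
    exclusion = int(np.ceil(window / 4))  # assumed exclusion zone
    profile = np.empty(n_subs)
    for start in range(n_subs):
        dp = distance_profile(ts, start, window)
        lo = max(0, start - exclusion)
        hi = min(n_subs, start + exclusion + 1)
        dp[lo:hi] = np.inf  # ignore trivial matches near the snippet itself
        profile[start] = dp.min()
    return profile

# Toy series: a clean sine wave with a short bump planted at indices 120-124.
t = np.arange(200)
toy = np.sin(2 * np.pi * t / 25)
toy[120:125] += 3.0

window = 25
toy_profile = naive_matrix_profile(toy, window)
motif_idx = int(np.argmin(toy_profile))    # low value: repeated pattern
discord_idx = int(np.argmax(toy_profile))  # high value: unusual pattern
print("motif starts at", motif_idx, "| discord starts at", discord_idx)
```

The clean sine cycles find near-perfect matches elsewhere in the series, so their profile values are close to zero, while the subsequences that overlap the planted bump have no good match and receive the highest values. The cost grows quadratically with the length of the series, which is why optimized algorithms are used in practice.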

Pic # 2

For those interested, this post by one of our co-founders provides a more in-depth discussion of the Matrix Profile.

Although the Matrix Profile can be a game-changer for time series analysis, leveraging it to produce insights is a multi-step computational process, where each step requires some level of domain experience. However, we believe that the most powerful breakthroughs in data science occur when the complex is made accessible. When it comes to the Matrix Profile, there are three facets to accessibility: “out-of-the-box” working implementations, gentle introductions to core concepts that can naturally lead into deeper exploration, and multi-language accessibility.

Today, we’re proud to unveil the Matrix Profile API (MPA), a common codebase written in R, Python and Golang that achieves all three of these goals.

MPA: how it works

Using the Matrix Profile consists of three steps.

First, you Compute the Matrix Profile itself. However, this is not the end: you need to Discover something by leveraging the Matrix Profile that you’ve created. Do you want to find repeated patterns? Or perhaps uncover anomalous events? Finally, it’s critical that you Visualize your findings, as time series analysis greatly benefits from some level of visual inspection.

Normally, you’d need to read through pages of documentation (both academic and technical) to figure out how to execute each of these three steps. This may not be a challenge if you’re an expert with prior knowledge of the Matrix Profile, but we’ve seen that many users simply want to Analyze their data by breaking through the methodology to get to a basic starting point. Can the code simply leverage some reasonable defaults to produce a reasonable output?

To parallel this natural computational flow, MPA consists of three core components:

  1. Compute (computing the Matrix Profile)

  2. Discover (evaluate the MP for motifs, discords, etc.)

  3. Visualize (display results through basic plots)

These three capabilities are wrapped up into a high-level capability called Analyze. This is a user-friendly interface that enables people who know nothing about the inner workings of Matrix Profile to quickly leverage it for their own data. And as users gain more experience and intuition with MPA, they can easily dive deeper into any of the three core components to make further functional gains.

MPA: a toy example

As an example, we’ll use the Python flavor of MPA to analyze the synthetic time series shown below:

In [ ]:
from matplotlib import pyplot as plt
import numpy as np
import matrixprofile as mp

%matplotlib inline

# load the synthetic example dataset
dataset = mp.datasets.load('motifs-discords-small')
vals = dataset['data']

# plot the raw values against their index
fig, ax = plt.subplots(figsize=(20, 10))
ax.plot(np.arange(len(vals)), vals, label='Test Data')

Pic # 3

Visual inspection reveals that there are both patterns and discords present. However, one immediate problem is that your choice of subsequence length will change both the number and location of your motifs! Are there only two sinusoidal motifs present between indices 0–500, or is each cycle an instance of the pattern? Let’s see how MPA handles this challenge:

In [ ]:
profile, figures = mp.analyze(vals)

Pic # 4

Because we haven’t specified any information regarding our subsequence length, analyze begins by leveraging a powerful calculation known as the pan-matrix profile (or PMP) to generate insights that will help us evaluate different subsequence lengths. We’ll discuss the details of PMP in a later post (or you can read the associated paper), but in a nutshell, it is a global calculation of all possible subsequence lengths condensed into a single visual summary. The X-axis is the index of the matrix profile, and the Y-axis is the corresponding subsequence length. The darker the shade, the lower the Euclidean distance at that point. We can use the “peaks” of the triangles to find the 6 “big” motifs visually present in the synthetic time series.
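To make the idea of a profile per subsequence length concrete, here is a rough brute-force sketch that stacks several matrix profiles into one 2-D array. It is not the PMP algorithm used by `analyze`; the helper names, the quarter-window exclusion zone, the toy series and the choice of window sizes are all assumptions for illustration. Each row is divided by 2 * sqrt(window), the largest possible z-normalized Euclidean distance for that window, so that rows of different lengths are comparable. It relies on `sliding_window_view`, which requires NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def naive_matrix_profile(ts, window):
    """Brute-force matrix profile using z-normalized Euclidean distance."""
    subs = sliding_window_view(np.asarray(ts, dtype=float), window)
    std = subs.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero on flat subsequences
    z = (subs - subs.mean(axis=1, keepdims=True)) / std
    # all pairwise distances between subsequences
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    # mask trivial matches (assumed exclusion zone: a quarter window)
    exclusion = int(np.ceil(window / 4))
    idx = np.arange(len(z))
    dist[np.abs(idx[:, None] - idx[None, :]) <= exclusion] = np.inf
    return dist.min(axis=1)

def naive_pan_matrix_profile(ts, windows):
    """One normalized matrix profile per window size, padded with NaN."""
    width = len(ts) - min(windows) + 1
    pmp = np.full((len(windows), width), np.nan)
    for row, window in enumerate(windows):
        profile = naive_matrix_profile(ts, window)
        pmp[row, :len(profile)] = profile / (2.0 * np.sqrt(window))
    return pmp

# Toy series: a noisy sine wave with a period of 20 samples.
rng = np.random.default_rng(0)
toy = np.sin(2 * np.pi * np.arange(120) / 20) + 0.05 * rng.standard_normal(120)

windows = [10, 20, 30]
pmp = naive_pan_matrix_profile(toy, windows)
print(pmp.shape)
print(np.nanmean(pmp, axis=1))  # average normalized distance per window size
```

Rendering such an array as a heatmap, for example with matplotlib's `imshow`, gives a picture in the spirit of the one described above: one axis for the profile index, the other for the subsequence length, with shading driven by the distance.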

The PMP is all well and good, but we promised a simple way of understanding your time series. To facilitate this, analyze combines PMP with an under-the-hood algorithm to choose sensible motifs and discords from across all possible window sizes. The additional graphs created by analyze show the top three motifs and top three discords, along with the corresponding window size and position within the Matrix Profile (and, by extension, your time series).

Pic # 6

Pic # 7

Pic # 8

Pic # 9

Pic # 10

Not surprisingly, this is a lot of information coming out of the default setting. Our goal is that this core function call can serve as a jumping-off point for many of your future analyses. For example, the PMP indicates that there is a conserved motif of length ~175 within our time series. Try calling analyze on that subsequence length and see what happens!

In [ ]:
profile, figures = mp.analyze(vals, windows=175)

Wrapping Up

We hope that MPA enables you to more painlessly analyze your time series, and please star our GitHub repos if you find the code useful! MPF also operates a Discord channel where you can engage with fellow users of the Matrix Profile and ask questions. Happy time series hunting!

Acknowledgements

Thank you to Tyler Marrs, Frankie Cancino, Francisco Bischoff, Austin Ouyang and Jack Green for reviewing this article and assisting in its creation. And above all, thank you to Eamonn Keogh, Abdullah Mueen and their numerous graduate students for creating the Matrix Profile and continuing to drive its development.

Supplemental

  1. Matrix Profile research papers can be found on Eamonn Keogh’s UCR web page:

https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

  2. The Python implementation of Matrix Profile algorithms can be found here:

https://github.com/matrix-profile-foundation/matrixprofile

  3. The R implementation of Matrix Profile algorithms can be found here:

https://github.com/matrix-profile-foundation/tsmp

  4. The Golang implementation of Matrix Profile algorithms can be found here:

https://github.com/matrix-profile-foundation/go-matrixprofile

Meet Andrew Van Benschoten

Andrew Van Benschoten

What is your personal & professional background?

I'm a recovering academic: I earned my bachelor's degree in Biology from MIT, then went to the University of California, San Francisco for a Ph.D. in Biophysics. I graduated in 2015, right when data science was emerging as the hot new career in Silicon Valley. Since my thesis leveraged a significant amount of computational biology, it seemed like a perfect fit, and I joined the Insight Data Science Fellows program to help with my transition. From there I went to Oracle (digital advertising), Target (data measurement & telemetry plus lots of DevOps) and am now the Head of Data Science at Ovative Group (back to digital advertising).

How did you get connected with the Matrix Profile and MPF?

At Target, Frankie and I worked on a team responsible for detecting anomalies across thousands of disparate time series, from key business metrics to server farm performance stats. This led us to discover the Matrix Profile and ultimately open-source the original matrixprofile-ts library. This then led to subsequent connections with other authors of matrix profile implementations. After a few months of working together we decided to make things more official and form a cohesive, unified group known as the Matrix Profile Foundation.

What do you do at MPF?

I play the role of CEO on our leadership board. My primary job is to facilitate development of MPF's vision & direction, and then identify the right strategy to get us there. However, I'm also an organizational nerd, which means I double as our unofficial Scrum Master. And I still do a little Python programming wherever possible.

What excites you about the future of Matrix Profile and the MPF?

Big Data is only getting bigger. We're just scratching the surface of IoT and connected applications, which means that there will be a massive need for high-performing algorithms both now and in the future. I'm a little biased, but I think there's a 5% chance that Matrix Profile will be the "next big thing" in Data Science, and I'm excited to help bring that to fruition. It's also been a great experience being a part of the broader Open Source community and engaging with so many fascinating applications.

NYC Taxi Analysis - Anomalies (Python)

Practical Discord Discovery with Matrix Profiles

In this quick tutorial, you will learn how to use the Python library "matrixprofile" to detect anomalies within the NYC Taxi dataset. This dataset is composed of passenger counts in 30-minute intervals from 2014-07-01 to 2015-01-31. There are 5 known anomalies within this dataset, corresponding to these events:

  • NYC Marathon - 2014-11-02
  • Thanksgiving - 2014-11-27
  • Christmas - 2014-12-25
  • New Year's Day - 2015-01-01
  • Snow Blizzard - 2015-01-26 and 2015-01-27

Library Imports and Data Loading

In this section the libraries and data used throughout the tutorial are loaded.

In [1]:
import matrixprofile as mp

# ignore matplotlib warnings
import warnings
warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt

%matplotlib inline
In [2]:
dataset = mp.datasets.load('nyc-taxi-anomalies.csv')

Visualize Data

Here we simply visualize the raw data to understand what we are working with.

In [3]:
plt.figure(figsize=(15,3))
plt.plot(dataset['datetime'], dataset['data'])
plt.title('NYC Taxi Passenger Counts')
plt.ylabel('Passenger Count')
plt.xlabel('Datetime')
plt.tight_layout()
plt.show()

Compute Matrix Profile, Discover Discords and Visualize

Since the dataset is sampled at 30-minute intervals and we want to find daily events, we compute the Matrix Profile with a window size of 48 (48 samples of 30 minutes each cover 24 hours). Once the Matrix Profile is computed, it can be used to find discords. By default the algorithm uses the same exclusion zone that was used to compute the Matrix Profile (0.25 * window size for the default compute algorithm). With a window size of 48, this excludes 12 samples, or about 6 hours of time, when finding non-trivial discords.
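The window and exclusion zone arithmetic above can be double-checked with the standard library. The variable names are only illustrative, and the 0.25 factor is taken from the description above rather than read from the library.

```python
from datetime import timedelta

SAMPLE_INTERVAL = timedelta(minutes=30)

# number of 30-minute samples that span one full day
window_size = timedelta(days=1) // SAMPLE_INTERVAL

# exclusion zone quoted above: a quarter of the window, in samples
exclusion_zone = int(0.25 * window_size)

# the same exclusion zone expressed as a span of time
excluded_time = exclusion_zone * SAMPLE_INTERVAL

print(window_size, exclusion_zone, excluded_time)
```

The same calculation adapts to other sampling rates: change `SAMPLE_INTERVAL` and the event length of interest, and the window size follows.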

In [4]:
window_size = 48
profile = mp.compute(dataset['data'], windows=window_size)
profile = mp.discover.discords(profile, k=5)
In [5]:
mp.visualize(profile)
plt.show()
In [6]:
for dt in dataset['datetime'][profile['discords']]:
    print(dt)
2015-01-27T09:00:00
2015-01-26T13:00:00
2014-11-02T00:30:00
2014-11-01T04:00:00
2015-01-26T19:30:00

Discord Discovery Tuning

With this exclusion zone, we get duplicate anomalies on the same day (for example, two of the discords above fall on 2015-01-26). However, we want to find unique days. We can do this by widening our exclusion zone to an entire day.

In [7]:
profile = mp.discover.discords(profile, exclusion_zone=window_size, k=5)
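To see why a wider exclusion zone removes near-duplicate discords, consider the following sketch of a greedy top-k selection. It is an assumption about the general technique, not a copy of how `mp.discover.discords` is implemented, and the function name and toy profile are hypothetical.

```python
import numpy as np

def top_k_discords(matrix_profile, k, exclusion_zone):
    """Pick the k largest profile values, masking the neighbours of each pick."""
    work = np.array(matrix_profile, dtype=float)  # copy, so the input is untouched
    work[~np.isfinite(work)] = -np.inf            # never pick NaN or inf entries
    picks = []
    for _ in range(k):
        idx = int(np.argmax(work))
        if work[idx] == -np.inf:
            break  # nothing left to pick
        picks.append(idx)
        lo = max(0, idx - exclusion_zone)
        hi = min(len(work), idx + exclusion_zone + 1)
        work[lo:hi] = -np.inf  # exclude the pick and its neighbours
    return picks

# Toy profile (illustrative only): indices 2 and 3 belong to the same event.
toy_mp = np.array([1, 2, 9, 8, 2, 1, 7, 1, 6, 1], dtype=float)

narrow = top_k_discords(toy_mp, k=3, exclusion_zone=0)
wide = top_k_discords(toy_mp, k=3, exclusion_zone=1)
print("narrow exclusion zone:", narrow)
print("wide exclusion zone:  ", wide)
```

With the narrow zone, two of the three picks are adjacent and describe the same event. With the wider zone, the second pick is forced away from the first, which is the same effect we rely on when setting the exclusion zone to a full day.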

Instead of generating every available plot, we can import just the function that plots the discords.

In [8]:
from matrixprofile.visualize import plot_discords_mp
In [9]:
plot_discords_mp(profile)
plt.show()