How to Preprocess Your Time Series
Handling missing data and singular continuous values
Why do we need to pre-process time series data?¶
When a time series is complete, the compute and analyze functions generally produce a sound matrix profile. Real-world time series, however, often contain missing values. If those values are not handled before the computation, both the matrix profile and any analysis built on it become unreliable.
The following example illustrates one such problem. To create a time series with missing data, we insert np.nan and np.inf into the sample dataset used in the Quickstart Guide, then call the compute and analyze functions on the result.
# Import Library
import matrixprofile as mp
import numpy as np
from matplotlib import pyplot as plt
# ignore matplotlib warnings
import warnings
warnings.filterwarnings("ignore")
# Load Data
dataset = mp.datasets.load('motifs-discords-small')
ts = dataset['data']
# Add missing data to the original time series
ts[99] = np.nan
ts[199] = np.inf
# Compute and analyze the MatrixProfile
profile = mp.compute(ts, windows=32)
print(profile['mp'][:120])
profile, figures = mp.analyze(ts, windows=32)
The output and the first two plots show that once the time series contains a missing value, the matrix profile becomes constant from the start of the first subsequence that contains it. In this example that position is 68: with a window of 32, the first subsequence covering index 99 starts at 99 - 32 + 1 = 68. Comparing these results with those shown in the Quickstart Guide makes the problem obvious. This matrix profile is meaningless, and the correct motifs and discords cannot be identified from it.
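The starting position of 68 follows from window arithmetic alone. The sketch below uses only NumPy and a synthetic sine wave rather than the sample dataset, and lists every subsequence start whose window covers a non-finite value. The helper name affected_window_starts is hypothetical and not part of matrixprofile, and the sketch does not reproduce how the library propagates the missing values afterwards.

```python
import numpy as np

def affected_window_starts(ts, window):
    """Return the start index of every subsequence that contains NaN or inf."""
    bad = ~np.isfinite(np.asarray(ts, dtype=float))
    # counts[i] is the number of non-finite points in ts[i : i + window]
    counts = np.convolve(bad.astype(int), np.ones(window, dtype=int), mode='valid')
    return np.flatnonzero(counts > 0)

# Synthetic stand-in for the sample dataset (assumption for illustration)
ts = np.sin(np.linspace(0, 20, 400))
ts[99] = np.nan
ts[199] = np.inf

starts = affected_window_starts(ts, window=32)
print(starts.min(), len(starts))
```

Each missing point contaminates exactly as many subsequences as the window is long, so two missing points that are far apart affect 64 subsequences here.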
To address the issues caused by missing data, we introduce a new preprocess module that can handle them in some cases.
Introduce a preprocessing procedure to avoid potential problems with computing and analyzing MatrixProfiles¶
Enable preprocessing in compute and analyze functions¶
The compute and analyze functions now accept a new parameter, preprocessing_kwargs, of type dict, which enables preprocessing. A valid preprocessing_kwargs has the following structure:
{
# The window size of type int to compute the mean/median/minimum/
# maximum value. The default is 4.
'window': 4,
# A string indicating the data imputation method, which should be
# 'mean', 'median', 'min' or 'max'. The default is 'mean'.
'impute_method': 'mean',
# A string indicating the data imputation direction, which should be
# 'forward', 'fwd', 'f', 'backward', 'bwd', 'b'. If the direction is
# forward, we use previous data for imputation; if the direction is
# backward, we use subsequent data for imputation.
# The default is 'forward'.
'impute_direction': 'forward',
# A boolean value indicating whether noise needs to be added into the
# time series. The default is True.
'add_noise': True
}
With preprocessing_kwargs defined, we can enable preprocessing to impute the missing data in the time series. Let's return to the time series created in the previous section and see how the results change when we pass preprocessing_kwargs to the compute and analyze functions.
preprocessing_kwargs = {
'window': 3,
'impute_method': 'max',
'impute_direction': 'forward',
'add_noise': False
}
# Compute and analyze the MatrixProfile
profile = mp.compute(ts, windows=32, preprocessing_kwargs=preprocessing_kwargs)
print(profile['mp'][:120])
profile, figures = mp.analyze(ts, windows=32, preprocessing_kwargs=preprocessing_kwargs)
In this example, we use the maximum value in the sliding window for imputation. As the output and the figures above show, once preprocessing imputes the missing data, the matrix profile no longer contains constant intervals and behaves as expected.
In addition, the motifs and discords are consistent with those shown in the Quickstart Guide, and the matrix profile in this example does not differ significantly from that of the original time series.
Invoke the preprocess module directly¶
If you only want to preprocess your time series without computing or analyzing a matrix profile, you can invoke the preprocess module directly. The module provides the following features:
- Preprocessing
- Data imputation
- Adding noise
- Constant Value Detection
In the following sections, we will illustrate each one of them with examples.
matrixprofile.preprocess.preprocess¶
Description¶
Imputes missing data in time series, and adds noise if the data within the sliding window are constant.
Parameters¶
Parameters | Type | Description |
---|---|---|
ts | array_like | The time series to be preprocessed. |
window | int | The window size to compute the mean/median/minimum value/maximum value. |
impute_method | string, default = 'mean' | A string indicating the data imputation method, which should be 'mean', 'median', 'min' or 'max'. |
impute_direction | string, default = 'forward' | A string indicating the data imputation direction, which should be 'forward', 'fwd', 'f', 'backward', 'bwd', 'b'. If the direction is forward, we use previous data for imputation; if the direction is backward, we use subsequent data for imputation. |
add_noise | bool, default = True | A boolean value indicating whether noise needs to be added into the time series. |
Returns¶
Returns | Type | Description |
---|---|---|
temp | array_like | The preprocessed time series. |
Examples¶
from matrixprofile.preprocess import preprocess
ts = np.array([np.nan, np.inf, np.nan, 2, 3, 4, 5, np.nan,
np.inf, np.inf, np.nan, 1, 1, 1, 1])
preprocess(ts, window=4, impute_method='mean', impute_direction='fwd', add_noise=True)
ts = np.array([1, 1, 1, 1, np.nan, np.inf, 6, 4, 5, np.inf, np.nan, 2, 1])
preprocess(ts, window=3, impute_method='median', impute_direction='b', add_noise=True)
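The second half of the description, adding noise where the data within the sliding window are constant, can be illustrated with a NumPy-only sketch. It is an approximation under stated assumptions, not the library's implementation: the helper name add_noise_to_constant_windows, the noise range, and the fixed seed are all hypothetical, and the input is assumed to be already free of NaN and inf.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def add_noise_to_constant_windows(ts, window, seed=0):
    """Add tiny noise to every point that lies inside a constant sliding window."""
    out = np.asarray(ts, dtype=float).copy()
    windows = sliding_window_view(out, window)
    # A window is constant when its minimum equals its maximum
    constant_starts = np.flatnonzero(windows.min(axis=1) == windows.max(axis=1))
    in_constant_window = np.zeros(out.size, dtype=bool)
    for start in constant_starts:
        in_constant_window[start:start + window] = True
    rng = np.random.default_rng(seed)
    # Assumed noise scale: small enough to leave the shape of the series intact
    out[in_constant_window] += rng.uniform(1e-7, 1e-6, in_constant_window.sum())
    return out

ts = np.array([2, 3, 4, 5, 1, 1, 1, 1], dtype=float)
result = add_noise_to_constant_windows(ts, window=4)
print(result)
```

Only the trailing run of ones is perturbed; the varying part of the series is returned unchanged.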
matrixprofile.preprocess.impute_missing¶
Description¶
Imputes missing data in time series.
Parameters¶
Parameters | Type | Description |
---|---|---|
ts | array_like | The time series to be handled. |
window | int | The window size to compute the mean/median/minimum value/maximum value. |
method | string, default = 'mean' | A string indicating the data imputation method, which should be 'mean', 'median', 'min' or 'max'. |
direction | string, default = 'forward' | A string indicating the data imputation direction, which should be 'forward', 'fwd', 'f', 'backward', 'bwd', 'b'. If the direction is forward, we use previous data for imputation; if the direction is backward, we use subsequent data for imputation. |
Returns¶
Returns | Type | Description |
---|---|---|
temp | array_like | The time series with the missing data imputed. |
Examples¶
from matrixprofile.preprocess import impute_missing
ts = np.array([np.nan, np.inf, np.nan, 2, 3, 4, 5, np.nan,
np.inf, np.inf, np.nan, 1, 1, 1, 1])
impute_missing(ts, window=4, method='mean', direction='f')
ts = np.array([1, 1, 1, 1, np.nan, np.inf, 6, 4, 5, np.inf, np.nan, 2, 1])
impute_missing(ts, window=3, method='median', direction='b')
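To make the window, method and direction parameters concrete, here is a minimal NumPy sketch of window-based imputation. It is one possible reading of the parameter descriptions above, not the code of impute_missing itself. The name impute_sketch is hypothetical, and two behaviours are assumptions: already imputed values are reused when filling later gaps, and a missing value with no finite neighbour in its window is left untouched.

```python
import numpy as np

_REDUCERS = {'mean': np.mean, 'median': np.median, 'min': np.min, 'max': np.max}

def impute_sketch(ts, window, method='mean', direction='forward'):
    """Replace NaN/inf with a statistic of up to `window` neighbouring values."""
    if direction in ('forward', 'fwd', 'f'):
        forward = True
    elif direction in ('backward', 'bwd', 'b'):
        forward = False
    else:
        raise ValueError('unknown direction: {}'.format(direction))
    reducer = _REDUCERS[method]
    out = np.asarray(ts, dtype=float).copy()
    n = len(out)
    # Forward uses previous data, so walk left to right; backward is the mirror
    order = range(n) if forward else range(n - 1, -1, -1)
    for i in order:
        if np.isfinite(out[i]):
            continue
        neighbours = out[max(0, i - window):i] if forward else out[i + 1:i + 1 + window]
        valid = neighbours[np.isfinite(neighbours)]
        if valid.size:  # assumption: leave the value as-is when nothing is usable
            out[i] = reducer(valid)
    return out

filled = impute_sketch(np.array([1, 2, 3, np.nan, np.inf, 6]), window=3)
back = impute_sketch(np.array([1, np.nan, 4, 5]), window=3, method='max', direction='b')
print(filled)
print(back)
```

In the forward example the first gap is filled with the mean of 1, 2 and 3, and the second gap with the mean of 2, 3 and the value just imputed.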
matrixprofile.preprocess.add_noise_to_series¶
Description¶
Adds noise to the given time series.
Parameters¶
Parameters | Type | Description |
---|---|---|
series | array_like | The time series subsequence to which noise is added. |
Returns¶
Returns | Type | Description |
---|---|---|
temp | array_like | The time series subsequence with noise added. |
Examples¶
from matrixprofile.preprocess import add_noise_to_series
ts = np.array([1, 1, 1, 1, 1])
add_noise_to_series(ts)
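Noise matters because matrix profile distances are typically computed on z-normalized subsequences, and a constant subsequence has a standard deviation of zero. The following NumPy-only sketch shows the division-by-zero problem and how a small perturbation avoids it. The z_normalize helper and the noise range are assumptions chosen for illustration, not the values used by add_noise_to_series.

```python
import numpy as np

def z_normalize(x):
    """Subtract the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

constant = np.ones(5)

# The standard deviation is zero, so every element becomes 0 / 0
with np.errstate(invalid='ignore', divide='ignore'):
    broken = z_normalize(constant)

# Assumed noise scale, far below the scale of the data
rng = np.random.default_rng(0)
noisy = constant + rng.uniform(1e-7, 1e-6, constant.size)
fixed = z_normalize(noisy)

print(broken)
print(fixed)
```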
matrixprofile.preprocess.is_subsequence_constant¶
Description¶
Determines whether the given time series subsequence is an array of constants.
Parameters¶
Parameters | Type | Description |
---|---|---|
subsequence | array_like | The time series subsequence to analyze. |
Returns¶
Returns | Type | Description |
---|---|---|
is_constant | bool | A boolean value indicating whether the given subsequence is an array of constants. |
Examples¶
from matrixprofile.preprocess import is_subsequence_constant
ts = np.array([1, 1, 1, 1, 1])
is_subsequence_constant(ts)
ts = np.array([1, 2, 1, 1, 1, 1])
is_subsequence_constant(ts)
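A constant check of this kind can be written in one line of NumPy. The sketch below is a hypothetical equivalent named is_constant_sketch, not the library's source. It assumes a non-empty subsequence and, because NaN never compares equal to itself, reports any subsequence containing NaN as not constant.

```python
import numpy as np

def is_constant_sketch(subsequence):
    """Return True when every element equals the first one."""
    values = np.asarray(subsequence)
    return bool(np.all(values == values[0]))

flat = is_constant_sketch(np.array([1, 1, 1, 1, 1]))
bumpy = is_constant_sketch(np.array([1, 2, 1, 1, 1, 1]))
print(flat, bumpy)
```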
Limitations of the current data imputation method¶
If a time series contains a long run of consecutive missing values, the current imputation method cannot predict those values accurately. As the figure below shows, when a long run of consecutive values is missing (indices 99 through 199 in the code below), the dispersion of the imputed data is fairly low. That is because each imputed value depends only on the other available data within the window. Using data farther away from the gap would require more sophisticated algorithms that learn the overall pattern of the time series or the long-range dependencies between data points, which is beyond the capability of the current implementation.
# Load Data
dataset = mp.datasets.load('motifs-discords-small')
ts = dataset['data']
# Add missing data to the original time series
ts[99:200] = np.nan
ts = preprocess(ts, window=100)
plt.figure(figsize=(20,6))
plt.plot(ts,'g')
plt.show()
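The low dispersion can be reproduced without the library. The sketch below fills a gap of the same size in a synthetic sine wave with a forward rolling mean and compares the standard deviation of the imputed stretch with that of the true values. The function forward_mean_fill is a hypothetical stand-in that assumes imputed values are reused for later points; it is not the preprocess implementation.

```python
import numpy as np

def forward_mean_fill(ts, window):
    """Fill each non-finite value with the mean of up to `window` previous values."""
    out = np.asarray(ts, dtype=float).copy()
    for i in range(len(out)):
        if not np.isfinite(out[i]):
            valid = out[max(0, i - window):i]
            valid = valid[np.isfinite(valid)]
            if valid.size:
                out[i] = valid.mean()
    return out

# Synthetic series standing in for the sample dataset (assumption)
original = np.sin(np.linspace(0, 8 * np.pi, 400))
gapped = original.copy()
gapped[99:200] = np.nan
filled = forward_mean_fill(gapped, window=100)

original_std = original[99:200].std()
filled_std = filled[99:200].std()
print(original_std, filled_std)
```

Averaging smooths away the oscillation, so the imputed stretch varies much less than the data it replaces.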