Computes the Matrix Profile and Profile Index

This is a wrapper function that makes it easy to use all the available algorithms to compute the Matrix Profile and Profile Index for multiple purposes.

tsmp(
  ...,
  window_size,
  exclusion_zone = getOption("tsmp.exclusion_zone", 1/2),
  mode = c("stomp", "stamp", "simple", "mstomp", "scrimp", "valmod", "pmp"),
  verbose = getOption("tsmp.verbose", 2),
  n_workers = 1,
  s_size = Inf,
  must_dim = NULL,
  exc_dim = NULL,
  heap_size = 50,
  paa = 1,
  .keep_data = TRUE
)

Arguments

...: a matrix or a vector. If a second time series is supplied, a join matrix profile is computed (except for mstomp()).
window_size: an int with the size of the sliding window. Use a vector for Valmod.
exclusion_zone: a numeric. Size of the exclusion zone, based on window size (default is 1/2). See details.
mode: the algorithm that will be used to compute the matrix profile. (Default is stomp). See details.
verbose: an int. (Default is 2). See details.
n_workers: an int. Number of workers for parallel. (Default is 1).
s_size: a numeric. For anytime algorithms, the size (in observations) over which the randomized calculation runs (default is Inf). See details.
must_dim: an int or vector of which dimensions to forcibly include (default is NULL). See details.
exc_dim: an int or vector of which dimensions to exclude (default is NULL). See details.
heap_size: an int. (Default is 50). Size of the distance profile heap buffer.
paa: an int. (Default is 1). Factor of PAA reduction (2 == half of size)
.keep_data: a logical. (Default is TRUE). Keeps the data embedded in the resulting object.
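
The paa factor can be pictured with a small base R sketch of Piecewise Aggregate Approximation: each block of `factor` consecutive observations is replaced by its mean, so a factor of 2 halves the length. The helper name `paa_sketch` is hypothetical, and dropping the trailing remainder is an assumption made for illustration; it is not necessarily what this package does internally.

```r
# Sketch only: a generic PAA reduction in base R (hypothetical helper).
paa_sketch <- function(x, factor) {
  n_out <- floor(length(x) / factor)   # number of reduced points
  x <- x[seq_len(n_out * factor)]      # assumption: drop any trailing remainder
  colMeans(matrix(x, nrow = factor))   # one mean per block of `factor` values
}

reduced <- paa_sketch(1:8, 2)
reduced  # 1.5 3.5 5.5 7.5
```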

Value

Returns the matrix profile mp and profile index pi. It also returns the left and right matrix profile lmp, rmp and profile index lpi, rpi that may be used to detect Time Series Chains. mstomp() returns a multidimensional Matrix Profile.

Details

The Matrix Profile has the potential to revolutionize time series data mining because of its generality, versatility, simplicity and scalability. In particular, it has implications for time series motif discovery, time series joins, shapelet discovery (classification), density estimation, semantic segmentation, visualization, rule discovery, clustering, etc.

The first algorithm invented was stamp(), which uses mass() (an ultra-fast algorithm for similarity search) to compute the Matrix Profile in reasonable time. One of its main features is its Anytime property: using a randomized approach, it can return a "best-so-far" matrix profile that gives the correct answer almost every time (using, for example, 1/10 of all iterations).
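
The Anytime idea can be illustrated with a brute-force sketch in base R. It does not use mass() or any function from this package; the helpers `znorm`, `naive_dist_profile` and `anytime_profile` are hypothetical names, and the z-normalized Euclidean distance and the half-window exclusion zone are assumptions chosen for illustration. Subsequences are visited in random order, and the profile after a fraction of the iterations is compared with the complete one.

```r
# Sketch only: brute-force base R, not the package implementation.
znorm <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))

# Distance from subsequence `i` to every subsequence, with an exclusion zone
naive_dist_profile <- function(x, i, m, ez) {
  n_sub <- length(x) - m + 1
  q <- znorm(x[i:(i + m - 1)])
  d <- vapply(seq_len(n_sub), function(j) {
    sqrt(sum((q - znorm(x[j:(j + m - 1)]))^2))
  }, numeric(1))
  d[max(1, i - ez):min(n_sub, i + ez)] <- Inf  # avoid trivial matches
  d
}

# Process the given subsequence indexes in order, keeping the best-so-far minimum
anytime_profile <- function(x, m, order, ez = round(m / 2)) {
  n_sub <- length(x) - m + 1
  mp <- rep(Inf, n_sub)
  for (i in order) mp <- pmin(mp, naive_dist_profile(x, i, m, ez))
  mp
}

set.seed(42)
x <- sin(seq(0, 20 * pi, length.out = 300)) + rnorm(300, sd = 0.1)
m <- 30
n_sub <- length(x) - m + 1
idx <- sample(n_sub)                           # random visiting order

exact   <- anytime_profile(x, m, idx)          # all iterations
partial <- anytime_profile(x, m, idx[1:27])    # roughly 1/10 of the iterations

all(partial >= exact)       # the partial result never underestimates
mean(abs(partial - exact))  # how far the best-so-far profile is from the exact one
```

The partial profile is always an upper bound of the exact one, because it is a minimum taken over a subset of the same distance profiles.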

The next algorithm was stomp(), which is currently the most used. Researchers noticed that the dot products do not need to be recalculated from scratch for each subsequence. Instead, we can reuse the values calculated for the first subsequence to make a faster calculation in the next iterations. The idea is to make use of the intersections between the required products in consecutive iterations. This approach reduced the time to compute the Matrix Profile to about 3% of the time taken by stamp(), but on the other hand, we lost the Anytime property.
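
The reuse of dot products between consecutive iterations can be checked with a short base R sketch. It is an illustration under simple assumptions (a self-join on a random toy series; `dot_row` is a hypothetical helper), not the package code: each new row of sliding dot products is obtained from the previous one by dropping the oldest product and adding the newest, and the result is compared with a direct recomputation.

```r
# Sketch only: base R, no package functions are used.
set.seed(1)
x <- rnorm(50)                 # toy series
m <- 8                         # window size
n_sub <- length(x) - m + 1     # number of subsequences

# Direct dot products of subsequence `i` against every subsequence
dot_row <- function(x, i, m) {
  n_sub <- length(x) - m + 1
  q <- x[i:(i + m - 1)]
  vapply(seq_len(n_sub), function(j) sum(q * x[j:(j + m - 1)]), numeric(1))
}

first_row <- dot_row(x, 1, m)  # also the first column, by symmetry
prev <- first_row
max_err <- 0
for (i in 2:n_sub) {
  cur <- numeric(n_sub)
  # Reuse the previous row: drop the oldest product, add the newest one
  cur[2:n_sub] <- prev[1:(n_sub - 1)] -
    x[1:(n_sub - 1)] * x[i - 1] +
    x[(m + 1):length(x)] * x[i + m - 1]
  cur[1] <- first_row[i]       # product with the first subsequence is already known
  max_err <- max(max_err, max(abs(cur - dot_row(x, i, m))))
  prev <- cur
}
max_err  # difference from direct recomputation, at floating-point noise level
```

Each update costs a handful of vector operations instead of a full set of dot products, which is where the speed-up comes from; the price is that rows must be processed in order, so the random visiting order behind the Anytime property is no longer available.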

Currently there is a new algorithm that I'll not explain further here. It is called scrimp(); it is as fast as stomp() and has the Anytime property. This algorithm is implemented in this package, but it is still waiting for an article publication.

Further, there is mstomp(), which computes a multidimensional Matrix Profile that allows meaningful motif discovery in multivariate time series, and simple_fast(), which also handles multivariate time series but is focused on music analysis and exploration.

valmod() uses a new pruning algorithm that allows a similarity search over a range of sliding window sizes.

pmp() is a new concept that creates several profiles from a range of window sizes.

Some parameters are global across the algorithms:

...: One or two time series (except for mstomp()). The second time series can be smaller than the first.
window_size: The sliding window.
exclusion_zone: Used to avoid trivial matches; if query data is provided (join similarity), this parameter is ignored.
verbose: Changes how much information is printed by this function; 0 means nothing, 1 means text, 2 adds the progress bar, 3 adds the finish sound.
n_workers: number of threads for parallel computing (except simple_fast, scrimp and valmod). If the value is 2 or more, the '_par' version of the algorithm will be used.

s_size is used only in Anytime algorithms: stamp() and scrimp(). must_dim and exc_dim are used only in mstomp(). heap_size is used only for valmod(). mode can be any of the following: stomp, stamp, simple, mstomp, scrimp, valmod, pmp.
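
As a rough illustration of how exclusion_zone relates to window_size, the base R sketch below masks the neighbours of one subsequence in a dummy distance profile. It assumes the zone is the fraction multiplied by the window size and rounded to whole observations; the variable names are illustrative only.

```r
# Sketch only: how a fractional exclusion zone maps to positions.
window_size <- 30
exclusion_zone <- 1 / 2
ez <- round(window_size * exclusion_zone)  # assumption: rounded fraction of the window

profile_size <- 200 - window_size + 1      # subsequences in a series of 200 observations
idx <- 100                                 # subsequence being compared
dist_pro <- rep(1, profile_size)           # dummy distance profile

exc_st <- max(1, idx - ez)
exc_ed <- min(profile_size, idx + ez)
dist_pro[exc_st:exc_ed] <- Inf             # trivial matches can never be the minimum

range(which(is.infinite(dist_pro)))        # 85 115
```

With the default of 1/2 and a window of 30, the 15 subsequences on each side of position 100 (and position 100 itself) are excluded.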

References

Silva D, Yeh C, Batista G, Keogh E. Simple: Assessing Music Similarity Using Subsequences Joins. Proc 17th ISMIR Conf. 2016;23-30.

Silva DF, Yeh C-CM, Zhu Y, Batista G, Keogh E. Fast Similarity Matrix Profile for Music Analysis and Exploration. IEEE Trans Multimed. 2018;14(8):1-1.

Yeh CM, Kavantzas N, Keogh E. Matrix Profile VI : Meaningful Multidimensional Motif Discovery.

Yeh CCM, Zhu Y, Ulanova L, Begum N, Ding Y, Dau HA, et al. Matrix profile I: All pairs similarity joins for time series: A unifying view that includes motifs, discords and shapelets. Proc - IEEE Int Conf Data Mining, ICDM. 2017;1317-22.

Zhu Y, Imamura M, Nikovski D, Keogh E. Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. Knowl Inf Syst. 2018 Jun 2;1-27.

Zhu Y, Zimmerman Z, Senobari NS, Yeh CM, Funning G. Matrix Profile II : Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. Icdm. 2016 Jan 22;54(1):739-48.

Website: https://sites.google.com/view/simple-fast

Website: https://sites.google.com/site/ismir2016simple/home

Website: http://www.cs.ucr.edu/~eamonn/MatrixProfile.html

Examples

# default with [stomp()]
mp <- tsmp(mp_toy_data$data[1:200, 1], window_size = 30, verbose = 0)

# Anytime STAMP
mp <- tsmp(mp_toy_data$data[1:200, 1], window_size = 30, mode = "stamp", s_size = 50, verbose = 0)

# [mstomp()]
mp <- tsmp(mp_toy_data$data[1:200, ], window_size = 30, mode = "mstomp", verbose = 0)

# [simple_fast()]
mp <- tsmp(mp_toy_data$data[1:200, ], window_size = 30, mode = "simple", verbose = 0)
# parallel with [stomp_par()]
mp <- tsmp(mp_test_data$train$data[1:1000, 1], window_size = 30, n_workers = 2, verbose = 0)