This Multivariate Time Series Clustering project follows the development of a Long Short-Term Memory (LSTM) model, built as part of T-DAB’s Innovation Sandbox, to predict the rudder movements that a sailor would make during a race.
If you’re familiar with the Innovation Sandbox and the Jack Trigger Racing (JTR) Project, you’ll know that earlier work, where we outlined the technical challenges and noted that, depending on the conditions, a sailor won’t always sail the boat in the same way. The data used in this project was provided by Jack Trigger from his single-handed races; it is a large, high-dimensional time series.
For this project, I will explore different sailing modes so that the LSTM model can be trained with more information. From a data science point of view, the main task is to cluster the sailing time series data. The data is multivariate, high-dimensional and large, which makes it much more complicated to analyze than a univariate time series. Given this, the critical steps in the project are as follows:
- Time-series segmentation,
- Dimensionality reduction,
- Pattern discovery,
- Cluster analysis.
Time Series Segmentation
Because a time series is continuous and is treated as a whole rather than as a set of individual objects, a segmentation technique is needed to discretize the sequence. Many segmentation techniques appear in the literature, such as the fixed-length sliding window approach, perceptually important points (PIP), minimum message length (MML), minimum description length (MDL), and sliding window and bottom-up (SWAB). In this project, we will use an overlapping sliding window approach, which captures the dynamic behaviour of the time series while keeping the computational cost manageable.
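As a minimal sketch of overlapping sliding-window segmentation, the snippet below cuts a multivariate series into fixed-length windows. The function name `sliding_windows`, the window length, the step size and the synthetic data are all assumptions made for illustration; they are not the settings used in the project.

```python
import numpy as np

def sliding_windows(series, window, step):
    """Split a (n_samples, n_features) array into fixed-length windows.

    Returns an array of shape (n_windows, window, n_features).
    Consecutive windows overlap whenever step < window.
    """
    series = np.asarray(series)
    if series.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_samples, n_features)")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if window > len(series):
        raise ValueError("window is longer than the series")
    starts = range(0, len(series) - window + 1, step)
    return np.stack([series[s:s + window] for s in starts])

# Synthetic stand-in for the sailing data: 100 time steps, 3 channels.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))

# Hypothetical settings: 20-step windows advanced by 10 steps (50% overlap).
segments = sliding_windows(data, window=20, step=10)

# Flatten each window into one row so it can be fed to later steps.
flat = segments.reshape(len(segments), -1)
print(segments.shape, flat.shape)
```

With these settings the 100-step series yields 9 windows, and each flattened row holds 20 × 3 = 60 values. A smaller step gives more overlap and more windows, at a higher computational cost.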
Next, we can apply dimensionality reduction to the discretized time series data set. At the moment, there are three approaches that we prefer to use:
- Principal Component Analysis (PCA),
- t-distributed Stochastic Neighbor Embedding (t-SNE),
- Multidimensional Scaling (MDS).
PCA is a linear technique that has been widely used in recent papers. It is also parameter-free and easy to perform in Python. t-SNE and MDS are both nonlinear manifold learning approaches, but they work in quite different ways. MDS seeks a low-dimensional representation in which the pairwise distances closely match those in the original high-dimensional space, while t-SNE tries to preserve neighbourhood information, that is, which points are neighbours of each other. Different techniques will therefore give the data a different shape. Moreover, the choice of normalization method (e.g., L2-norm, Z-normalization, MinMaxScaler) will also affect the structure of the resulting low-dimensional data.
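To make the normalization and projection steps concrete, here is a small sketch of Z-normalization followed by PCA, written with NumPy's SVD rather than any particular library wrapper. The helper names `z_normalize` and `pca_project` and the random input are assumptions for illustration only. MDS and t-SNE are not shown.

```python
import numpy as np

def z_normalize(X):
    """Scale each column to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # leave constant columns unscaled
    return (X - mean) / std

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD.

    Returns the low-dimensional scores and the explained-variance
    ratio of each retained component.
    """
    Xc = X - X.mean(axis=0)                      # PCA requires centred data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions
    explained = S**2 / np.sum(S**2)              # variance share per component
    return Xc @ components.T, explained[:n_components]

# Synthetic stand-in: 50 flattened windows with 60 values each.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 60))

embedding, ratio = pca_project(z_normalize(X), n_components=2)
print(embedding.shape, ratio)
```

Swapping `z_normalize` for another scaling changes which directions carry the most variance, which is one way to see why the normalization choice affects the shape of the low-dimensional data.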
Once the low-dimensional (2D/3D) data is obtained, we can carry out the cluster analysis. In our project this is an unsupervised problem: we don’t know the right number of clusters, i.e., the number of sailing modes, in advance. We therefore expect to use an analytical method to estimate it, for instance a resampling-based consensus clustering algorithm. The basic idea is to use a resampling scheme to introduce perturbations into the dataset and then assess the stability of the clusters discovered in the perturbed data, which gives us the optimal number of clusters at the end of the iterations. The algorithms involved would be subsampling and bootstrapping for resampling, and K-Means, DBSCAN and CLARANS for clustering.
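The resampling idea can be sketched as follows, assuming subsampling without replacement and a plain K-Means written in NumPy. For each candidate number of clusters, a consensus matrix records how often two points land in the same cluster when both are sampled. Stability is then scored with the proportion of ambiguous clustering (PAC), which is one possible criterion among several and not necessarily the one used in the project. All function names, thresholds and the synthetic data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):             # keep the old centre if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def consensus_matrix(X, k, n_resamples=30, frac=0.8, seed=0):
    """Fraction of resamples in which each pair of points is co-clustered."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))          # times a pair shared a cluster
    sampled = np.zeros((n, n))           # times a pair was sampled together
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

def pac(consensus, low=0.1, high=0.9):
    """Proportion of pairs whose consensus is ambiguous (lower is more stable)."""
    vals = consensus[np.triu_indices_from(consensus, k=1)]
    return float(np.mean((vals > low) & (vals < high)))

# Synthetic 2-D data: three blobs standing in for the reduced sailing data.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centres])

scores = {k: pac(consensus_matrix(X, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)     # candidate with the least ambiguity
print(scores, best_k)
```

Bootstrapping would replace the subsampling line with sampling with replacement, and DBSCAN or CLARANS could be substituted for `kmeans` as long as the function returns one label per sampled point.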
Testing on LSTM
Lastly, we will label the derived clusters (for instance, tacking and jibing), then apply the LSTM model to the different groups to see how it performs. The output will be analyzed in the discussion and conclusion section, which will suggest directions for future projects.
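As a rough sketch of the grouping step, the snippet below partitions segmented windows by cluster label so that a model can be trained or evaluated per group. The function `split_by_cluster`, the placeholder mode names and the synthetic arrays are hypothetical; the LSTM itself is not shown.

```python
import numpy as np

def split_by_cluster(segments, labels):
    """Group windows by cluster label: {label: array of windows}."""
    segments = np.asarray(segments)
    labels = np.asarray(labels)
    if len(segments) != len(labels):
        raise ValueError("one label per window is required")
    return {int(c): segments[labels == c] for c in np.unique(labels)}

# Synthetic example: 6 windows of 20 steps x 3 channels, two clusters.
segments = np.arange(6 * 20 * 3, dtype=float).reshape(6, 20, 3)
labels = np.array([0, 1, 0, 0, 1, 1])

# Hypothetical names; in practice they would be assigned by inspecting each cluster.
mode_names = {0: "mode_a", 1: "mode_b"}

groups = split_by_cluster(segments, labels)
for c, windows in groups.items():
    print(mode_names[c], windows.shape)
```

Each entry of `groups` can then be passed to a separate training or evaluation run, which makes it possible to compare model performance across sailing modes.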