In this fourth and final blog of the Time Series for Information Technology Operating Analytics series, I will showcase in detail the anomaly detection algorithms used in my research, the results obtained, and my suggestions for further research.
First of all, let’s review the series so far. In the first article, I introduced the concept of time series and its powerful applications; in the second, I discussed the research target and the problem to be solved for ITOA. In the third article, I gave a literature review covering several methodologies. In this final Part 4, I will present my exact detection approaches.
Data science is a fascinating area for me, and I hope that by the end of this series you will be inspired to explore it further and enjoy the journey.
Algorithms & Results
The algorithms designed for pulse detection, mentioned in the previous blog, are shown as follows. In this blog, this branch of the algorithm is called the dynamic statistical algorithm.
The integrated algorithm detects pulse anomalies on a rolling-time basis, illustrated in the following image.
It shows how a window of past data determines the boundaries for the latest incoming point. The length of this time window is fixed throughout the algorithm and is computed in Python.
The data within the time window serves as a training set for the machine learning algorithm, which enables dynamic detection of pulse anomalies.
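The rolling scheme above can be sketched as a simple dynamic thresholding rule. This is a minimal illustration, not the exact production code; the window length and the 3-sigma band are assumed parameters:

```python
import numpy as np
import pandas as pd

def detect_pulses(series, window=60, n_sigma=3.0):
    """Flag points outside rolling mean +/- n_sigma * rolling std.

    Only the past `window` points set the boundaries for each new point,
    mirroring the rolling scheme described above."""
    roll = series.rolling(window=window, min_periods=window)
    mean = roll.mean().shift(1)   # shift so each point is judged by past data only
    std = roll.std().shift(1)
    upper = mean + n_sigma * std
    lower = mean - n_sigma * std
    return (series > upper) | (series < lower)

# Usage with synthetic data and one injected pulse
rng = np.random.default_rng(0)
data = pd.Series(rng.normal(0, 1, 500))
data.iloc[300] = 12.0  # inject a pulse
anomalies = detect_pulses(data)
print(anomalies.iloc[300])  # → True
```

Because the mean and standard deviation are recomputed at every step, the band adapts as the series drifts, which is what makes the detection "dynamic".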
Alongside pulse detection, there is another class of anomaly: the breakout. Breakouts are further divided into two classes, ramp and mean-shift.
A ramp-up (or ramp-down) breakout describes a gradual rise (or fall) in the time-series data, while a mean-shift describes a significant change in the mean that persists at a new stable level. In visualizations, a ramp looks like an inclined slope, while a mean-shift looks like a step.
The method used to detect breakouts is E-Divisive with Medians (EDM), originally developed by Twitter Inc. to help refine user experience. In detail, EDM applies E-statistics to detect divergence in means; its underlying theory is based on the weighted L2-distance between the characteristic functions of random variables.
For the detailed algorithm, this project mainly referred to the open-source package written in C++ and R. SWIG is used to wrap the C++ code so that it can be called from Python.
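To build intuition for what a mean-shift breakout detector looks for, here is a deliberately simplified median-comparison scan. This is only an illustration of the idea, not Twitter's full EDM algorithm with E-statistics and penalized multiple-changepoint search:

```python
import numpy as np

def simple_breakout(series, min_size=30):
    """Toy mean-shift scan: for each candidate split point, measure the
    absolute difference of medians on either side and return the split
    with the largest difference. Medians make the scan robust to pulses."""
    x = np.asarray(series, dtype=float)
    best_t, best_stat = None, 0.0
    for t in range(min_size, len(x) - min_size):
        stat = abs(np.median(x[:t]) - np.median(x[t:]))
        if stat > best_stat:
            best_t, best_stat = t, stat
    return best_t, best_stat

# Synthetic series with a level shift at index 200
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 0.5, 200), rng.normal(4, 0.5, 200)])
t, stat = simple_breakout(x)
print(t)  # close to 200, where the level shifts
```

Using medians rather than means is the key design choice EDM also makes: a few extreme pulses barely move the median, so pulse anomalies do not masquerade as breakouts.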
The global detection algorithm integrates the pulse detection and breakout detection parts. It can be plugged into the online database to detect anomalies in the time-series data in real time.
Take the dataset named “Managed entity ADLG3991” as an example. The results from the global detection algorithm are shown as follows:
The grouped extreme values labelled with orange dots mark long-lasting pulses, and the red vertical lines locate the detected breakouts. The total number of anomalies is the sum of these two parts.
In this project, 42 managed entities (for the sake of argument, we could call these servers) were considered. The summarized distribution of detected anomalies is shown in the following heat map.
The x-axis shows the different servers’ datasets, which can be regarded as the features or variables of the global online system; the y-axis is the timeline. Four weeks in total were examined. This heat map is the superposition of pulse detection and breakout detection, and the number of detected anomalies is written in the centre of each cell.
Some features (servers) turn out to be critical because they produce many detected anomalies. With the help of this packaged global algorithm, engineers can focus their checks on ADLG3988, ADLG3991, PRD-WB01, PRD-VE01 and PRD-OA01, since these have the most anomalies.
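An annotated heat map of this kind takes only a few lines with matplotlib. The server names below come from this post, but the counts are hypothetical placeholders, not the project's real results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical anomaly counts: rows = weeks, columns = servers.
servers = ["ADLG3988", "ADLG3991", "PRD-WB01", "PRD-VE01", "PRD-OA01"]
weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
counts = np.array([[3, 7, 2, 5, 4],
                   [1, 9, 0, 6, 3],
                   [4, 8, 1, 2, 5],
                   [2, 6, 3, 4, 7]])

fig, ax = plt.subplots()
ax.imshow(counts, cmap="Reds")
ax.set_xticks(range(len(servers)))
ax.set_xticklabels(servers, rotation=45, ha="right")
ax.set_yticks(range(len(weeks)))
ax.set_yticklabels(weeks)
# Write the anomaly count in the centre of each cell.
for i in range(len(weeks)):
    for j in range(len(servers)):
        ax.text(j, i, counts[i, j], ha="center", va="center")
fig.tight_layout()
fig.savefig("anomaly_heatmap.png")
```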
There are various ways to evaluate machine learning techniques. For supervised learning, error metrics such as recall, precision and F1-score express the learning effect quantitatively, and learning curves and regularization can be used to visualize and address problems of bias and variance.
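For reference, the supervised metrics mentioned here can be computed directly from the confusion counts. This is a minimal sketch with toy labels (1 = anomaly):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary anomaly labels (1 = anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 true anomalies, 3 detected, 1 false alarm
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # → 0.75 0.75 0.75
```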
However, this project operates under unsupervised learning conditions, and the absence of labelled data makes evaluation far less straightforward.
One method, named cross-scoring, mentioned in Balabit’s User Behaviour Analytics (UBA) blog, is taken as a case study. The main idea is to test the scoring algorithm trained on one group of users against the activities of other groups with different behaviours.
This idea rests on the assumption that different groups have different characteristics; in other words, the method creates labelled data from the different learning groups.
This one-vs-rest concept (cross-scoring) for multiclass classification can be exemplified as: user A’s normal data serves as user B’s abnormal data. Standard supervised evaluation can then be applied. However, the drawback of this method is that it cannot assess the detection of genuinely new incoming anomalies.
That is because the algorithm trains on the datasets and produces results in an unsupervised manner, and then merely uses a supervised approach to verify those results.
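The cross-scoring idea can be sketched as follows. A simple 3-sigma band stands in for any unsupervised detector, and the two "user groups" are synthetic, hypothetical data invented for this illustration:

```python
import numpy as np

def fit_bounds(data, n_sigma=3.0):
    """Fit a simple 'normal' band from one group's data
    (an illustrative stand-in for any unsupervised detector)."""
    mu, sigma = np.mean(data), np.std(data)
    return mu - n_sigma * sigma, mu + n_sigma * sigma

def score(data, bounds):
    lo, hi = bounds
    return (data < lo) | (data > hi)   # True = flagged as anomalous

rng = np.random.default_rng(2)
group_a = rng.normal(0, 1, 1000)    # user A's behaviour
group_b = rng.normal(8, 1, 1000)    # user B behaves differently

bounds_a = fit_bounds(group_a)
# Cross-scoring: B's data is treated as "anomalous" for A's model, so
# B's detection rate plays the role of recall, and flags on A's own
# data play the role of false alarms, as in a supervised evaluation.
recall_on_b = score(group_b, bounds_a).mean()
false_alarm_on_a = score(group_a, bounds_a).mean()
print(recall_on_b, false_alarm_on_a)
```

Note how the labels are manufactured from group membership rather than from true incidents, which is exactly why this scheme cannot certify performance on genuinely new anomalies.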
Unsupervised evaluation remains an active, ongoing research field and is well worth exploring in depth.