Time-series of Information Technology Operating Analytics – Part 2




In Part 2 of the Time-series of Information Technology Operating Analytics series, we will explore anomaly detection and the role it has played in my research and will play in future articles. Anomaly detection, also known as outlier analysis, is a branch of data mining. The technique learns from large datasets and identifies data points, events, and observations that deviate from a dataset’s normal behavior. After the algorithm detects an anomaly, the system receives an alert so that the anomaly can be labeled or extracted.


Anomaly Detection Between Value and Time


A classical alerting mechanism is a very manual process: you specify a metric, a threshold, a period, and an action to trigger if the metric continuously exceeds the threshold over the specified period. If anomalies can instead be detected in time-series data from historical information, the thresholds used to trigger the alerts can be dynamic. The intention is to detect outliers in the data: for example, if the CPU utilization on a virtual machine (VM) is very high during a period in which it is normally idle, an alert fires. This reduces the need for manual rules.
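To make the contrast concrete, here is a minimal sketch of a static threshold next to a dynamic one derived from recent history. It is an illustration under stated assumptions, not the method used in this project: the function names, the window of 12 samples, and the multiplier `k = 3` are all hypothetical choices, and the rolling mean plus standard deviation rule is just one common way to build a dynamic threshold.

```python
from collections import deque
from statistics import mean, stdev


def static_threshold_alerts(values, threshold=90.0):
    """Classical rule: alert whenever a value exceeds a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def dynamic_threshold_alerts(values, window=12, k=3.0, min_std=1e-9):
    """Alert when a value exceeds mean + k * std of the previous `window` values.

    `window`, `k` and `min_std` are illustrative parameters (assumptions).
    """
    history = deque(maxlen=window)
    alerts = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            # Guard against a zero deviation on a perfectly flat history.
            sigma = max(stdev(history), min_std)
            if v > mu + k * sigma:
                alerts.append(i)
        history.append(v)
    return alerts


# Made-up CPU utilization (%) for a VM that is normally idle.
cpu = [5.0, 6.0, 5.0, 4.0, 5.0, 6.0, 5.0, 4.0,
       5.0, 6.0, 5.0, 4.0, 5.0, 6.0, 60.0, 5.0]

static_hits = static_threshold_alerts(cpu)
dynamic_hits = dynamic_threshold_alerts(cpu)
```

With this made-up data the jump to 60% never crosses the fixed 90% threshold, while the dynamic rule flags it because it is far outside what the preceding samples look like. One known weakness of this simple version is that the spike itself enters the history and inflates the threshold for the following samples.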

Project focus

In this anomaly detection work, the research target is computer server data queried from InfluxDB, which collects the time-series data of MosaicOA (an IT operations tool that monitors servers). The features, or parameters, are server statistics such as CPU utilization, memory usage, and ingress and egress stats for a messaging bus. Correlated data features are stored as fields under various “Measurements” (1).

A “Managed Entity” is a term for a logical collection of managed variables that have been configured by the system administrator (2). The aim is to identify the anomaly points among these features by using a series of detection algorithms, trained on data stored in the InfluxDB database that records the metrics of a set of servers.

Anomaly detection for time-series


Three types of anomalies are expected to be detected: Ramp-up/down, Mean Shift, and Pulse. Ramp-up/down describes a rise or drop of the data away from a relatively stable trend for a short period, after which the data returns to the stable trend. Mean Shift represents a change of the stable level; it is similar to a step signal that jumps from one stable value to another. Pulses are spikes with extreme values distributed among the raw data. Once a Pulse’s magnitude and duration exceed the designed thresholds, it is labeled. The target of anomaly detection here is to label all the anomalies or outliers of these three types.
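As a rough illustration of two of these definitions, the sketch below labels Pulses using a magnitude threshold and a minimum duration, and locates a Mean Shift by comparing the average level before and after each point. The function names, the median baseline, and all thresholds are assumptions made for this example and are not the algorithms developed in the project; Ramp-up/down is left out to keep the sketch short. The Pulse check assumes the series has a single stable level, so it would need a local baseline on data that also contains a Mean Shift.

```python
from statistics import mean, median


def find_pulses(values, magnitude, min_duration=1):
    """Return (start, end) index pairs, end exclusive, of runs that deviate
    from the series median by more than `magnitude` for at least
    `min_duration` consecutive samples."""
    baseline = median(values)
    runs, start = [], None
    for i, v in enumerate(values):
        if abs(v - baseline) > magnitude:
            if start is None:
                start = i  # a new run of extreme values begins
        elif start is not None:
            if i - start >= min_duration:
                runs.append((start, i))
            start = None
    # Close a run that lasts until the end of the series.
    if start is not None and len(values) - start >= min_duration:
        runs.append((start, len(values)))
    return runs


def find_mean_shift(values, window, min_shift):
    """Return the index with the largest gap between the mean of the
    `window` samples before it and the `window` samples from it onward,
    or None if no gap exceeds `min_shift`."""
    best_idx, best_gap = None, min_shift
    for i in range(window, len(values) - window + 1):
        gap = abs(mean(values[i:i + window]) - mean(values[i - window:i]))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


# Made-up series: a two-sample spike, and a step from about 10 to about 30.
pulse_series = [10, 10, 11, 10, 50, 52, 10, 9, 10, 10]
shift_series = [10, 10, 11, 10, 10, 30, 31, 30, 30, 30]

pulses = find_pulses(pulse_series, magnitude=20, min_duration=2)
shift_at = find_mean_shift(shift_series, window=3, min_shift=10)
```

Raising `min_duration` above the length of the spike means it is no longer labeled, which mirrors the rule that both the magnitude and the duration must exceed their thresholds.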

Note that outliers are labeled based on the developed algorithm and the previous datasets, so a labeled point does not necessarily indicate a real malfunction of the operating system. For example, some changes of step level (the Mean Shift above) may be caused by the lack of activity outside trading hours. Planned changes to the software or machine configuration may produce a similar result. So, after these anomalies are flagged, more advanced detection and analysis may be needed.

This will be discussed in my next article as part of the innovation sandbox.