Anomaly Detection

Meena Vyas
Jan 8, 2019

What is Anomaly Detection

In data science, anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

In the following figure, the anomaly is a spike (shown in red). But if the same spike occurs at regular intervals, it is not an anomaly.

There are three types of machine learning techniques:

  • Supervised machine learning
  • Unsupervised machine learning
  • Semi-supervised machine learning

Refer to https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ for more details.

Unsupervised anomaly detection techniques detect anomalies in an unlabeled data set. They assume that the majority of the instances in the data set are normal, and they look for the instances that fit the rest of the data set least well.

We need unsupervised anomaly detection when we don’t have labelled data, i.e. when no record in our data is marked as an anomaly or as normal.

Different types of Anomaly detection techniques are described below.

A safe bet is to use the wisdom of the crowd by running several methods as an ensemble. We can then combine the individual algorithms’ verdicts through majority vote, union or intersection.
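The three combination rules can be sketched in a few lines of NumPy. This is only an illustration: the function name `combine_verdicts` and the toy `flags` matrix are made up for this example, and each row is assumed to hold one detector's boolean verdict per sample (True = anomaly).

```python
import numpy as np

def combine_verdicts(verdicts, how="majority"):
    """Combine boolean anomaly flags of shape (n_detectors, n_samples)."""
    v = np.asarray(verdicts, dtype=bool)
    if how == "union":          # flagged by at least one detector
        return v.any(axis=0)
    if how == "intersection":   # flagged by every detector
        return v.all(axis=0)
    if how == "majority":       # flagged by more than half of the detectors
        return v.sum(axis=0) > v.shape[0] / 2
    raise ValueError(f"unknown rule: {how}")

# Hypothetical verdicts from three detectors on four samples.
flags = [
    [True, False, True, False],
    [True, False, False, False],
    [True, True, False, False],
]

for rule in ("majority", "union", "intersection"):
    print(rule, combine_verdicts(flags, rule))
```

Union flags the most points and intersection the fewest, so the choice depends on whether missed anomalies or false alarms are more costly.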

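Isolation Forest and LOF

  • LOF (Local Outlier Factor) is a nearest neighbour based approach that compares the local density of a point with that of its neighbours. Isolation Forest is tree based: it isolates points with random splits, and anomalies need fewer splits.
  • sklearn has IsolationForest and LocalOutlierFactor (LOF)
  • If the data is too big, there is an implementation of LOF for Spark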
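As a rough sketch of how the two sklearn estimators are used, the snippet below runs both on synthetic 2-D data. The data, the `contamination` value and the other parameter values are assumptions chosen for illustration, not settings taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 200 "normal" points around the origin plus 3 far-away points.
rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([X_normal, X_outliers])

# contamination is our guess at the fraction of anomalies in the data.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
iso_labels = iso.fit_predict(X)        # -1 = anomaly, 1 = normal
iso_scores = -iso.score_samples(X)     # negated so that higher = more anomalous

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X)        # -1 = anomaly, 1 = normal
lof_scores = -lof.negative_outlier_factor_  # higher = more anomalous

print("IsolationForest flagged:", np.where(iso_labels == -1)[0])
print("LOF flagged:", np.where(lof_labels == -1)[0])
```

Note that sklearn's own scores are lower for more abnormal points, which is why they are negated here.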

‘K’ Nearest Neighbour

  • This is a Nearest Neighbour based approach
  • Simply computing z-scores of the distances to the ‘k’ nearest neighbours and using a cutoff of 3 works surprisingly well in practice (though it is limited to global anomalies and can’t find local outliers).
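One possible reading of this approach is sketched below: take each point's mean distance to its k nearest neighbours, convert those distances to z-scores over the whole data set, and flag anything above the cutoff. The function name `knn_zscore_anomalies`, the choice of the mean distance and the synthetic data are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_zscore_anomalies(X, k=5, cutoff=3.0):
    # Ask for k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    mean_dist = distances[:, 1:].mean(axis=1)  # drop the zero self-distance
    # z-score of each point's mean k-NN distance across the whole data set.
    z = (mean_dist - mean_dist.mean()) / mean_dist.std()
    return z > cutoff, z

# Synthetic data: 200 normal points plus 3 far-away points at the end.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(200, 2)),
               [[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]]])

flags, z = knn_zscore_anomalies(X, k=5, cutoff=3.0)
print("flagged indices:", np.where(flags)[0])
```

Because the z-score uses a single global mean and standard deviation, a point that is unusual only relative to a dense local cluster will not stand out, which is the limitation mentioned above.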

One-class SVM

  • Classification based approach
  • One-class Support Vector Machine (OCSVM) can be used as an unsupervised anomaly detection method.
  • However, to work well, the percentage of anomalies in the dataset needs to be low.
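A minimal sketch with sklearn's OneClassSVM is shown below. It assumes the training data is unlabeled but almost entirely normal, and then scores two new points. The `nu` and `gamma` values and the synthetic data are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Unlabeled training data, assumed to be (almost) entirely normal.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(300, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
ocsvm.fit(X_train)

# One point in the middle of the training data and one far away from it.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
labels = ocsvm.predict(X_new)             # 1 = inlier, -1 = outlier
scores = ocsvm.decision_function(X_new)   # negative = outside the learned region

print(labels, scores)
```

If a large share of the training data is anomalous, the learned region grows to cover the anomalies too, which is why a low anomaly percentage matters.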

CBOF (Cohesiveness Based Outlier Factor)

It is a clustering based anomaly detection technique.

Deep Learning LSTM/Auto encoders

  • Neural network approach: RNN, LSTM (long short-term memory), autoencoders
  • Available in Keras/Tensorflow and other libraries
  • Typically neural networks need a lot of data

There are some more methods, such as probability based approaches using a multivariate Gaussian distribution, PCA and t-SNE.
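The multivariate Gaussian idea can be sketched as follows: fit a mean and covariance to data assumed to be mostly normal, then flag points whose squared Mahalanobis distance is improbably large (equivalently, whose density under the fitted Gaussian is very low). The helper names, the 0.999 quantile and the synthetic data are assumptions for this example.

```python
import numpy as np
from scipy.stats import chi2

def fit_gaussian(X):
    # Estimate the mean vector and covariance matrix of the data.
    return X.mean(axis=0), np.cov(X, rowvar=False)

def mahalanobis_sq(X, mu, cov):
    # Squared Mahalanobis distance of every row of X from the fitted Gaussian.
    diff = X - mu
    inv_cov = np.linalg.inv(cov)
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 2))   # assumed to be mostly normal data
mu, cov = fit_gaussian(X_train)

# For Gaussian data the squared distance follows a chi-square distribution
# with one degree of freedom per feature, which gives a principled cutoff.
threshold = chi2.ppf(0.999, df=X_train.shape[1])

X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
d2 = mahalanobis_sq(X_new, mu, cov)
is_anomaly = d2 > threshold
print(d2, is_anomaly)
```

This works well only when the normal data is roughly Gaussian. Strongly skewed or multi-modal data would need a different density model.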

Feel free to walk through my ipython notebook https://github.com/meenavyas/Misc/blob/master/AnomalyDetection.ipynb

In this notebook, I have tried IsolationForest and LOF. As you can see in the plots given below, the points that got high anomaly scores from these algorithms are the anomalies.

To run anomaly detection automatically on streaming data, we may need infrastructure like Apache Spark.

Originally published at meenavyas.wordpress.com.
