Anomaly Detection

Jan 8, 2019

What is Anomaly Detection

In data science, anomaly detection is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.

In the following figure, the anomaly is a spike (shown in red). But if the same spike occurs at regular intervals, it is not an anomaly.

There are 3 types of machine learning techniques:

Supervised machine learning
Unsupervised machine learning
Semi-supervised machine learning

Refer to https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/ for more details.

Unsupervised anomaly detection techniques detect anomalies in an unlabeled data set. They assume that the majority of the instances in the data set are normal, and they look for the instances that fit the remainder of the data set least well.

We need unsupervised anomaly detection when we don't have labelled data, i.e. when we don't have labels marking when an anomaly has occurred.

Different types of Anomaly detection techniques are described below.

A safe bet is to use the wisdom of the crowds by running several detection methods as an ensemble. We can then combine the individual algorithms' verdicts through a majority vote, a union, or an intersection.
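The three combination rules can be sketched in a few lines of plain Python. This is only an illustration: the function name combine_verdicts and the detector verdicts are made up for the example, with True meaning "this detector flagged the point".

```python
from typing import Dict, List


def combine_verdicts(verdicts: Dict[str, List[bool]], rule: str = "majority") -> List[bool]:
    """Combine per-detector anomaly flags (True = anomaly) into one flag per point."""
    detectors = list(verdicts.values())
    if not detectors:
        raise ValueError("need at least one detector")
    n_points = len(detectors[0])
    if any(len(flags) != n_points for flags in detectors):
        raise ValueError("all detectors must score the same points")

    combined = []
    for point_flags in zip(*detectors):
        votes = sum(point_flags)
        if rule == "union":            # flagged by at least one detector
            combined.append(votes >= 1)
        elif rule == "intersection":   # flagged by every detector
            combined.append(votes == len(detectors))
        elif rule == "majority":       # flagged by more than half of the detectors
            combined.append(votes * 2 > len(detectors))
        else:
            raise ValueError(f"unknown rule: {rule}")
    return combined


# Hypothetical verdicts from three detectors on five points.
verdicts = {
    "isolation_forest": [True, False, False, True, False],
    "lof":              [True, False, True, False, False],
    "one_class_svm":    [True, False, False, False, False],
}
majority = combine_verdicts(verdicts, "majority")
union = combine_verdicts(verdicts, "union")
intersection = combine_verdicts(verdicts, "intersection")
```

Union catches the most anomalies but also raises the most false alarms, intersection is the most conservative, and majority vote sits in between.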

Isolation Forest and LoF

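LoF is a nearest neighbour (local density) based approach, while Isolation Forest is a tree based approach that isolates points using random splits
sklearn has IsolationForest and LocalOutlierFactor (LoF)
If the data is too big for a single machine, there is an implementation of LoF for Spark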
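Here is a minimal sketch of both sklearn estimators on synthetic data. The data, the contamination value and the other parameters are assumptions chosen for illustration, not taken from the notebook linked later in this post.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# Synthetic data (assumed for illustration): 200 normal points near the origin
# plus 5 hand-placed points far away from them.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-8.0, 8.0], [8.0, -8.0], [-8.0, -8.0], [0.0, 10.0]])
X = np.vstack([normal, outliers])

# contamination is our guess at the fraction of anomalies in the data.
iso = IsolationForest(n_estimators=100, contamination=0.025, random_state=42)
iso_labels = iso.fit_predict(X)   # -1 = anomaly, 1 = normal

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.025)
lof_labels = lof.fit_predict(X)   # -1 = anomaly, 1 = normal

print("IsolationForest flagged rows:", np.where(iso_labels == -1)[0])
print("LOF flagged rows:", np.where(lof_labels == -1)[0])
```

Both estimators return -1 for anomalies and 1 for normal points, so their outputs are easy to compare or combine.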

‘K’ Nearest Neighbour

This is a nearest neighbour based approach
Simply computing z-scores of the distances to the 'k' nearest neighbours and using a cutoff of 3 works surprisingly well in practice (though it is limited to global anomalies and can't find local outliers).
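One way to read this recipe is: compute each point's mean distance to its k nearest neighbours, turn those distances into z-scores, and flag anything above 3. The sketch below follows that interpretation using only NumPy. The function name, the synthetic data and the choice of mean distance are my assumptions.

```python
import numpy as np


def knn_zscore_anomalies(X, k=5, cutoff=3.0):
    """Flag points whose mean distance to their k nearest neighbours has a z-score above cutoff."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small data; needs O(n^2) memory).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the zero distance of a point to itself, so skip it.
    knn_dists = np.sort(dists, axis=1)[:, 1:k + 1]
    mean_knn = knn_dists.mean(axis=1)
    z = (mean_knn - mean_knn.mean()) / mean_knn.std()
    return z > cutoff, z


rng = np.random.RandomState(0)
# Synthetic data (assumed): 200 normal points and one far-away point at (10, 10).
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[10.0, 10.0]]])
flags, z = knn_zscore_anomalies(X, k=5, cutoff=3.0)
print("flagged rows:", np.where(flags)[0])
```

Because the z-score is computed over the whole data set, a point is only flagged if it is far from everything, which is why this approach misses local outliers.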

One class SVM

This is a classification based approach
One-class Support Vector Machine (OCSVM) can be used as an unsupervised anomaly detection method.
However, to work well, the percentage of anomalies in the dataset needs to be low.
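A minimal sketch with sklearn's OneClassSVM follows. To keep it simple, I assume the training data contains only normal points (the extreme case of a low anomaly percentage) and then score a few new observations. The data, nu and gamma are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
# Training data assumed to be (almost) free of anomalies.
X_train = rng.normal(0.0, 1.0, size=(300, 2))
# New observations to score; these three are deliberately far from the training data.
X_new = np.array([[6.0, 6.0], [-6.0, 6.0], [6.0, -6.0]])

# SVMs are sensitive to feature scale, so standardise using the training statistics.
scaler = StandardScaler().fit(X_train)

# nu roughly bounds the fraction of training points allowed to fall outside the boundary.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(scaler.transform(X_train))

new_labels = ocsvm.predict(scaler.transform(X_new))   # -1 = anomaly, 1 = normal
print("labels for new points:", new_labels)
```

If the training set contains many anomalies, the learned boundary stretches to include them, which is why the method needs the anomaly percentage to be low.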

CBOF (Cohesiveness Based Outlier Factor)

It is a clustering based anomaly detection technique.

Deep Learning LSTM/Auto encoders

Neural network approaches: RNN, LSTM (long short-term memory) and autoencoders
Available in Keras/TensorFlow and other libraries
Typically, neural networks need a lot of data

There are some more methods, such as the probability based multivariate Gaussian distribution, PCA and t-SNE.
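As a rough illustration of the multivariate Gaussian idea: fit a mean and covariance to the data, then flag points with a large squared Mahalanobis distance, which is the same as flagging points with low Gaussian density. The synthetic data and the 99th percentile threshold are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.RandomState(7)
# Synthetic correlated 2-D data (assumed for illustration).
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=500)

# Fit the Gaussian: estimate the mean vector and covariance matrix.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)


def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each row from the fitted Gaussian."""
    centered = np.atleast_2d(points) - mu
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)


# Assumed threshold: treat the most extreme 1% of the training distances as the cutoff.
threshold = np.percentile(mahalanobis_sq(X), 99)

# (0.5, 0.4) follows the correlation in the data; (3, -3) goes against it.
new_points = np.array([[0.5, 0.4], [3.0, -3.0]])
is_anomaly = mahalanobis_sq(new_points) > threshold
print("anomaly flags for new points:", is_anomaly)
```

Unlike a per-feature z-score, this takes the correlation between features into account, so a point can be flagged even when each coordinate looks unremarkable on its own.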

Feel free to walk through my IPython notebook: https://github.com/meenavyas/Misc/blob/master/AnomalyDetection.ipynb

In this notebook, I have tried IsolationForest and LoF. As you can see in the plots given below, the points that received high scores from these algorithms are anomalies.

To run anomaly detection automatically on streaming data, we may need infrastructure like Apache Spark.

Originally published at meenavyas.wordpress.com.

Written by Meena Vyas

12 Followers

6 Following

No responses yet

Write a response

What are your thoughts?

Also publish to my profile

Recommended from Medium

Anomaly Detection: Local Outlier Factor(LOF)

Mine Küçükavşar