Anomaly Detection and Correlation library

luminol

Overview

Luminol is a lightweight Python library for time series data analysis. The two major functionalities it supports are anomaly detection and correlation. It can be used to investigate possible causes of an anomaly. You collect time series data and Luminol can:

  • Given a time series, detect whether the data contains any anomalies and give you back the time window in which each anomaly happened, the timestamp at which the anomaly reaches its peak severity, and a score indicating how severe the anomaly is compared to others in the time series.
  • Given two time series, help find their correlation coefficient. Since the correlation mechanism allows some room for shifting, you can correlate two peaks that are slightly apart in time.

Luminol is configurable in the sense that you can choose which specific algorithm to use for anomaly detection or correlation. In addition, the library does not rely on any predefined threshold on the values of a time series. Instead, it assigns each data point an anomaly score and identifies anomalies using the scores.

By using the library, we can establish a logic flow for root cause analysis. For example, suppose there is a spike in network latency:

  • Anomaly detection discovers the spike in the network latency time series.
  • Get the anomaly period of the spike, and correlate it with other system metrics (GC, IO, CPU, etc.) in the same time range.
  • Get a ranked list of correlated metrics; the root cause candidates are likely to be at the top.
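The flow above can be sketched in plain Python, without luminol itself, to show what "correlate within the anomaly period and rank" means. The helper names `pearson` and `rank_candidates` and the sample metric values are hypothetical, and a plain Pearson coefficient stands in for whatever correlation algorithm you configure in the library.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_candidates(target, metrics, window):
    """Rank metrics by |correlation| with target inside window (inclusive)."""
    start, end = window
    stamps = [t for t in sorted(target) if start <= t <= end]
    ranked = []
    for name, series in metrics.items():
        common = [t for t in stamps if t in series]
        if len(common) < 2:
            continue  # not enough overlapping points to correlate
        coeff = pearson([target[t] for t in common],
                        [series[t] for t in common])
        ranked.append((name, coeff))
    ranked.sort(key=lambda item: abs(item[1]), reverse=True)
    return ranked


# Hypothetical data: latency spikes at t=2..3, and so does GC pause time.
latency = {0: 10, 1: 11, 2: 50, 3: 55, 4: 12, 5: 10}
metrics = {
    "gc_pause": {0: 1, 1: 1, 2: 9, 3: 10, 4: 1, 5: 1},
    "cpu": {0: 40, 1: 42, 2: 43, 3: 40, 4: 41, 5: 41},
}
ranking = rank_candidates(latency, metrics, (1, 4))
for name, coeff in ranking:
    print(name, round(coeff, 3))
```

In this toy data the GC metric ends up at the top of the list, which is the kind of candidate you would look at first.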

Investigating the possible ways to automate root cause analysis is one of the main reasons we developed this library and it will be a fundamental part of the future work.


Installation

Make sure you have python, pip, and numpy, then install directly through pip:

pip install luminol

The most up-to-date version of the library is 0.4.


Quick Start

This is a quick start guide for using luminol for time series analysis.

  1. Import the library. The submodules are imported explicitly so that luminol.anomaly_detector and luminol.correlator are available.
import luminol.anomaly_detector
import luminol.correlator
  2. Conduct anomaly detection on a single time series ts.
detector = luminol.anomaly_detector.AnomalyDetector(ts)
anomalies = detector.get_anomalies()
  3. If there is an anomaly, correlate the first anomaly period with a secondary time series ts2.
if anomalies:
    time_period = anomalies[0].get_time_window()
    correlator = luminol.correlator.Correlator(ts, ts2, time_period)
  4. Print the correlation coefficient (still inside the if block, since correlator only exists when an anomaly was found).
    print(correlator.get_correlation_result().coefficient)

These are very simple uses of luminol. For information about parameter types, return types, and optional parameters, please refer to the API section.


Modules

Modules in Luminol refer to customized classes developed for better data representation: Anomaly, CorrelationResult, and TimeSeries.

Anomaly

class luminol.modules.anomaly.Anomaly
It contains these attributes:

self.start_timestamp: # epoch seconds representing the start of the anomaly period.
self.end_timestamp: # epoch seconds representing the end of the anomaly period.
self.anomaly_score: # a score indicating how severe this anomaly is.
self.exact_timestamp: # epoch seconds indicating when the anomaly reaches its peak severity.

It has these public methods:

  • get_time_window(): returns a tuple (start_timestamp, end_timestamp).

CorrelationResult

class luminol.modules.correlation_result.CorrelationResult
It contains these attributes:

self.coefficient: # correlation coefficient.
self.shift: # the amount of shift needed to get the above coefficient.
self.shifted_coefficient: # a correlation coefficient with shift taken into account.

TimeSeries

class luminol.modules.time_series.TimeSeries

__init__(self, series)
  • series(dict): timestamp -> value

It has various handy methods for manipulating time series, including the generators iterkeys, itervalues, and iteritems. It also supports binary operations such as add and subtract. Please refer to the code and inline comments for more information.
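To make the idea of binary operations on timestamp -> value data concrete, here is a minimal sketch on plain dicts. It is not luminol's TimeSeries implementation: the `combine` helper is hypothetical, and restricting the result to timestamps present in both series is an assumption of this sketch, so check the library code for the actual alignment rules.

```python
def combine(series_a, series_b, op):
    """Apply op to the values at timestamps present in both series."""
    return {t: op(series_a[t], series_b[t])
            for t in sorted(series_a) if t in series_b}


a = {0: 1.0, 1: 2.0, 2: 3.0}
b = {1: 0.5, 2: 1.0, 3: 4.0}

summed = combine(a, b, lambda x, y: x + y)
diff = combine(a, b, lambda x, y: x - y)

for timestamp in summed:
    print(timestamp, summed[timestamp], diff[timestamp])
```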


API

The library contains two classes: AnomalyDetector and Correlator, and there are two sets of APIs, one corresponding to each class. There are also customized modules for better data representation. The Modules section in this documentation may provide useful information as you walk through the APIs.

AnomalyDetector

class luminol.anomaly_detector.AnomalyDetector

__init__(self, time_series, baseline_time_series=None, score_only=False, score_threshold=None,
         score_percentile_threshold=None, algorithm_name=None, algorithm_params=None,
         refine_algorithm_name=None, refine_algorithm_params=None)
  • time_series: The metric you want to conduct anomaly detection on. It can be any of the following three types:
1. string: # path to a csv file
2. dict: # timestamp -> value
3. luminol.modules.time_series.TimeSeries
  • baseline_time_series: an optional baseline time series of one of the types mentioned above.
  • score_only(bool): if True, anomaly scores for the time series will be available, while anomaly periods will not be identified.
  • score_threshold: if passed, anomaly scores above this value will be identified as anomalies. It takes precedence over score_percentile_threshold.
  • score_percentile_threshold: if passed, anomaly scores above this percentile will be identified as anomalies. It cannot override score_threshold.
  • algorithm_name(string): if passed, the specified algorithm will be used to compute anomaly scores.
  • algorithm_params(dict): additional parameters for the algorithm specified by algorithm_name.
  • refine_algorithm_name(string): if passed, the specified algorithm will be used to compute the timestamp of peak severity within each anomaly period.
  • refine_algorithm_params(dict): additional parameters for the algorithm specified by refine_algorithm_name.
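The precedence between the two threshold parameters can be illustrated with a small standalone sketch. It does not call luminol: `resolve_threshold` and `anomalous_timestamps` are hypothetical helpers, the percentile is assumed to be on a 0 to 100 scale, and the nearest-rank percentile used here is an assumption rather than the library's method. The scores are the ones shown in the Example section, rounded to three decimals.

```python
import math


def resolve_threshold(scores, score_threshold=None,
                      score_percentile_threshold=None):
    """Return the cutoff score; an explicit score_threshold wins."""
    if score_threshold is not None:
        return score_threshold
    if score_percentile_threshold is not None:
        ordered = sorted(scores.values())
        # Nearest-rank percentile (an assumption of this sketch).
        rank = max(1, math.ceil(score_percentile_threshold / 100.0 * len(ordered)))
        return ordered[rank - 1]
    return None


def anomalous_timestamps(scores, threshold):
    """Timestamps whose score is strictly above the threshold."""
    return [t for t in sorted(scores) if scores[t] > threshold]


scores = {0: 0.0, 1: 0.873, 2: 1.572, 3: 2.136, 4: 1.709,
          5: 2.905, 6: 1.172, 7: 0.937, 8: 0.750}

# Both passed: the absolute threshold takes precedence.
both = resolve_threshold(scores, score_threshold=1.5,
                         score_percentile_threshold=75)
# Only the percentile passed: the cutoff is derived from the scores.
pct_only = resolve_threshold(scores, score_percentile_threshold=75)

print(both, anomalous_timestamps(scores, both))
print(pct_only, anomalous_timestamps(scores, pct_only))
```

With the absolute threshold of 1.5, the flagged timestamps run from 2 to 5, which lines up with the anomaly period reported in the correlation example later in this document.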

Available algorithms and their additional parameters are:

1.  'bitmap_detector': # behaves well for huge data sets, and it is the default detector.
    {
      'precision'(4): # how many sections to categorize values,
      'lag_window_size'(2% of the series length): # lagging window size,
      'future_window_size'(2% of the series length): # future window size,
      'chunk_size'(2): # chunk size.
    }
2.  'default_detector': # used when other algorithms fail, not meant to be explicitly used.
3.  'derivative_detector': # meant to be used when abrupt changes of value are of main interest.
    {
      'smoothing_factor'(0.2): # smoothing factor used to compute exponential moving averages
                                # of derivatives.
    }
4.  'exp_avg_detector': # meant to be used when values are in a roughly stationary range,
                        # and it is the default refine algorithm.
    {
      'smoothing_factor'(0.2): # smoothing factor used to compute exponential moving averages.
      'lag_window_size'(20% of the series length): # lagging window size.
      'use_lag_window'(False): # if True, a lagging window of size lag_window_size will be used.
    }
    }
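As a rough intuition for what a smoothing factor does, the following sketch scores each point by how far it sits from an exponential moving average of the points before it. This is a simplified illustration only, not the code of 'exp_avg_detector': the function name `ema_scores` is hypothetical, and the real detector may normalize or window its scores differently.

```python
def ema_scores(series, smoothing_factor=0.2):
    """Score each point by its distance from the running exponential average."""
    scores = {}
    ema = None
    for timestamp in sorted(series):
        value = series[timestamp]
        if ema is None:
            ema = value  # seed the average with the first observation
        scores[timestamp] = abs(value - ema)
        # A larger smoothing factor makes the average follow new values faster.
        ema = smoothing_factor * value + (1 - smoothing_factor) * ema
    return scores


ts = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}
sketch_scores = ema_scores(ts)
for timestamp in sorted(sketch_scores):
    print(timestamp, round(sketch_scores[timestamp], 3))
```

A smaller smoothing factor keeps the average closer to older values, so a sudden jump stays "surprising" for longer.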

The meanings of some of the parameters above may seem vague. Here are some useful insights:

The AnomalyDetector class has the following public methods:

  • get_all_scores(): returns an anomaly score time series of type TimeSeries.
  • get_anomalies(): returns a list of Anomaly objects.
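Putting the constructor parameters and algorithm table together, a call that tunes the bitmap detector might look like the sketch below. The parameter keys come from the table above; the helper names `bitmap_params` and `detect_with_bitmap` are hypothetical, and treating the window sizes as absolute point counts (2% of the series length) is an assumption to verify against the library code. luminol is imported lazily so the parameter helper can be tried without the package installed.

```python
def bitmap_params(series_length, window_fraction=0.02):
    """Build an algorithm_params dict for 'bitmap_detector'.

    Assumes window sizes are given as a number of data points.
    """
    window = max(1, int(series_length * window_fraction))
    return {
        "precision": 4,
        "lag_window_size": window,
        "future_window_size": window,
        "chunk_size": 2,
    }


def detect_with_bitmap(ts):
    """Run anomaly detection on a dict time series with explicit parameters."""
    # Imported here so that bitmap_params() works without luminol installed.
    from luminol.anomaly_detector import AnomalyDetector

    detector = AnomalyDetector(
        ts,
        algorithm_name="bitmap_detector",
        algorithm_params=bitmap_params(len(ts)),
    )
    return detector.get_anomalies()


params = bitmap_params(1000)
print(params)
```

Calling `detect_with_bitmap(ts)` with a timestamp -> value dict would then return the list of Anomaly objects described above, provided luminol is installed.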

Correlator

class luminol.correlator.Correlator

__init__(self, time_series_a, time_series_b, time_period=None, use_anomaly_score=False,
         algorithm_name=None, algorithm_params=None)
  • time_series_a: a time series; for its type, please refer to time_series for AnomalyDetector above.
  • time_series_b: a time series; for its type, please refer to time_series for AnomalyDetector above.
  • time_period(tuple): the time period over which to correlate the two time series.
  • use_anomaly_score(bool): if True, the anomaly scores of the time series will be used to compute the correlation coefficient instead of the original data in the time series.
  • algorithm_name: if passed, the specified algorithm will be used to calculate the correlation coefficient.
  • algorithm_params: any additional parameters for the algorithm specified by algorithm_name.

Available algorithms and their additional parameters are:

1.  'cross_correlator': # when correlating two time series, it tries to shift the series around so that it
                       # can catch spikes that are slightly apart in time.
    {
      'max_shift_seconds'(60): # maximal allowed shift room in seconds,
      'shift_impact'(0.05): # weight of shift in the shifted coefficient.
    }
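The idea of a shift room can be shown with a small standalone sketch: try every shift within a limit, compute a correlation for each alignment, and discount alignments that needed a larger shift. This is not the 'cross_correlator' implementation. The helper names, the use of integer timestamp steps instead of seconds, and the linear penalty formula are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_shift_correlation(series_a, series_b, max_shift=2, shift_impact=0.05):
    """Return (coefficient, shift, shifted_coefficient) for the best alignment.

    A positive shift means series_b is compared shift steps later than series_a.
    """
    best = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(series_a[t], series_b[t + shift])
                 for t in sorted(series_a) if (t + shift) in series_b]
        if len(pairs) < 2:
            continue
        coeff = pearson([a for a, _ in pairs], [b for _, b in pairs])
        # Hypothetical penalty: the further we had to shift, the bigger the discount.
        penalty = shift_impact * abs(shift) / max_shift if max_shift else 0.0
        shifted = coeff * (1 - penalty)
        if best is None or shifted > best[2]:
            best = (coeff, shift, shifted)
    return best


# Two spikes one step apart: uncorrelated as-is, perfectly aligned after a shift of 1.
a = {0: 0, 1: 0, 2: 0, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
b = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 0, 6: 0, 7: 0}
coefficient, shift, shifted_coefficient = best_shift_correlation(a, b)
print(coefficient, shift, shifted_coefficient)
```

The returned triple mirrors the three attributes of CorrelationResult described in the Modules section: the coefficient, the shift that produced it, and a coefficient that takes the shift into account.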

The Correlator class has the following public methods:

  • get_correlation_result(): returns a CorrelationResult object.
  • is_correlated(threshold=0.7): if the coefficient is above the passed-in threshold, returns a CorrelationResult object. Otherwise, returns False.

Example

  1. Calculate anomaly scores.
from luminol.anomaly_detector import AnomalyDetector

ts = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}

my_detector = AnomalyDetector(ts)
score = my_detector.get_all_scores()
for timestamp, value in score.iteritems():
    print(timestamp, value)

""" Output:
0 0.0
1 0.873128250131
2 1.57163085024
3 2.13633686334
4 1.70906949067
5 2.90541813415
6 1.17154110935
7 0.937232887479
8 0.749786309983
"""
  2. Correlate ts1 with ts2 on every anomaly.
from luminol.anomaly_detector import AnomalyDetector
from luminol.correlator import Correlator

ts1 = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}
ts2 = {0: 0, 1: 0.5, 2: 1, 3: 0.5, 4: 1, 5: 0, 6: 1, 7: 1, 8: 1}

my_detector = AnomalyDetector(ts1, score_threshold=1.5)
score = my_detector.get_all_scores()
anomalies = my_detector.get_anomalies()
for a in anomalies:
    time_period = a.get_time_window()
    my_correlator = Correlator(ts1, ts2, time_period)
    if my_correlator.is_correlated(threshold=0.8):
        print("ts2 correlates with ts1 at time period (%d, %d)" % time_period)

""" Output:
ts2 correlates with ts1 at time period (2, 5)
"""

Contributing

Clone the source, then install the package and dev requirements:

pip install -r requirements.txt
pip install pytest pytest-cov pylama

Tests and linting run with:

python -m pytest --cov=src/luminol/ src/luminol/tests/
python -m pylama -i E501 src/luminol/

Issues

  • Python 3.6 doesn't run your examples

    Looks like the support for Python 3 is not completely developed yet. With Python 3.6 the examples you have don't run:

    from luminol.anomaly_detector import AnomalyDetector

    What's the easy fix for this?

    opened by sjjpo2002 8
  • Add sign test to algorithms

    Patterned off of percent difference with added parameters

    opened by noblerwe 7
  • add Python 3

    opened by brennv 6
  • latest version (0.4) not published to pypi?

    Hello! I see that all of @brennv 's PRs have been merged in (see issue #22 ), and the package version has been incremented here in the repo, but PyPi has not yet been updated to v0.4.

    @RiteshMaheshwari , could I ask you for one last favor: Publish the latest version of luminol to PyPi, so we can reap the benefits of all those recent commits? Again, if you're not the person to tag / nag, please point me in the right direction. Thanks for your help!

    opened by bdewilde 4
  • Fixed typos in README

    Hi,

    I fixed few small typos in README.md file. Please check once and let me know if there are any issues.

    Thanks!

    Also, it would be good if we specify the latest version of dependencies in requirements.txt file.

    opened by vicky002 4
  • pep8

    Any time when suggesting pep8 changes we are reminded of this. With that in mind, I'd like to suggest some aesthetic changes for convention and maintainability. In our Travis runs we check pep8 compliance as the last step. The last results before this PR are shown here:

    https://travis-ci.org/linkedin/luminol/jobs/273116296#L663

    This first commit serves only to make imports more explicit. It resolves all the warnings that read: W0401 'from luminol.constants import *' used; unable to detect undefined names [pyflakes]

    I'd like to add additional commits to this PR if folks are open to it.

    opened by brennv 4
  • Fix spacing

    opened by brennv 4
  • problem with import

    I have installed the package via pip (using Python 2.7.12). When I import the luminol module, it lacks all of the basic functions. I've attached a screenshot. Can you help me?

    opened by guyoh 3
  • error in AnomalyDetector instantiation

    I'm trying to run the Quick Start example, and in the very first command that instantiates a detector I get the following error:

    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    

    The command run is

    detector = anomaly_detector.AnomalyDetector(ts)
    

    Could this be caused by a change in the Pandas library?

    I installed the last version of luminol, 0.3.1, with pip and I'm using Python 2.7.11 from the Anaconda distribution version 4.1.0 on Kubuntu 15.10.

    This is the full traceback of the error:

    <class 'pandas.core.series.Series'>
    Traceback (most recent call last):
      File "open_heat_treatments.py", line 97, in <module>
        main()
      File "open_heat_treatments.py", line 90, in main
        detector = anomaly_detector.AnomalyDetector(ts)
      File "/home/dp/anaconda2/lib/python2.7/site-packages/luminol/anomaly_detector.py", line 44, in __init__
        self.time_series = self._load(time_series)
      File "/home/dp/anaconda2/lib/python2.7/site-packages/luminol/anomaly_detector.py", line 69, in _load
        if not time_series:
      File "/home/dp/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 892, in __nonzero__
        .format(self.__class__.__name__))
    ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
    
    opened by ghost 2
  • Fix tiny typo

    opened by that-guy-iain 0
  • Import anomaly_detector and correlator into luminol

    With this, the example from README works.

    In other words with proposed change one can import luminol and later detector = luminol.anomaly_detector.AnomalyDetector(ts), currently anomaly_detector is not visible as a member of luminol module.

    opened by sjachim 0
  • Update exp_avg_detector.py

    abs bug fix

    opened by kidluo 0
  • Warn user on automatic modify of algorithm or parameters

    Please throw a warning message to the user when automatically modifying parameters or algorithms. Doing this silently makes it extremely difficulty to debug and fine-tune.

    https://github.com/linkedin/luminol/blob/42e4ab969b774ff98f902d064cb041556017f635/src/luminol/algorithms/anomaly_detector_algorithms/bitmap_detector.py#L60-L73

    https://github.com/linkedin/luminol/blob/42e4ab969b774ff98f902d064cb041556017f635/src/luminol/anomaly_detector.py#L91-L104

    opened by devinaconley 0
  • error in diff_percent_threshold.py

    the code in enumerater should be baseline_value = self.baseline_time_series[timestamp] instead of baseline_value = self.baseline_time_series[i]. otherwise it will give "timestamp does not exist in time series object" exception.

    opened by khushalvora 0
  • Package Definition

    Hi,

    This is not really an issue but couple questions. The example code that calculates the anomaly scores e.g:

    from luminol.anomaly_detector import AnomalyDetector

    ts = {0: 0, 1: 0.5, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0}

    my_detector = AnomalyDetector(ts)
    score = my_detector.get_all_scores()
    for timestamp, value in score.iteritems():
        print(timestamp, value)

    Does it calculate the scores as they come like a real-time anomaly detection instead of looking at what the value is before? Is there a way to tune the parameters of the above code as well like the window size and chunk size? If so, how?

    Thank you very much.

    opened by priencesstan 0
  • Materials to read about anomaly detection

    Not really a code related question, but more of a methods question. Is there some material on the basic concepts that have been used to develop the luminol package? For example, whats the basic idea behind detecting the anomaly, how to interpret the score, how does the algorithm handles the seasonality and trend in the data. Should we make the time series stationary before using it? Or how does the package manages to work with data with non-stationary time series?

    Thanks

    opened by PanditPranav 1
  • AnomalyDetector parameter typo corrected

    AnomalyDetector class constructor parameter "score_percent_threshold" typed incorrectly as "score_precentile_threshold".

    opened by sametdumankaya 0
  • Installation error in Alpine

    I am trying to install Luminol in Alpine, but its throwing error while installing numpy. Is it possible to install Luminol in Alpine with python 3.6?

    opened by diksha-rawat 0
  • without pip install, can I use it?

    I am trying to use it inside out restricted environment, is there a way I can download the package and run following instructions from you in our DEV environment?

    opened by rroyhere 0