Extended Isolation Forest for Anomaly Detection

Extended Isolation Forest

This is a simple Python implementation of the Extended Isolation Forest method described in this paper (https://doi.org/10.1109/TKDE.2019.2947676). It is an improvement on the original Isolation Forest algorithm, which is described (among other places) in this paper, for detecting anomalies and outliers in multidimensional data point distributions. An R wrapper around the core Python implementation can be found here.

Summary

The problem of anomaly detection has a wide range of applications across various fields of science. Anomalous data can have as much scientific value as normal data, or in some cases even more, and it is of vital importance to have robust, fast and reliable algorithms to detect and flag such anomalies. Here, we present an extension to the model-free anomaly detection algorithm Isolation Forest (Liu2008). This extension, named Extended Isolation Forest (EIF), improves the consistency and reliability of the anomaly score produced by standard methods for a given data point. We show that the standard Isolation Forest produces inconsistent anomaly score maps, and that these score maps suffer from an artifact that results from how the criterion for the branching operation of the binary tree is selected.

Our method allows the data to be sliced using hyperplanes with random slopes, which results in improved score maps. The consistency and reliability of the algorithm are much improved by this extension. Here we show the need for an improvement on the source algorithm to improve the scoring of anomalies and the robustness of the score maps, especially around the edges of nominal data. We discuss the sources of the problem, and we present an efficient way of choosing these hyperplanes, which gives rise to multiple extension levels in the case of higher-dimensional data. The standard Isolation Forest is therefore a special case of the Extended Isolation Forest as presented here. For an N-dimensional dataset, Extended Isolation Forest has N levels of extension, with 0 being identical to the standard Isolation Forest and N-1 being the fully extended version.

Motivation

Figure 1: Example training data. a) Normally distributed cluster. b) Two normally distributed clusters. c) Sinusoidal data points with Gaussian noise.

While various techniques exist for approaching anomaly detection, Isolation Forest (Liu2008) is one with unique capabilities. This algorithm can readily work on high-dimensional data, it is model free, and it scales well. It is therefore highly desirable and easy to use. However, looking at score maps for some basic examples, we can see that the anomaly scores produced by the standard Isolation Forest are inconsistent. To see this, we look at the three examples shown in Figure 1.

In each case, we use the data to train our Isolation Forest. We then use the trained models to score a square grid of uniformly distributed data points, which results in score maps shown in Figure 2. Through the simplicity of the example data, we have an intuition about what the score maps should look like. For example, for the data shown in Figure 1a, we expect to see low anomaly scores in the center of the map, while the anomaly score should increase as we move radially away from the center. Similarly for the other figures.

Looking at the score maps produced by the standard Isolation Forest shown in Figure 2, we can clearly see the inconsistencies in the scores. While we can clearly see a region of low anomaly score in the center in Figure 2a, we can also see regions aligned with the x and y axes passing through the origin that have lower anomaly scores compared to the four corners of the region. Based on our intuitive understanding of the data, this cannot be correct. A similar phenomenon is observed in Figure 2b. In this case, the problem is amplified. Since there are two clusters, the artificially low anomaly score regions intersect close to the points (0,0) and (10,10), and create low anomaly score regions where there is no data. It is immediately obvious how this can be problematic. As for the third example, Figure 2c shows that the structure of the data is completely lost. The sinusoidal shape is essentially treated as one rectangular blob.

Figure 2: Score maps using the standard Isolation Forest for the points from Figure 1. We can see the bands and artifacts on these maps.

Isolation Forest

Given a dataset of dimension N, the algorithm chooses a random sub-sample of data to construct a binary tree. The branching process of the tree occurs by selecting a random dimension x_i with i in {1,2,...,N} of the data (a single variable). It then selects a random value v within the minimum and maximum values in that dimension. If a given data point possesses a value smaller than v for dimension x_i, then that point is sent to the left branch, otherwise it is sent to the right branch. In this manner the data on the current node of the tree is split in two. This process of branching is performed recursively over the dataset until a single point is isolated, or a predetermined depth limit is reached. The process begins again with a new random sub-sample to build another randomized tree. After building a large ensemble of trees, i.e. a forest, the training is complete.
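The branching procedure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code shipped in this package: the names `build_tree` and `path_length`, the dictionary-based node layout, and the choice to stop when the selected dimension has no spread are all assumptions made for the example. It builds a single tree; a forest would repeat this on fresh random sub-samples.

```python
import random

def build_tree(points, depth=0, depth_limit=8, rng=random):
    """Recursively split `points` with axis-parallel cuts (hypothetical helper)."""
    # Stop once a single point is isolated or the depth limit is reached.
    if len(points) <= 1 or depth >= depth_limit:
        return {"size": len(points)}
    # Select a random dimension x_i ...
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        # Simplification for this sketch: no spread along x_i, so stop here.
        return {"size": len(points)}
    # ... and a random value v between the minimum and maximum in that dimension.
    v = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < v]    # smaller than v: left branch
    right = [p for p in points if p[dim] >= v]  # otherwise: right branch
    return {
        "dim": dim,
        "value": v,
        "left": build_tree(left, depth + 1, depth_limit, rng),
        "right": build_tree(right, depth + 1, depth_limit, rng),
    }

def path_length(tree, point, depth=0):
    """Depth at which `point` reaches an external node of `tree`."""
    if "size" in tree:
        return depth
    branch = "left" if point[tree["dim"]] < tree["value"] else "right"
    return path_length(tree[branch], point, depth + 1)
```

A point far from the bulk of the data is usually separated after only a few cuts, so its path length tends to be much shorter than that of a point in the middle of a cluster.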

During the scoring step, a new candidate data point (or one chosen from the data used to create the trees) is run through all the trees, and an ensemble anomaly score is assigned based on the depth the point reaches in each tree. Figure 3 shows a schematic example of a tree and a forest plotted radially.

Figure 3: a) shows an example tree formed from the example data, while b) shows the forest generated, where each tree is represented by a radial line from the center to the outer circle. Anomalous points (shown in red) are isolated very quickly, which means they reach shallower depths than nominal points (shown in blue).

It turns out that the splitting process described above is the main source of the bias observed in the score maps. Figure 4 shows the process described above for each one of the examples considered thus far. The branch cuts are always parallel to the axes and, as a result, over the construction of many trees, regions of the domain that contain no data points receive superfluous branch cuts.

Figure 4: Splitting of data in the domain during the process of construction of one tree.

Extension

The Extended Isolation Forest remedies this problem by allowing the branching process to occur in every direction. The process of choosing branch cuts is altered so that at each node, instead of choosing a random feature along with a random value, we choose a random normal vector along with a random intercept point.
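A minimal sketch of such a branch cut is given below. It is an illustration under stated assumptions rather than the package's actual code: the function names `random_hyperplane` and `goes_left` are hypothetical, the intercept is assumed to be drawn uniformly from the bounding box of the points at the node, and the extension level is assumed to work by zeroing N - level - 1 randomly chosen components of the normal vector, so that level 0 reduces to an axis-parallel cut.

```python
import random

def random_hyperplane(points, extension_level, rng=random):
    """Draw a random normal vector and intercept point for one branch cut."""
    dims = len(points[0])
    # Normal vector: every component drawn from a standard normal distribution.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    # Assumed extension rule: zero out dims - extension_level - 1 components.
    # Level 0 leaves one non-zero component (an axis-parallel cut);
    # level dims - 1 leaves all components free (fully extended).
    for i in rng.sample(range(dims), dims - extension_level - 1):
        normal[i] = 0.0
    # Intercept: a random point inside the bounding box of the node's data.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dims)
    ]
    return normal, intercept

def goes_left(point, normal, intercept):
    """Branching test: (x - p) . n <= 0 sends the point to the left branch."""
    return sum((x - p) * n for x, p, n in zip(point, intercept, normal)) <= 0
```

With `extension_level=0` the dot product involves a single coordinate, so the test reduces to comparing one feature against a threshold, which is the standard Isolation Forest split (up to the sign of the normal component, which only swaps the left and right branches).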

Figure 5 shows the resulting branch cuts in the domain for each of our examples.

Figure 5: Same as Figure 4 but using Extended Isolation Forest

We can see that the region is divided much more uniformly, and without the bias-introducing effects of the coordinate system. As in the case of the standard Isolation Forest, the anomaly score is computed from the aggregated depth that a given point reaches on each iTree.
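For reference, the aggregation of depths into a score can be sketched as follows. The formula used here is the one from the original Isolation Forest paper (Liu2008), s(x, n) = 2^(-E[h(x)] / c(n)), with c(n) = 2H(n-1) - 2(n-1)/n and the harmonic number approximated as H(i) ≈ ln(i) + 0.5772156649. The function names are hypothetical, and this package's implementation may differ in detail (for example, in how points that stop at the depth limit are adjusted).

```python
import math

EULER_GAMMA = 0.5772156649015329

def c_factor(n):
    """Normalising constant c(n): average path length for a sample of size n."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(depths, sample_size):
    """Combine the depths reached in each tree into a score in (0, 1]."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / c_factor(sample_size))
```

Under this definition, a point whose mean depth equals c(n) scores 0.5, while points isolated at shallower depths score closer to 1. The logarithmic approximation is intended for sample sizes well above 2.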

As we see in Figure 6, these modifications completely fix the issue with the score maps that we saw before and produce reliable results. Clearly, these score maps are a much better representation of anomaly score distributions.

Figure 6: Score maps using the Extended Isolation Forest.

Figure 7 shows a very simple example of anomalies and nominal points from the single-blob example shown in Figure 1a. It also shows the distribution of the anomaly scores, which can be used to make hard cuts on the definition of anomalies or even to assign probabilities to each point.

Figure 7: a) Shows the dataset used, some sample anomalous data points discovered using the algorithm are highlighted in black. We also highlight some nominal points in red. In b), we have the distribution of anomaly scores obtained by the algorithm.
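One simple way to make such a hard cut is to flag a fixed fraction of the highest-scoring points. The helper below is a hypothetical illustration, not part of this package; the name `flag_anomalies` and the default fraction of 5% are arbitrary choices for the example.

```python
def flag_anomalies(scores, fraction=0.05):
    """Return the indices of the highest-scoring `fraction` of points."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Flag at least one point, even for very small inputs.
    k = max(1, round(len(scores) * fraction))
    # Rank indices by score, highest first, and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.41, 0.39, 0.82, 0.45, 0.40, 0.77, 0.43, 0.38, 0.42, 0.44]
print(flag_anomalies(scores, fraction=0.2))  # prints [2, 5]
```

The appropriate fraction (or an absolute score threshold) depends on the application and is best chosen by inspecting the score distribution, as in Figure 7b.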

The Code

Here we provide the source code for the algorithm as well as documented example notebooks to help get started. Various visualizations are provided such as score distributions, score maps, aggregate slicing of the domain, and tree and whole forest visualizations. Most examples are in 2D. We present one 3D example. However, the algorithm works readily with higher dimensional data.

Installation

pip install eif

or directly from the repository

pip install git+https://github.com/sahandha/eif.git

Alternatively, you can install the eif R package from here, which provides an R wrapper around the core Python implementation.

Requirements

  • numpy
  • cython

No other packages are required for the core algorithm. The package also includes tools for drawing the resulting trees using the igraph library. See the example for tree visualizations.

Use

See these notebooks for examples of how to use it.

Citation

If you use this code and method, please consider citing the following reference:

A link to the paper can be found here.

@ARTICLE{8888179,
author={S. {Hariri} and M. {Carrasco Kind} and R. J. {Brunner}},
journal={IEEE Transactions on Knowledge and Data Engineering},
title={Extended Isolation Forest},
year={2019},
volume={},
number={},
pages={1-1},
keywords={Forestry;Vegetation;Distributed databases;Anomaly detection;Standards;Clustering algorithms;Heating systems;Anomaly Detection;Isolation Forest},
doi={10.1109/TKDE.2019.2947676},
ISSN={},
month={},}

Releases

v2.0.2

2019-NOV-14

  • Converted the code to C++ using Cython.
  • Much faster and more efficient forest generation and scoring procedures.
  • Previous implementation renamed; use import eif_old to access the old version.

v1.0.2

2018-OCT-01

  • Release
  • Added documentation, examples and software paper

v1.0.1

2018-AUG-08

  • Bugfix for multidimensional data

v1.0.0

2018-JUL-15

  • Initial Release
Comments
  • Error while installing eif

    Hi!

    Trying to install eif through pip I get the following error:

    
    (base) C:\WINDOWS\system32>pip install eif
    Collecting eif
      Using cached https://files.pythonhosted.org/packages/83/b2/d87d869deeb192ab599c899b91a9ad1d3775d04f5b7adcaf7ff6daa54c24/eif-2.0.2.tar.gz
    Requirement already satisfied: numpy in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (1.16.5)
    Requirement already satisfied: cython in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (0.29.13)
    Building wheels for collected packages: eif
      Building wheel for eif (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-wheel-kw_2kpwv' --python-tag cp37
           cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
      Complete output (60 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.7
      copying eif_old.py -> build\lib.win-amd64-3.7
      copying version.py -> build\lib.win-amd64-3.7
      running egg_info
      writing eif.egg-info\PKG-INFO
      writing dependency_links to eif.egg-info\dependency_links.txt
      writing requirements to eif.egg-info\requires.txt
      writing top-level names to eif.egg-info\top_level.txt
      reading manifest file 'eif.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'eif.egg-info\SOURCES.txt'
      running build_ext
      cythoning _eif.pyx to _eif.cpp
      building 'eif' extension
      creating build\temp.win-amd64-3.7
      creating build\temp.win-amd64-3.7\Release
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
      In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                       from eif.hxx:5,
                       from _eif.cpp:614:
      C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
       #error This file requires compiler and library support for the \
        ^
      In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                       from _eif.cpp:612:
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                   ^
      In file included from _eif.cpp:614:0:
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
               void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                             ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
               Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                  ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
       inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                     ^
      _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                   module_name, class_name, size, basicsize);
                                                           ^
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
      _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
      error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
      ----------------------------------------
      ERROR: Failed building wheel for eif
      Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
        Running setup.py install for eif ... error
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile
             cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
        Complete output (60 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.7
        copying eif_old.py -> build\lib.win-amd64-3.7
        copying version.py -> build\lib.win-amd64-3.7
        running egg_info
        writing eif.egg-info\PKG-INFO
        writing dependency_links to eif.egg-info\dependency_links.txt
        writing requirements to eif.egg-info\requires.txt
        writing top-level names to eif.egg-info\top_level.txt
        reading manifest file 'eif.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'eif.egg-info\SOURCES.txt'
        running build_ext
        skipping '_eif.cpp' Cython extension (up-to-date)
        building 'eif' extension
        creating build\temp.win-amd64-3.7
        creating build\temp.win-amd64-3.7\Release
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
        In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                         from eif.hxx:5,
                         from _eif.cpp:614:
        C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
         #error This file requires compiler and library support for the \
          ^
        In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                         from _eif.cpp:612:
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                  "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                     ^
        In file included from _eif.cpp:614:0:
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
                 void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                               ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
                 Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                    ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
         inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                       ^
        _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                     module_name, class_name, size, basicsize);
                                                             ^
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
        _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
        error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
    
    
    opened by PoradaKev 25
  • Can the extension concept be applied to Gradient Boosted Machines?

    Hi there,

    This might be a dumb question.

    I was curious whether the "extension" concept that you introduce can be applied to supervised methods such as the Gradient Boosted Trees (GBT) algorithm. There are several widely known implementations, such as XGBoost and LightGBM. All of these GBT implementations also suffer from "box"-like decision boundaries. I believe it would be great to see GBT create decision boundaries the way your Extended Isolation Forest does.

    What do you guys think?

    Feel free to close this issue since it's not a real issue, just a discussion.

    opened by alfian777 5
  • Installation problem

    Hello, I'm trying to install this package, but I get error messages and cannot install it. Can you help?

    Windows 10

    (base) C:\Users\quirosgu>pip install eif
    Collecting eif
      Using cached eif-2.0.2.tar.gz (1.6 MB)
    Requirement already satisfied: numpy in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (1.18.5)
    Requirement already satisfied: cython in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (0.29.21)
    Building wheels for collected packages: eif
      Building wheel for eif (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\quirosgu\AppData\Local\Temp\pip-wheel-6t9epked'
           cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
      Complete output (19 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win32-3.8
      copying eif_old.py -> build\lib.win32-3.8
      copying version.py -> build\lib.win32-3.8
      running egg_info
      writing eif.egg-info\PKG-INFO
      writing dependency_links to eif.egg-info\dependency_links.txt
      writing requirements to eif.egg-info\requires.txt
      writing top-level names to eif.egg-info\top_level.txt
      reading manifest file 'eif.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'eif.egg-info\SOURCES.txt'
      running build_ext
      cythoning _eif.pyx to _eif.cpp
      building 'eif' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

      ERROR: Failed building wheel for eif
      Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
        Running setup.py install for eif ... error
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif'
             cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
        Complete output (19 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win32-3.8
        copying eif_old.py -> build\lib.win32-3.8
        copying version.py -> build\lib.win32-3.8
        running egg_info
        writing eif.egg-info\PKG-INFO
        writing dependency_links to eif.egg-info\dependency_links.txt
        writing requirements to eif.egg-info\requires.txt
        writing top-level names to eif.egg-info\top_level.txt
        reading manifest file 'eif.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'eif.egg-info\SOURCES.txt'
        running build_ext
        skipping '_eif.cpp' Cython extension (up-to-date)
        building 'eif' extension
        error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif' Check the logs for full command output.

    In this case, I had already installed all the required dependencies (Microsoft Visual C++, etc.), but the problem persists.

    I tried to reproduce it on another Windows machine and the installation fails there too; on a Linux-based system, by contrast, it works.

    opened by luigiquiros 3
  • PR for Parallelization and Reduce Memory

    Hello,

    For high-dimensional datasets, I'm finding that multi-processing parallelization can speed things up a bit. I also find that storing the original data in each Node and each iTree consumes a lot of memory needlessly. Would you be open to reviewing a Pull Request (or Pull Requests) addressing both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?

    Thanks

    opened by pford221 3
  • Use in novelty detection/one-class classification

    From what I understand, your API doesn't distinguish between constructing the trees and querying them to obtain scores (like the fit/predict methods of scikit-learn), is that correct?

    So it's not currently possible to use this implementation for novelty detection/one-class classification, where the training set is different from the test set?

    opened by oulenz 2
  • Scoring takes too long

    My training and validation data are of similar size (about 1,500,000 rows and 11 features). Model building took very little time, even with full extension. However, when scoring the validation data using compute_paths, the function has been running for close to 15 hours and scoring is still not done. Is there some way to speed up the scoring process?

    opened by thedarklord310780 2
  • Add Arxiv paper to readme

    Thanks for providing this code. Please add a mention of, and a link to, your associated arXiv paper in the repo's readme. The link is https://arxiv.org/abs/1811.02141

    opened by impredicative 2
  • setting ExtensionLevel

    If I understand the paper correctly, we obtain the full EIF approach by setting ExtensionLevel equal to the number of dimensions of the data minus 1, correct?

    opened by oulenz 1
  • Small fix install progress

    One of the extra compile arguments in setup.py seemed to prevent successful installation on multiple systems. Simply removing this argument seems to resolve this with no negative implications. The argument seems to try to force the compiler to run in C++11 mode. I am unsure whether this was even present on the tested systems.

    opened by Dainean 1
  • Update eif.py

    Goal: more convenient usage. Inspired by the tutorial document, I added two functions, outlier_pred and outlier_index, to iForest, which return the outlier prediction index and label matrix.

    opened by MaiRajborirug 0
  • How to save the eif Model?

    I am trying to save the model using pickle.dump(), but this is not working. How do I save the eif model? Please provide me with a solution, as I am stuck on this problem. Thank you.

    opened by SanthanaMano 0
  • module 'eif' has no attribute '__version__'

    I installed eif with "pip install eif", which reported "Successfully installed eif-2.0.2", but when I use eif.iForest I get: AttributeError: module 'eif' has no attribute '__version__'

    opened by wererLinC 0
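
When a module does not expose a `__version__` attribute, the installed version can still be read from the package metadata. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8 and later); the helper name `installed_version` is hypothetical, and this does not address whatever inside the package triggers the error.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if eif is installed, otherwise None.
print(installed_version("eif"))
```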
  • I can't install eif 2.0.2, please tell me the reason

    (base) C:\Users\22393\eif-2.0.2\eif-2.0.2>python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    installing library code to build\bdist.win-amd64\egg
    running install_lib
    running build_py
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IE:\ProgramFiles\anaconda\lib\site-packages\numpy\core\include -IE:\ProgramFiles\anaconda\include -IE:\ProgramFiles\anaconda\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.8\Release_eif.obj -Wcpp
    cl: Command line error D8021: invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit status 2

    opened by whmwhm123 0
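
The log shows the GCC-style flag `-Wcpp` being passed to the MSVC compiler `cl.exe`, which reads it as a `/W` warning-level option with an invalid value. One way a setup.py could avoid this is to choose the extra compile arguments per platform. The sketch below is an illustration only: the function name is hypothetical, the `-Wcpp` flag is taken from the log, and `-std=c++11` is an assumption based on the "Small fix install progress" issue, not a copy of the project's actual setup.py.

```python
import sys
from typing import List


def extra_compile_args(platform: str = sys.platform) -> List[str]:
    """Return compiler flags suited to the platform's default compiler."""
    if platform == "win32":
        # MSVC (cl.exe) does not understand GCC/Clang-style flags,
        # so pass no extra arguments on Windows.
        return []
    # Assumed flags for GCC/Clang builds on other platforms.
    return ["-std=c++11", "-Wcpp"]


print(extra_compile_args())
```

The result would be passed as the `extra_compile_args` argument of the setuptools `Extension`. Checking `sys.platform` is a simplification: a Windows build that uses MinGW instead of MSVC would accept the GCC-style flags.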
  • Unable to install eif2.0.2

    Dear Team, I am getting the error below while trying to install eif 2.0.2. Methods tried:

    1. pip install eif
    2. Downloaded eif tar file from pypi.org and tried installing
    3. Downloaded the code from github and tried installing
    4. Edited the setup.py file (removed the extra_compile line), as suggested in one of the issues, and ran the install again

    All of the above methods failed. Below is the error:

    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-wheel-wjzxwp64' --python-tag cp37:
    ERROR: running bdist_wheel
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    cythoning _eif.pyx to _eif.cpp
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

    ERROR: Failed building wheel for eif
    Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
    Running setup.py install for eif ... error
    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile:
    ERROR: running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
    ----------------------------------------
    ERROR: Command "'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\

    Please help.

    opened by botlavijaykumar 1
  • Effect of feature scaling

    Hi, thanks for the great package (and example notebooks!). My issue is summarised in two points:

    • It appears that feature scale influences the orientation of the hyperplane splits in the trees, resulting in a poor anomaly score map.
    • Is this expected behaviour? If so, can anyone offer an explanation of how this comes about? It seems from the paper that the orientation of all hyperplanes is random.

    The following illustrates this further:

    I have noticed that the extended forest shows odd results when applied to features with very different scales. For example, if I draw 2D points from two normal distributions with variances 1 and 1000 and plot contour maps comparing the regular iForest and the extended one, the contours become horizontal and the heat map is generally worse than that of the regular iForest. [image]

    It seems as though the choice of hyperplane gets biased towards horizontal lines. This is also notable in the examples given in the paper (figure 9), where three plots of tree splits are shown: [image] In the first two examples (a and b), the x and y values of the data lie on the same scale and the splits look randomly oriented. However, in c) the x scale of the data is much larger than the y scale, and most splits look more vertical. As a result, we see areas of higher anomaly score above and below the point cloud in the resulting heat map: [image]

    This issue is easily fixed by simply scaling all features before using the forest. However, I was wondering: if the splits are done on hyperplanes of random orientation, why and how does feature scale influence the orientation of splits in each tree?

    Apologies if I am missing something obvious, any insight would be useful, thanks!

    opened by felixcaz 0
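
One possible explanation for the effect described in this issue is geometric: a direction that is uniformly random in the raw feature space is no longer uniformly random once each axis is viewed relative to its own scale. A split line with normal (n_x, n_y) in raw coordinates has normal (n_x s_x, n_y s_y) after dividing each axis by its scale s, so the large-scale feature dominates. The sketch below checks this numerically. It assumes that split normals are drawn with independent standard normal components in the raw space; the function names are hypothetical and nothing here calls eif.

```python
import math
import random
from typing import List, Sequence


def apparent_normal(normal: Sequence[float], scales: Sequence[float]) -> List[float]:
    """Unit normal of the same split line after dividing each axis by its scale."""
    scaled = [n * s for n, s in zip(normal, scales)]
    length = math.hypot(*scaled)
    return [c / length for c in scaled]


def fraction_near_horizontal(scales: Sequence[float], n_samples: int = 10000,
                             max_angle_deg: float = 10.0, seed: int = 0) -> float:
    """Fraction of random 2D split lines within max_angle_deg of horizontal."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        normal = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        nx, ny = apparent_normal(normal, scales)
        # A line is near horizontal when its normal is near the vertical axis.
        angle = math.degrees(math.atan2(abs(nx), abs(ny)))
        if angle <= max_angle_deg:
            count += 1
    return count / n_samples


# Equal scales versus a second feature with variance 1000.
equal = fraction_near_horizontal([1.0, 1.0])
unequal = fraction_near_horizontal([1.0, math.sqrt(1000.0)])
print(equal, unequal)
```

Under these assumptions, far more split lines look close to horizontal when the second feature has the larger scale, which matches the horizontal contours reported above and is consistent with rescaling the features removing the effect.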
Releases(v2.0.2)
Owner
Sahand Hariri