Extended Isolation Forest for Anomaly Detection

Extended Isolation Forest

This is a simple Python implementation of the Extended Isolation Forest method described in this paper (https://doi.org/10.1109/TKDE.2019.2947676). It is an improvement on the original Isolation Forest algorithm, which is described (among other places) in this paper, for detecting anomalies and outliers in multidimensional data point distributions. An R wrapper around the core Python implementation can be found here.

Summary

The problem of anomaly detection has a wide range of applications across scientific and other fields. Anomalous data can have as much scientific value as normal data, or in some cases even more, and it is of vital importance to have robust, fast, and reliable algorithms to detect and flag such anomalies. Here, we present an extension to the model-free anomaly detection algorithm Isolation Forest Liu2008. This extension, named Extended Isolation Forest (EIF), improves the consistency and reliability of the anomaly score produced by standard methods for a given data point. We show that the standard Isolation Forest produces inconsistent anomaly score maps, and that these score maps suffer from an artifact produced by the way the criterion for the branching operation of the binary tree is selected.

Our method allows the data to be sliced using hyperplanes with random slopes, which results in improved score maps. The consistency and reliability of the algorithm are much improved by this extension. Here we show the need for an improvement on the source algorithm to improve the scoring of anomalies and the robustness of the score maps, especially around the edges of nominal data. We discuss the sources of the problem, and we present an efficient way of choosing these hyperplanes, which gives rise to multiple extension levels in the case of higher dimensional data. The standard Isolation Forest is therefore a special case of the Extended Isolation Forest as presented here. For an N dimensional dataset, Extended Isolation Forest has N levels of extension, with 0 being identical to the standard Isolation Forest and N-1 being the fully extended version.

Motivation

Figure 1: Example training data. a) Normally distributed cluster. b) Two normally distributed clusters. c) Sinusoidal data points with Gaussian noise.

While various techniques exist for approaching anomaly detection, Isolation Forest Liu2008 is one with unique capabilities. This algorithm can readily work on high dimensional data, it is model free, and it scales well. It is therefore highly desirable and easy to use. However, looking at score maps for some basic examples, we can see that the anomaly scores produced by the standard Isolation Forest are inconsistent. To see this, we look at the three examples shown in Figure 1.

In each case, we use the data to train our Isolation Forest. We then use the trained models to score a square grid of uniformly distributed data points, which results in score maps shown in Figure 2. Through the simplicity of the example data, we have an intuition about what the score maps should look like. For example, for the data shown in Figure 1a, we expect to see low anomaly scores in the center of the map, while the anomaly score should increase as we move radially away from the center. Similarly for the other figures.

Looking at the score maps produced by the standard Isolation Forest shown in Figure 2, we can clearly see the inconsistencies in the scores. While there is a region of low anomaly score in the center of Figure 2a, there are also regions aligned with the x and y axes, passing through the origin, that have lower anomaly scores than the four corners of the region. Based on our intuitive understanding of the data, this cannot be correct. A similar phenomenon is observed in Figure 2b. In this case, the problem is amplified. Since there are two clusters, the artificially low anomaly score regions intersect close to points (0,0) and (10,10), and create low anomaly score regions where there is no data. It is immediately obvious how this can be problematic. As for the third example, Figure 2c shows that the structure of the data is completely lost. The sinusoidal shape is essentially treated as one rectangular blob.

Figure 2: Score maps using the standard Isolation Forest for the points from Figure 1. We can see the bands and artifacts on these maps.

Isolation Forest

Given a dataset of dimension N, the algorithm chooses a random sub-sample of data to construct a binary tree. The branching process of the tree occurs by selecting a random dimension x_i with i in {1,2,...,N} of the data (a single variable). It then selects a random value v within the minimum and maximum values in that dimension. If a given data point possesses a value smaller than v for dimension x_i, then that point is sent to the left branch, otherwise it is sent to the right branch. In this manner the data on the current node of the tree is split in two. This process of branching is performed recursively over the dataset until a single point is isolated, or a predetermined depth limit is reached. The process begins again with a new random sub-sample to build another randomized tree. After building a large ensemble of trees, i.e. a forest, the training is complete.
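The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code shipped in this package: the function names (build_tree, build_forest, path_length), the dictionary node layout, and the depth limit of ceil(log2(sample_size)) are choices made for this example only.

```python
import numpy as np

def build_tree(X, rng, depth=0, limit=8):
    """Recursively build one isolation tree using axis-parallel cuts."""
    n = len(X)
    # Stop when a single point is isolated or the depth limit is reached.
    if n <= 1 or depth >= limit:
        return {"size": n}
    i = rng.integers(X.shape[1])             # random dimension x_i
    v = rng.uniform(X[:, i].min(), X[:, i].max())  # random value in [min, max]
    left = X[:, i] < v                       # values smaller than v go left
    return {
        "dim": i,
        "value": v,
        "left": build_tree(X[left], rng, depth + 1, limit),
        "right": build_tree(X[~left], rng, depth + 1, limit),
    }

def build_forest(X, ntrees=100, sample_size=256, seed=0):
    """Build an ensemble of trees, each from a fresh random sub-sample."""
    rng = np.random.default_rng(seed)
    size = min(sample_size, len(X))
    limit = int(np.ceil(np.log2(size)))      # assumed depth limit
    forest = []
    for _ in range(ntrees):
        idx = rng.choice(len(X), size=size, replace=False)
        forest.append(build_tree(X[idx], rng, 0, limit))
    return forest

def path_length(x, node, depth=0):
    """Depth reached by point x when it is passed down one tree."""
    if "dim" not in node:                    # leaf node
        return depth
    child = node["left"] if x[node["dim"]] < node["value"] else node["right"]
    return path_length(x, child, depth + 1)
```

Passing a point far from the training cluster through such a forest should give a smaller mean path length than a point at the center of the cluster, which is the property the scoring step relies on.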

During the scoring step, a new candidate data point (or one chosen from the data used to create the trees) is run through all the trees, and an ensemble anomaly score is assigned based on the depth the point reaches in each tree. Figure 3 shows a schematic example of a tree and a forest plotted radially.

Figure 3: a) Shows an example tree formed from the example data, while b) shows the forest generated, where each tree is represented by a radial line from the center to the outer circle. Anomalous points (shown in red) are isolated very quickly, which means they reach shallower depths than nominal points (shown in blue).
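As a rough sketch of how the depths are turned into a score, the snippet below uses the normalization from the original Isolation Forest paper, s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean depth over the trees and c(n) is the average path length of an unsuccessful search in a binary search tree built from n points. The function names c_factor and anomaly_score are hypothetical and are not taken from this package.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Average unsuccessful-search path length in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = np.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(depths, sample_size):
    """Combine per-tree depths into one score in (0, 1]; assumes sample_size > 1."""
    mean_depth = np.mean(depths)
    return 2.0 ** (-mean_depth / c_factor(sample_size))
```

With this normalization, a point whose mean depth equals c(n) scores 0.5, shallower points score closer to 1, and deeper points score closer to 0.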

It turns out the splitting process described above is the main source of the bias observed in the score maps. Figure 4 shows the process described above for each one of the examples considered thus far. The branch cuts are always parallel to the axes, and as a result, over the construction of many trees, regions in the domain that don't contain any data points receive superfluous branch cuts.

Figure 4: Splitting of data in the domain during the process of construction of one tree.

Extension

The Extended Isolation Forest remedies this problem by allowing the branching process to occur in every direction. The process of choosing branch cuts is altered so that at each node, instead of choosing a random feature along with a random value, we choose a random normal vector along with a random intercept point.
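One way to picture the modified branching step is the sketch below. It draws the normal vector from a standard normal distribution, draws the intercept uniformly inside the bounding box of the points at the node, and uses the extension level to decide how many components of the normal vector are forced to zero, so that level 0 reduces to an axis-parallel cut. The names random_hyperplane and split, and the choice of sending points with (x - p)·n < 0 to the left, are assumptions made for this illustration.

```python
import numpy as np

def random_hyperplane(X, extension_level, rng):
    """Draw a random normal vector and intercept point for the data at one node."""
    dim = X.shape[1]
    normal = rng.normal(size=dim)
    # Level 0 keeps one non-zero component (axis-parallel cut);
    # level dim - 1 keeps all components (fully extended).
    n_zero = dim - extension_level - 1
    if n_zero > 0:
        zero_idx = rng.choice(dim, size=n_zero, replace=False)
        normal[zero_idx] = 0.0
    # Intercept drawn uniformly from the bounding box of the node's data.
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
    return normal, intercept

def split(X, normal, intercept):
    """Send points on the negative side of the hyperplane to the left branch."""
    left = (X - intercept) @ normal < 0
    return X[left], X[~left]
```

Replacing the random-feature and random-value choice of the standard tree with these two functions is the only change needed at each node; the recursion and the depth limit stay the same.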

Figure 5 shows the resulting branch cuts in the domain for each of our examples.

Figure 5: Same as Figure 4 but using Extended Isolation Forest

We can see that the region is divided much more uniformly, and without the bias-introducing effects of the coordinate system. As in the case of the standard Isolation Forest, the anomaly score is computed from the aggregated depth that a given point reaches on each iTree.

As we see in Figure 6, these modifications completely fix the issue with the score maps that we saw before and produce reliable results. Clearly, these score maps are a much better representation of anomaly score distributions.

Figure 6: Score maps using the Extended Isolation Forest.

Figure 7 shows a very simple example of anomalies and nominal points from the single-blob example shown in Figure 1a. It also shows the distribution of the anomaly scores, which can be used to make hard cuts on the definition of anomalies or even to assign probabilities to each point.

Figure 7: a) Shows the dataset used. Some sample anomalous data points discovered using the algorithm are highlighted in black. We also highlight some nominal points in red. In b), we have the distribution of anomaly scores obtained by the algorithm.
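A hard cut on the score distribution can be as simple as a quantile threshold. The sketch below assumes that the scores are already available as a NumPy array; the helper name flag_anomalies and the contamination parameter are hypothetical and only illustrate the idea.

```python
import numpy as np

def flag_anomalies(scores, contamination=0.05):
    """Flag points whose score lies above the (1 - contamination) quantile."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores > threshold, threshold
```

The returned boolean mask can be used to index the original data, for example to highlight the flagged points on a scatter plot. The fraction to flag is a modelling choice and depends on the application.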

The Code

Here we provide the source code for the algorithm as well as documented example notebooks to help get started. Various visualizations are provided such as score distributions, score maps, aggregate slicing of the domain, and tree and whole forest visualizations. Most examples are in 2D. We present one 3D example. However, the algorithm works readily with higher dimensional data.

Installation

pip install eif

or directly from the repository

pip install git+https://github.com/sahandha/eif.git

Alternatively, you can install the eif R package from here, which provides an R wrapper around the core Python implementation.

Requirements

  • numpy
  • cython

No extra requirements are needed. The package also contains means to draw the trees it creates, using the igraph library. See the example for tree visualizations.

Use

See these notebooks for examples on how to use it
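For orientation, a minimal scoring call might look like the sketch below. It uses the names that appear in the discussion further down (eif.iForest, compute_paths, ExtensionLevel) and assumes that the constructor accepts ntrees and sample_size keyword arguments and that compute_paths accepts X_in. Treat the signature as an assumption and check the notebooks for the exact interface. The wrappers full_extension_level and score_with_eif are hypothetical helpers, not part of the package.

```python
import numpy as np

def full_extension_level(X):
    """Fully extended case: number of dimensions minus one."""
    return np.asarray(X).shape[1] - 1

def score_with_eif(X, ntrees=200, sample_size=256, extension_level=None):
    """Train a forest on X and return one anomaly score per row of X."""
    import eif as iso  # requires `pip install eif`

    X = np.asarray(X, dtype=float)
    if extension_level is None:
        extension_level = full_extension_level(X)
    # Assumed constructor signature; see the example notebooks.
    forest = iso.iForest(
        X,
        ntrees=ntrees,
        sample_size=min(sample_size, len(X)),
        ExtensionLevel=extension_level,
    )
    return forest.compute_paths(X_in=X)
```

Setting extension_level=0 should correspond to the standard Isolation Forest, which makes it easy to compare the two kinds of score map on the same data.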

Citation

If you use this code and method, please consider using the following reference:

A link to the paper can be found here

@ARTICLE{8888179,
author={S. {Hariri} and M. {Carrasco Kind} and R. J. {Brunner}},
journal={IEEE Transactions on Knowledge and Data Engineering},
title={Extended Isolation Forest},
year={2019},
volume={},
number={},
pages={1-1},
keywords={Forestry;Vegetation;Distributed databases;Anomaly detection;Standards;Clustering algorithms;Heating systems;Anomaly Detection;Isolation Forest},
doi={10.1109/TKDE.2019.2947676},
ISSN={},
month={},}

Releases

v2.0.2

2019-NOV-14

  • Converted the code to C++ using Cython.
  • Much faster and more efficient forest generation and scoring procedures.
  • Previous implementation renamed; use import eif_old to use the old version.

v1.0.2

2018-OCT-01

  • Release
  • Added documentation, examples and software paper

v1.0.1

2018-AUG-08

  • Bugfix for multidimensional data

v1.0.0

2018-JUL-15

  • Initial Release
Comments
  • Error while installing eif

    Hi!

    Trying to install eif through pip I get the following error:

    
    (base) C:\WINDOWS\system32>pip install eif
    Collecting eif
      Using cached https://files.pythonhosted.org/packages/83/b2/d87d869deeb192ab599c899b91a9ad1d3775d04f5b7adcaf7ff6daa54c24/eif-2.0.2.tar.gz
    Requirement already satisfied: numpy in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (1.16.5)
    Requirement already satisfied: cython in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (0.29.13)
    Building wheels for collected packages: eif
      Building wheel for eif (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-wheel-kw_2kpwv' --python-tag cp37
           cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
      Complete output (60 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.7
      copying eif_old.py -> build\lib.win-amd64-3.7
      copying version.py -> build\lib.win-amd64-3.7
      running egg_info
      writing eif.egg-info\PKG-INFO
      writing dependency_links to eif.egg-info\dependency_links.txt
      writing requirements to eif.egg-info\requires.txt
      writing top-level names to eif.egg-info\top_level.txt
      reading manifest file 'eif.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'eif.egg-info\SOURCES.txt'
      running build_ext
      cythoning _eif.pyx to _eif.cpp
      building 'eif' extension
      creating build\temp.win-amd64-3.7
      creating build\temp.win-amd64-3.7\Release
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
      In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                       from eif.hxx:5,
                       from _eif.cpp:614:
      C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
       #error This file requires compiler and library support for the \
        ^
      In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                       from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                       from _eif.cpp:612:
      C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                   ^
      In file included from _eif.cpp:614:0:
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
               void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                             ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
               Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                  ^
      eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
       #define RANDOM_ENGINE std::mt19937_64
                                  ^
      eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
       inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                     ^
      _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                   module_name, class_name, size, basicsize);
                                                           ^
      _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
      _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
      error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
      ----------------------------------------
      ERROR: Failed building wheel for eif
      Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
        Running setup.py install for eif ... error
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile
             cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
        Complete output (60 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.7
        copying eif_old.py -> build\lib.win-amd64-3.7
        copying version.py -> build\lib.win-amd64-3.7
        running egg_info
        writing eif.egg-info\PKG-INFO
        writing dependency_links to eif.egg-info\dependency_links.txt
        writing requirements to eif.egg-info\requires.txt
        writing top-level names to eif.egg-info\top_level.txt
        reading manifest file 'eif.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        writing manifest file 'eif.egg-info\SOURCES.txt'
        running build_ext
        skipping '_eif.cpp' Cython extension (up-to-date)
        building 'eif' extension
        creating build\temp.win-amd64-3.7
        creating build\temp.win-amd64-3.7\Release
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
        In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                         from eif.hxx:5,
                         from _eif.cpp:614:
        C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
         #error This file requires compiler and library support for the \
          ^
        In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                         from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                         from _eif.cpp:612:
        C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                                  "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                     ^
        In file included from _eif.cpp:614:0:
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
                 void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                               ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
                 Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                    ^
        eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
         #define RANDOM_ENGINE std::mt19937_64
                                    ^
        eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
         inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                       ^
        _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                     module_name, class_name, size, basicsize);
                                                             ^
        _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
        _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
        error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.
    
    
    opened by PoradaKev 25
  • Can the extension concept be applied to Gradient Boosted Machines?

    Hi there,

    This might be a dumb question.

    I was curious whether the "extension" concept that you introduce can be applied to supervised methods such as the Gradient Boosted Trees algorithm. There are several widely known implementations, such as XGBoost and LightGBM. All of these GBT implementations also suffer from "box"-like decision boundaries. I believe it would be great to see GBT create decision boundaries the way your Extended Isolation Forest does.

    What do you guys think?

    Feel free to close this issue since it's not a real issue, just a discussion.

    opened by alfian777 5
  • Installation problem

    Hello, I'm trying to install this package, but I'm getting error messages and can't install it. Can you help?

    Windows 10

    (base) C:\Users\quirosgu>pip install eif Collecting eif Using cached eif-2.0.2.tar.gz (1.6 MB) Requirement already satisfied: numpy in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (1.18.5) Requirement already satisfied: cython in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (0.29.21) Building wheels for collected packages: eif Building wheel for eif (setup.py) ... error ERROR: Command errored out with exit status 1: command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\quirosgu\AppData\Local\Temp\pip-wheel-6t9epked' cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
    Complete output (19 lines): running bdist_wheel running build running build_py creating build creating build\lib.win32-3.8 copying eif_old.py -> build\lib.win32-3.8 copying version.py -> build\lib.win32-3.8 running egg_info writing eif.egg-info\PKG-INFO writing dependency_links to eif.egg-info\dependency_links.txt writing requirements to eif.egg-info\requires.txt writing top-level names to eif.egg-info\top_level.txt reading manifest file 'eif.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'eif.egg-info\SOURCES.txt' running build_ext cythoning _eif.pyx to _eif.cpp building 'eif' extension error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

    ERROR: Failed building wheel for eif Running setup.py clean for eif Failed to build eif Installing collected packages: eif Running setup.py install for eif ... error ERROR: Command errored out with exit status 1: command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif' cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
    Complete output (19 lines): running install running build running build_py creating build creating build\lib.win32-3.8 copying eif_old.py -> build\lib.win32-3.8 copying version.py -> build\lib.win32-3.8 running egg_info writing eif.egg-info\PKG-INFO writing dependency_links to eif.egg-info\dependency_links.txt writing requirements to eif.egg-info\requires.txt writing top-level names to eif.egg-info\top_level.txt reading manifest file 'eif.egg-info\SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'eif.egg-info\SOURCES.txt' running build_ext skipping '_eif.cpp' Cython extension (up-to-date) building 'eif' extension error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/ ---------------------------------------- ERROR: Command errored out with exit status 1: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; file='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif' Check the logs for full command output.

    In this case, I have already installed all the required dependencies (MSVC++, etc.), but the problem continues.

    I tried to reproduce it on another Windows machine and it does not work there either; in contrast, on a Linux-based system it does work.

    opened by luigiquiros 3
  • PR for Parallelization and Reduce Memory

    Hello,

    For high dimensional datasets, I'm finding multi-processing parallelization can speed things up a bit. I also find that storing the original data in each Node and each iTree consumes a lot of memory needlessly. Would you be open to reviewing Pull Request(s) that address both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?

    Thanks

    opened by pford221 3
  • Use in novelty detection/one-class classification

    From what I understand, your API doesn't distinguish between constructing the trees and querying to obtain scores (like the fit/predict methods of scikit-learn). Is that correct?

    So it's not currently possible to use this implementation for novelty detection/one-class classification, where the training set is different from the test set?

    opened by oulenz 2
  • Scoring takes too long

    My training and validation data are of similar size (about 1,500,000 rows and 11 features). Model building took very little time, even with full extension. But when scoring the validation data using compute_paths, the function has been running for close to 15 hours and scoring is still not done. Is there some way to speed up the scoring process?

    opened by thedarklord310780 2
  • Add Arxiv paper to readme

    Thanks for providing this code. Please add mention of and a link to your associated Arxiv paper into the repo's readme. The link is https://arxiv.org/abs/1811.02141

    opened by impredicative 2
  • setting ExtensionLevel

    If I understand the paper correctly, we obtain the full EIF approach by setting ExtensionLevel equal to the number of dimensions of the data minus 1, correct?

    opened by oulenz 1
  • Small fix install progress

    One of the extra compile arguments in setup.py seemed to prevent successful installation on multiple systems. Simply removing this argument seems to resolve this with no negative implications. The argument seems to try to force the compiler to run in C++11 mode. I am unsure whether this was even present on the tested systems.

    opened by Dainean 1
  • Update eif.py

    Goal: more convenient usage. Inspired by the tutorial document, I added two functions to iForest, outlier_pred and outlier_index, which return the outlier prediction index and label matrix.

    opened by MaiRajborirug 0
  • How to save the eif Model?

    I am trying to save the model using pickle.dump(), but this is not working. How do I save the eif model? Please provide me with a solution, as I am stuck with this problem. Thank you.

    opened by SanthanaMano 0
  • module 'eif' has no attribute '__version__'

    I installed eif with "pip install eif" (Successfully installed eif-2.0.2), but when I use eif.iForest it raises AttributeError: module 'eif' has no attribute '__version__'.

    opened by wererLinC 0
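When a module does not define `__version__`, calling code can fall back to the installed distribution's metadata. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); `get_version` is a hypothetical helper, and it is demonstrated on stand-in module objects so that it runs without eif installed.

```python
import types
from importlib import metadata

def get_version(module, dist_name):
    """Return module.__version__ if present, else the installed distribution's version."""
    version = getattr(module, "__version__", None)
    if version is not None:
        return version
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Stand-in modules for illustration; with eif installed one would pass the
# imported module and the distribution name "eif" instead.
with_attr = types.ModuleType("with_attr")
with_attr.__version__ = "1.2.3"
without_attr = types.ModuleType("without_attr")

found = get_version(with_attr, "no-such-distribution-xyz")
missing = get_version(without_attr, "no-such-distribution-xyz")
```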
  • I can't install eif 2.0.2, please tell me the reason

    (base) C:\Users\22393\eif-2.0.2\eif-2.0.2>python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    installing library code to build\bdist.win-amd64\egg
    running install_lib
    running build_py
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IE:\ProgramFiles\anaconda\lib\site-packages\numpy\core\include -IE:\ProgramFiles\anaconda\include -IE:\ProgramFiles\anaconda\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.8\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit status 2

    opened by whmwhm123 0
  • Unable to install eif2.0.2

    Dear Team, I am getting the error below while trying to install eif 2.0.2. Methods tried:

    1. pip install eif
    2. Downloaded eif tar file from pypi.org and tried installing
    3. Downloaded the code from github and tried installing
    4. Edited setup.py as suggested in one of the issues (removed the extra_compile line) and ran the installation

    All of the above methods failed. Below is the error:

    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-wheel-wjzxwp64' --python-tag cp37:
    ERROR: running bdist_wheel
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    cythoning _eif.pyx to _eif.cpp
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

    ERROR: Failed building wheel for eif
    Running setup.py clean for eif
    Failed to build eif
    Installing collected packages: eif
    Running setup.py install for eif ... error
    ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile:
    ERROR: running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
    cl : Command line error D8021 : invalid numeric argument '/Wcpp'
    error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
    ----------------------------------------
    ERROR: Command "'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\

    Please help.

    opened by botlavijaykumar 1
  • Effect of feature scaling

    Hi, thanks for the great package (and example notebooks!). My issue is summarised in two points:

    • It appears that feature scale influences the orientation of the hyperplane splits in the trees, resulting in a poor anomaly score map.
    • Is this expected behaviour? If so, can anyone explain how this comes about, given that the paper suggests the orientation of all hyperplanes is random?

    The following illustrates this further:

    I have noticed that the extended forest shows odd results when applied to features with very different scales. For example, if I draw 2D points from two normal distributions with variances 1 and 1000 and plot contour maps comparing the regular iForest and the extended one, the contours become horizontal and the heat map is generally worse than that of the regular iForest. [image]

    It seems as though the choice of hyperplane is biased towards horizontal lines. This is also noticeable in the examples given in the paper (figure 9), where three plots of tree splits are shown: [image] In the first two examples (a and b), the x and y values of the data lie on the same scale and the splits look randomly oriented. However, in (c) the x scale of the data is much larger than the y scale, and most splits look more vertical. As a result, we see areas of higher anomaly score above and below the point cloud in the resulting heat map: [image]

    This issue is easily fixed by scaling all features before using the forest. However, I was wondering: if the splits are made on hyperplanes of random orientation, why and how does feature scale influence the orientation of the splits in each tree?

    Apologies if I am missing something obvious, any insight would be useful, thanks!

    opened by felixcaz 0
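One plausible reading of this observation, offered as an interpretation and not as the maintainers' answer, is geometric. If the split normals are drawn isotropically in the raw feature coordinates, the same splits are no longer isotropic once each axis is viewed relative to its own feature scale. The sketch below checks this numerically in 2D; `apparent_split_angles` and `standardize` are hypothetical helpers, and the assumption that normals come from a standard normal distribution is an assumption about the method as described in the paper.

```python
import numpy as np

def apparent_split_angles(scales, n_lines=10_000, seed=0):
    """Angle (degrees, 0 = horizontal, 90 = vertical) of random split lines
    as they appear when each axis is rescaled by its feature scale."""
    rng = np.random.default_rng(seed)
    normals = rng.normal(size=(n_lines, 2))          # isotropic in raw units
    # A line with normal (n1, n2) runs along the direction (n2, -n1).
    directions = np.column_stack([normals[:, 1], -normals[:, 0]])
    shown = directions / np.asarray(scales, dtype=float)
    return np.degrees(np.arctan2(np.abs(shown[:, 1]), np.abs(shown[:, 0])))

def standardize(X):
    """Scale every feature to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

equal = np.median(apparent_split_angles((1.0, 1.0)))
y_large = np.median(apparent_split_angles((1.0, np.sqrt(1000.0))))
x_large = np.median(apparent_split_angles((np.sqrt(1000.0), 1.0)))

rng = np.random.default_rng(1)
raw = rng.normal(size=(500, 2)) * np.array([1.0, np.sqrt(1000.0)])
scaled = standardize(raw)
```

With equal scales the median apparent angle is about 45 degrees. When the y feature has the much larger scale, the lines look close to horizontal, and when the x feature does, they look close to vertical, which is consistent with both the contour maps and the figure from the paper described above. Standardizing the features beforehand removes the imbalance.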
Releases(v2.0.2)
Owner
Sahand Hariri
Given the names and grades for each student in a class N of students, store them in a nested list and print the name(s) of any student(s) having the second lowest grade.

Hackerank-Nested-List Given the names and grades for each student in a class N of students, store them in a nested list and print the name(s) of any s

Sangeeth Mathew John 2 Dec 14, 2021
Model factory is a ML training platform to help engineers to build ML models at scale

Model Factory Machine learning today is powering many businesses today, e.g., search engine, e-commerce, news or feed recommendation. Training high qu

16 Sep 23, 2022
ML-powered Loan-Marketer Customer Filtering Engine

In Loan-Marketing business employees are required to call the user's to buy loans of several fields and in several magnitudes. If employees are calling everybody in the network it is also very length

Sagnik Roy 13 Jul 02, 2022
It is a forest of random projection trees

rpforest rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given

Lyst 211 Dec 29, 2022
LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022
PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

PROTEIN-EXPRESSION-ANALYSIS-FOR-DOWN-SYNDROME Down syndrome (DS) is a chromosomal disorder where organisms have an extra chromosome 21, sometimes know

1 Jan 20, 2022
TensorFlow implementation of an arbitrary order Factorization Machine

This is a TensorFlow implementation of an arbitrary order (=2) Factorization Machine based on paper Factorization Machines with libFM. It supports: d

Mikhail Trofimov 785 Dec 21, 2022
Deep Survival Machines - Fully Parametric Survival Regression

Package: dsm Python package dsm provides an API to train the Deep Survival Machines and associated models for problems in survival analysis. The under

Carnegie Mellon University Auton Lab 10 Dec 30, 2022
💀mummify: a version control tool for machine learning

mummify is a version control tool for machine learning. It's simple, fast, and designed for model prototyping.

Max Humber 43 Jul 09, 2022
using Machine Learning Algorithm to classification AppleStore application

AppleStore-classification-with-Machine-learning-Algo- using Machine Learning Algorithm to classification AppleStore application. the first step : 1: p

Mohammed Hussien 2 May 02, 2022
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

4.1k Jan 09, 2023
Python package for causal inference using Bayesian structural time-series models.

Python Causal Impact Causal inference using Bayesian structural time-series models. This package aims at defining a python equivalent of the R CausalI

Thomas Cassou 219 Dec 11, 2022
Toolkit for building machine learning models that generalize to unseen domains and are robust to privacy and other attacks.

Toolkit for Building Robust ML models that generalize to unseen domains (RobustDG) Divyat Mahajan, Shruti Tople, Amit Sharma Privacy & Causal Learning

Microsoft 149 Jan 06, 2023
Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps - A Production-focused Open-Source Machine Learning Framework Open MLOps is a set of open-source tools carefully chosen to ease user experi

Data Revenue 590 Dec 28, 2022
Continuously evaluated, functional, incremental, time-series forecasting

timemachines Autonomous, univariate, k-step ahead time-series forecasting functions assigned Elo ratings You can: Use some of the functionality of a s

Peter Cotton 343 Jan 04, 2023
A machine learning web application for binary classification using streamlit

Machine Learning web App This is a machine learning web application for binary classification using streamlit options this application contains 3 clas

abdelhak mokri 1 Dec 20, 2021
A collection of neat and practical data science and machine learning projects

Data Science A collection of neat and practical data science and machine learning projects Explore the docs » Report Bug · Request Feature Table of Co

Will Fong 2 Dec 10, 2021
Azure MLOps (v2) solution accelerators.

Azure MLOps (v2) solution accelerator Welcome to the MLOps (v2) solution accelerator repository! This project is intended to serve as the starting poi

Microsoft Azure 233 Jan 01, 2023