Extended Isolation Forest for Anomaly Detection

Last update: Dec 18, 2022

Overview

Extended Isolation Forest
Installation
- Requirements
Use
Citation
Releases

Extended Isolation Forest

This is a simple Python implementation of the Extended Isolation Forest method described in this paper (https://doi.org/10.1109/TKDE.2019.2947676). It is an improvement on the original Isolation Forest algorithm, which is described (among other places) in this paper, for detecting anomalies and outliers in multidimensional data point distributions. An R wrapper around the core Python implementation can be found here.

Summary

The problem of anomaly detection has a wide range of applications in various fields and scientific applications. Anomalous data can have as much scientific value as normal data, or in some cases even more, and it is of vital importance to have robust, fast and reliable algorithms to detect and flag such anomalies. Here, we present an extension to the model-free anomaly detection algorithm, Isolation Forest Liu2008. This extension, named Extended Isolation Forest (EIF), improves the consistency and reliability of the anomaly score produced by standard methods for a given data point. We show that the standard Isolation Forest produces inconsistent anomaly score maps, and that these score maps suffer from an artifact produced by the way the criterion for the branching operation of the binary tree is selected.

Our method allows the data to be sliced using hyperplanes with random slopes, which results in improved score maps. The consistency and reliability of the algorithm are much improved by this extension. Here we show the need for an improvement on the source algorithm to improve the scoring of anomalies and the robustness of the score maps, especially around the edges of nominal data. We discuss the sources of the problem, and we present an efficient way of choosing these hyperplanes, which gives rise to multiple extension levels in the case of higher dimensional data. The standard Isolation Forest is therefore a special case of the Extended Isolation Forest as presented here. For an N dimensional dataset, Extended Isolation Forest has N levels of extension, with 0 being identical to the case of standard Isolation Forest, and N-1 being the fully extended version.

Motivation

Figure 1: Example training data. a) Normally distributed cluster. b) Two normally distributed clusters. c) Sinusoidal data points with Gaussian noise.

While various techniques exist for approaching anomaly detection, Isolation Forest Liu2008 is one with unique capabilities. This algorithm can readily work on high dimensional data, it is model free, and it scales well. It is therefore highly desirable and easy to use. However, looking at score maps for some basic examples, we can see that the anomaly scores produced by the standard Isolation Forest are inconsistent. To see this, we look at the three examples shown in Figure 1.

In each case, we use the data to train our Isolation Forest. We then use the trained models to score a square grid of uniformly distributed data points, which results in the score maps shown in Figure 2. Through the simplicity of the example data, we have an intuition about what the score maps should look like. For example, for the data shown in Figure 1a, we expect to see low anomaly scores in the center of the map, while the anomaly score should increase as we move radially away from the center. The same reasoning applies to the other two examples.
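As a rough illustration of the grid used for score maps, the sketch below builds an evenly spaced square grid of 2D points. The function name make_grid and the bounds and resolution in the usage note are hypothetical choices for illustration, not values taken from the example notebooks.

```python
def make_grid(lower, upper, steps):
    """Return steps * steps evenly spaced (x, y) points covering [lower, upper]^2."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    spacing = (upper - lower) / (steps - 1)
    axis = [lower + i * spacing for i in range(steps)]
    # Row-major order: x varies fastest, y slowest.
    return [(x, y) for y in axis for x in axis]
```

Each grid point would then be scored by the trained forest, and the scores reshaped into a steps-by-steps image to obtain a score map. For instance, make_grid(-5.0, 5.0, 30) gives 900 points to score.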

Looking at the score maps produced by the standard Isolation Forest shown in Figure 2, we can clearly see the inconsistencies in the scores. While we can clearly see a region of low anomaly score in the center in Figure 2a, we can also see regions aligned with the x and y axes passing through the origin that have lower anomaly scores compared to the four corners of the region. Based on our intuitive understanding of the data, this cannot be correct. A similar phenomenon is observed in Figure 2b. In this case, the problem is amplified. Since there are two clusters, the artificially low anomaly score regions intersect close to points (0,0) and (10,10), and create low anomaly score regions where there is no data. It is immediately obvious how this can be problematic. As for the third example, Figure 2c shows that the structure of the data is completely lost. The sinusoidal shape is essentially treated as one rectangular blob.

Figure 2: Score maps using the standard Isolation Forest for the points from Figure 1. We can see the bands and artifacts on these maps.

Isolation Forest

Given a dataset of dimension N, the algorithm chooses a random sub-sample of data to construct a binary tree. The branching process of the tree occurs by selecting a random dimension x_i with i in {1,2,...,N} of the data (a single variable). It then selects a random value v within the minimum and maximum values in that dimension. If a given data point possesses a value smaller than v for dimension x_i, then that point is sent to the left branch, otherwise it is sent to the right branch. In this manner the data on the current node of the tree is split in two. This process of branching is performed recursively over the dataset until a single point is isolated, or a predetermined depth limit is reached. The process begins again with a new random sub-sample to build another randomized tree. After building a large ensemble of trees, i.e. a forest, the training is complete.
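The branching procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the package's implementation: the dictionary-based node layout, the function names, and the depth limit of ceil(log2(sub-sample size)) are assumptions made for the example.

```python
import math
import random


def build_itree(points, depth=0, depth_limit=8, rng=random):
    """Build one isolation tree from a list of equal-length numeric tuples."""
    # Stop when a single point is isolated or the depth limit is reached.
    if len(points) <= 1 or depth >= depth_limit:
        return {"size": len(points)}

    dim = rng.randrange(len(points[0]))          # random dimension x_i
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        # Simplification: all values coincide in this dimension, so stop here.
        return {"size": len(points)}

    v = rng.uniform(lo, hi)                      # random value between min and max
    left = [p for p in points if p[dim] < v]     # smaller than v: left branch
    right = [p for p in points if p[dim] >= v]   # otherwise: right branch
    return {
        "dim": dim,
        "value": v,
        "left": build_itree(left, depth + 1, depth_limit, rng),
        "right": build_itree(right, depth + 1, depth_limit, rng),
    }


def build_forest(data, n_trees=100, sample_size=256, seed=0):
    """Build an ensemble of trees, each from a fresh random sub-sample."""
    rng = random.Random(seed)
    n = min(sample_size, len(data))
    # Assumed depth limit: ceil(log2(sub-sample size)); any fixed limit would do.
    depth_limit = math.ceil(math.log2(n)) if n > 1 else 0
    return [
        build_itree(rng.sample(data, n), 0, depth_limit, rng)
        for _ in range(n_trees)
    ]
```

Note that every cut here is parallel to a coordinate axis, because each node tests a single dimension against a single value.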

During the scoring step, a new candidate data point (or one chosen from the data used to create the trees) is run through all the trees, and an ensemble anomaly score is assigned based on the depth the point reaches in each tree. Figure 3 shows a schematic example of a tree and a forest plotted radially.
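To make the scoring step concrete, here is a small sketch that turns the per-tree depths of one point into an ensemble score. It uses the normalisation from the original Isolation Forest paper (Liu et al. 2008), s = 2^(-E[h] / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup among n points. This README does not spell the formula out, so treat its use here as an assumption; the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329


def c_factor(n):
    """Average unsuccessful-search path length in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(depths, sample_size):
    """Combine the depths one point reached in each tree into a single score.

    sample_size is the sub-sample size used per tree and must be at least 2.
    """
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / c_factor(sample_size))
```

Under this normalisation, points isolated at shallow depths score close to 1, while a point whose mean depth equals c(n) scores exactly 0.5.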

Figure 3: a) An example tree formed from the example data. b) The forest generated, where each tree is represented by a radial line from the center to the outer circle. Anomalous points (shown in red) are isolated very quickly, which means they reach shallower depths than nominal points (shown in blue).

It turns out the splitting process described above is the main source of the bias observed in the score maps. Figure 4 shows the process described above for each one of the examples considered thus far. The branch cuts are always parallel to the axes, and as a result, over the construction of many trees, regions of the domain that contain no data points receive superfluous branch cuts.

Figure 4: Splitting of data in the domain during the process of construction of one tree.

Extension

The Extended Isolation Forest remedies this problem by allowing the branching process to occur in every direction. The process of choosing branch cuts is altered so that at each node, instead of choosing a random feature along with a random value, we choose a random normal vector along with a random intercept point.
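The altered branching step can be sketched as follows. The details are assumptions based on my reading of the paper rather than on the package source: the normal vector is drawn coordinate-wise from a standard normal distribution, the intercept is drawn uniformly from the bounding box of the points at the node, lower extension levels are obtained by zeroing N - 1 - level coordinates of the normal vector, and a point x goes left when (x - p) . n <= 0. The function names are hypothetical.

```python
import random


def random_cut(points, extension_level, rng=random):
    """Draw a random hyperplane (normal vector, intercept point) for one node."""
    dims = len(points[0])
    if not 0 <= extension_level <= dims - 1:
        raise ValueError("extension_level must be between 0 and N - 1")

    # Random slope: one standard-normal draw per coordinate.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    # Zero out N - 1 - level coordinates; level 0 keeps a single non-zero
    # coordinate, which is an axis-parallel cut as in the standard algorithm.
    for i in rng.sample(range(dims), dims - 1 - extension_level):
        normal[i] = 0.0

    # Random intercept: uniform within the bounding box of the node's points.
    intercept = [
        rng.uniform(min(p[i] for p in points), max(p[i] for p in points))
        for i in range(dims)
    ]
    return normal, intercept


def goes_left(x, normal, intercept):
    """Branch rule: the point goes left when (x - p) . n <= 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal)) <= 0.0
```

With extension_level set to N - 1 no coordinate is zeroed, so the cut can take any orientation; this corresponds to the fully extended case.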

Figure 5 shows the resulting branch cuts in the domain for each of our examples.

Figure 5: Same as Figure 4 but using Extended Isolation Forest

We can see that the region is divided much more uniformly, and without the bias-introducing effects of the coordinate system. As in the case of the standard Isolation Forest, the anomaly score is computed from the aggregated depth that a given point reaches on each iTree.

As we see in Figure 6, these modifications completely fix the issue with the score maps that we saw before and produce reliable results. Clearly, these score maps are a much better representation of anomaly score distributions.

Figure 6: Score maps using the Extended Isolation Forest.

Figure 7 shows a very simple example of anomalies and nominal points from the single-blob example shown in Figure 1a. It also shows the distribution of the anomaly scores, which can be used to make hard cuts on the definition of anomalies or even to assign probabilities to each point.
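One simple way to make a hard cut on the score distribution is to flag a fixed fraction of the highest-scoring points. The sketch below does this in plain Python; the function name and the default fraction of 5% are hypothetical choices, and in practice the cut would be chosen by inspecting the score histogram.

```python
def flag_top_fraction(scores, fraction=0.05):
    """Return (threshold, indices) for the highest-scoring fraction of points."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(fraction * len(scores)))
    # Indices ordered from highest to lowest anomaly score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    threshold = scores[order[k - 1]]   # lowest score that is still flagged
    return threshold, sorted(order[:k])
```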

Figure 7: a) The dataset used. Some sample anomalous data points discovered using the algorithm are highlighted in black, and some nominal points are highlighted in red. b) The distribution of anomaly scores obtained by the algorithm.

The Code

Here we provide the source code for the algorithm as well as documented example notebooks to help get started. Various visualizations are provided such as score distributions, score maps, aggregate slicing of the domain, and tree and whole forest visualizations. Most examples are in 2D. We present one 3D example. However, the algorithm works readily with higher dimensional data.

Installation

pip install eif

or directly from the repository

pip install git+https://github.com/sahandha/eif.git

Alternatively, you can install the eif R package from here, which provides an R wrapper around the core Python implementation.

Requirements

numpy
cython

No other requirements are needed. The package also includes tools for drawing the generated trees using the igraph library. See the example for tree visualizations.

Use

See these notebooks for examples of how to use it.
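For orientation, a minimal usage sketch is given below. The names eif.iForest, ExtensionLevel and compute_paths appear elsewhere on this page; the keyword arguments ntrees, sample_size and X_in are written as I recall them from the package examples and should be checked against the notebooks. The wrapper function score_with_eif is hypothetical, and it assumes that eif and numpy are installed.

```python
def score_with_eif(X, ntrees=200, sample_size=256, extension_level=None):
    """Fit an Extended Isolation Forest on X and return one score per row.

    X is expected to be array-like with shape (n_samples, n_features).
    """
    import numpy as np
    import eif  # installed with: pip install eif

    X = np.ascontiguousarray(X, dtype=np.float64)
    n_samples, n_features = X.shape
    if extension_level is None:
        # Fully extended version: number of dimensions minus one.
        extension_level = n_features - 1

    forest = eif.iForest(
        X,
        ntrees=ntrees,
        sample_size=min(sample_size, n_samples),
        ExtensionLevel=extension_level,
    )
    # Higher scores indicate more anomalous points.
    return forest.compute_paths(X_in=X)
```

Passing extension_level=0 should reproduce the behaviour of the standard Isolation Forest, which makes it easy to compare the two on the same data.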

Citation

If you use this code and method, please consider using the following reference:

A link to the paper can be found here

@ARTICLE{8888179,
author={S. {Hariri} and M. {Carrasco Kind} and R. J. {Brunner}},
journal={IEEE Transactions on Knowledge and Data Engineering},
title={Extended Isolation Forest},
year={2019},
volume={},
number={},
pages={1-1},
keywords={Forestry;Vegetation;Distributed databases;Anomaly detection;Standards;Clustering algorithms;Heating systems;Anomaly Detection;Isolation Forest},
doi={10.1109/TKDE.2019.2947676},
ISSN={},
month={},}

Releases

v2.0.2

2019-NOV-14

Converted the code to C++ using Cython.
Much faster and more efficient forest generation and scoring procedures.
Previous implementation renamed; use import eif_old to use the old version.

v1.0.2

2018-OCT-01

Release
Added documentation, examples and software paper

v1.0.1

2018-AUG-08

Bugfix for multidimensional data

v1.0.0

2018-JUL-15

Initial Release

Comments

Error while installing eif

Hi!

Trying to install eif through pip I get the following error:


(base) C:\WINDOWS\system32>pip install eif
Collecting eif
  Using cached https://files.pythonhosted.org/packages/83/b2/d87d869deeb192ab599c899b91a9ad1d3775d04f5b7adcaf7ff6daa54c24/eif-2.0.2.tar.gz
Requirement already satisfied: numpy in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (1.16.5)
Requirement already satisfied: cython in c:\users\o.korshun\appdata\local\continuum\anaconda3\lib\site-packages (from eif) (0.29.13)
Building wheels for collected packages: eif
  Building wheel for eif (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-wheel-kw_2kpwv' --python-tag cp37
       cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
  Complete output (60 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.7
  copying eif_old.py -> build\lib.win-amd64-3.7
  copying version.py -> build\lib.win-amd64-3.7
  running egg_info
  writing eif.egg-info\PKG-INFO
  writing dependency_links to eif.egg-info\dependency_links.txt
  writing requirements to eif.egg-info\requires.txt
  writing top-level names to eif.egg-info\top_level.txt
  reading manifest file 'eif.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'eif.egg-info\SOURCES.txt'
  running build_ext
  cythoning _eif.pyx to _eif.cpp
  building 'eif' extension
  creating build\temp.win-amd64-3.7
  creating build\temp.win-amd64-3.7\Release
  C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
  In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                   from eif.hxx:5,
                   from _eif.cpp:614:
  C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
   #error This file requires compiler and library support for the \
    ^
  In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                   from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                   from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                   from _eif.cpp:612:
  C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                            "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                               ^
  In file included from _eif.cpp:614:0:
  eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
   #define RANDOM_ENGINE std::mt19937_64
                              ^
  eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
           void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                         ^
  eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
   #define RANDOM_ENGINE std::mt19937_64
                              ^
  eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
           Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                              ^
  eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
   #define RANDOM_ENGINE std::mt19937_64
                              ^
  eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
   inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                 ^
  _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
  _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
               module_name, class_name, size, basicsize);
                                                       ^
  _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
  _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
  error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for eif
  Running setup.py clean for eif
Failed to build eif
Installing collected packages: eif
    Running setup.py install for eif ... error
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile
         cwd: C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-install-1adywqes\eif\
    Complete output (60 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    copying eif_old.py -> build\lib.win-amd64-3.7
    copying version.py -> build\lib.win-amd64-3.7
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    creating build\temp.win-amd64-3.7
    creating build\temp.win-amd64-3.7\Release
    C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -IC:\Users\o.korshun\AppData\Local\Continuum\anaconda3\include -c _eif.cpp -o build\temp.win-amd64-3.7\Release\_eif.o -Wcpp
    In file included from C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/random:35:0,
                     from eif.hxx:5,
                     from _eif.cpp:614:
    C:/Users/o.korshun/AppData/Local/Continuum/anaconda3/Library/mingw-w64/include/c++/5.3.0/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support is currently experimental, and must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
     #error This file requires compiler and library support for the \
      ^
    In file included from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarraytypes.h:1822:0,
                     from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/ndarrayobject.h:12,
                     from C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/arrayobject.h:4,
                     from _eif.cpp:612:
    C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h:15:77: note: #pragma message: C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\include/numpy/npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
                              "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION")
                                                                                 ^
    In file included from _eif.cpp:614:0:
    eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
     #define RANDOM_ENGINE std::mt19937_64
                                ^
    eif.hxx:65:55: note: in expansion of macro 'RANDOM_ENGINE'
             void build_tree (double*, int, int, int, int, RANDOM_ENGINE&, int);
                                                           ^
    eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
     #define RANDOM_ENGINE std::mt19937_64
                                ^
    eif.hxx:66:44: note: in expansion of macro 'RANDOM_ENGINE'
             Node* add_node (double*, int, int, RANDOM_ENGINE&);
                                                ^
    eif.hxx:11:28: error: 'std::mt19937_64' has not been declared
     #define RANDOM_ENGINE std::mt19937_64
                                ^
    eif.hxx:132:63: note: in expansion of macro 'RANDOM_ENGINE'
     inline std::vector<int> sample_without_replacement (int, int, RANDOM_ENGINE&);
                                                                   ^
    _eif.cpp: In function 'PyTypeObject* __Pyx_ImportType(PyObject*, const char*, const char*, size_t, __Pyx_ImportType_CheckSize)':
    _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
                 module_name, class_name, size, basicsize);
                                                         ^
    _eif.cpp:8085:53: warning: unknown conversion type character 'z' in format [-Wformat=]
    _eif.cpp:8085:53: warning: too many arguments for format [-Wformat-extra-args]
    error: command 'C:\\Users\\o.korshun\\AppData\\Local\\Continuum\\anaconda3\\Library\\mingw-w64\\bin\\gcc.exe' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\o.korshun\AppData\Local\Continuum\anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"'; __file__='"'"'C:\\Users\\O0ADF~1.KOR\\AppData\\Local\\Temp\\pip-install-1adywqes\\eif\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\O0ADF~1.KOR\AppData\Local\Temp\pip-record-yqa9lmac\install-record.txt' --single-version-externally-managed --compile Check the logs for full command output.

opened by PoradaKev 25

Can the extension concept be applied to Gradient Boosted Machines?

Hi there,

This might be a naive question.

I was curious whether the "extension" concept that you introduce can be applied to supervised methods such as the Gradient Boosted Trees algorithm. There are several widely known implementations, such as XGBoost and LightGBM. All of these GBTs also suffer from "box"-like decision boundaries. I believe it would be great to see GBTs create decision boundaries the way your Extended Isolation Forest does.

What do you guys think?

Feel free to close this issue since it's not a real issue, just a discussion.

opened by alfian777 5

Installation problem

Hello, I'm trying to install this package, but I get error messages and cannot install it. Can you help?

Windows 10

(base) C:\Users\quirosgu>pip install eif
Collecting eif
  Using cached eif-2.0.2.tar.gz (1.6 MB)
Requirement already satisfied: numpy in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (1.18.5)
Requirement already satisfied: cython in c:\users\quirosgu\anaconda3\lib\site-packages (from eif) (0.29.21)
Building wheels for collected packages: eif
  Building wheel for eif (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; __file__='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\quirosgu\AppData\Local\Temp\pip-wheel-6t9epked'
       cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
  Complete output (19 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win32-3.8
  copying eif_old.py -> build\lib.win32-3.8
  copying version.py -> build\lib.win32-3.8
  running egg_info
  writing eif.egg-info\PKG-INFO
  writing dependency_links to eif.egg-info\dependency_links.txt
  writing requirements to eif.egg-info\requires.txt
  writing top-level names to eif.egg-info\top_level.txt
  reading manifest file 'eif.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  writing manifest file 'eif.egg-info\SOURCES.txt'
  running build_ext
  cythoning _eif.pyx to _eif.cpp
  building 'eif' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  ----------------------------------------
  ERROR: Failed building wheel for eif
  Running setup.py clean for eif
Failed to build eif
Installing collected packages: eif
    Running setup.py install for eif ... error
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; __file__='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif'
         cwd: C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif
    Complete output (19 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.8
    copying eif_old.py -> build\lib.win32-3.8
    copying version.py -> build\lib.win32-3.8
    running egg_info
    writing eif.egg-info\PKG-INFO
    writing dependency_links to eif.egg-info\dependency_links.txt
    writing requirements to eif.egg-info\requires.txt
    writing top-level names to eif.egg-info\top_level.txt
    reading manifest file 'eif.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'eif.egg-info\SOURCES.txt'
    running build_ext
    skipping '_eif.cpp' Cython extension (up-to-date)
    building 'eif' extension
    error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'C:\Users\quirosgu\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"'; __file__='"'"'C:\Users\quirosgu\AppData\Local\Temp\pip-install-wz5r6gph\eif\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\quirosgu\AppData\Local\Temp\pip-record-fjpa9g_k\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\quirosgu\Anaconda3\Include\eif' Check the logs for full command output.

In this case, I had already installed all the required dependencies (Microsoft Visual C++, etc.), but the problem persists.

I tried to reproduce it on another Windows machine and the installation fails there too; on a Linux-based system, by contrast, it works.

opened by luigiquiros 3

PR for Parallelization and Reduce Memory

Hello,

For high dimensional datasets, I'm finding that multi-processing parallelization can speed things up a bit. I also find that storing the original data in each Node and each iTree consumes a lot of needless memory. Would you be open to reviewing a Pull Request (or several) that addresses both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?

Thanks

opened by pford221 3

Use in novelty detection/one-class classification

From what I understand, your API doesn't distinguish between constructing the trees and querying to obtain scores (like the fit/predict methods of scikit-learn), is that correct?

So it's not currently possible to use this implementation for novelty detection/one-class classification, where the training set is different from the test set?

opened by oulenz 2

Scoring takes too long

My training and validation data are of similar size (about 1,500,000 rows and 11 features). Model building took very little time, even with full extension. But when scoring the validation data using compute_paths, the function has been running for close to 15 hours and scoring is still not done. Is there some way to speed up the scoring process?

opened by thedarklord310780 2

Add arXiv paper to README

Thanks for providing this code. Please add a mention of, and a link to, your associated arXiv paper in the repo's README. The link is https://arxiv.org/abs/1811.02141

opened by impredicative 2

Setting ExtensionLevel

If I understand the paper correctly, we obtain the full EIF approach by setting ExtensionLevel equal to the number of dimensions of the data minus 1, correct?

opened by oulenz 1

Small fix install progress

One of the extra compile arguments in setup.py seemed to prevent successful installation on multiple systems. Simply removing this argument seems to resolve this with no negative implications. The argument seems to try to force the compiler to run in C++11 mode. I am unsure whether this was even present on the tested systems.

opened by Dainean 1

Update eif.py

Goal: more convenient usage. Inspired by the tutorial document, I added two functions, outlier_pred and outlier_index, to iForest, which return the outlier prediction index and label matrix.

opened by MaiRajborirug 0

How to save the eif Model?

I am trying to save the model using pickle.dump(), but this is not working. How do I save the eif model? Please provide me with a solution, as I am stuck on this problem. Thank you.

opened by SanthanaMano 0

module 'eif' has no attribute '__version__'

I installed eif with "pip install eif" and got "Successfully installed eif-2.0.2", but when I use eif.iForest I get AttributeError: module 'eif' has no attribute '__version__'

opened by wererLinC 0

I can't install eif 2.0.2, please tell me the reason

(base) C:\Users\22393\eif-2.0.2\eif-2.0.2>python setup.py install
running install
running bdist_egg
running egg_info
writing eif.egg-info\PKG-INFO
writing dependency_links to eif.egg-info\dependency_links.txt
writing requirements to eif.egg-info\requires.txt
writing top-level names to eif.egg-info\top_level.txt
reading manifest file 'eif.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'eif.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
skipping '_eif.cpp' Cython extension (up-to-date)
building 'eif' extension
C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IE:\ProgramFiles\anaconda\lib\site-packages\numpy\core\include -IE:\ProgramFiles\anaconda\include -IE:\ProgramFiles\anaconda\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.8\Release\_eif.obj -Wcpp
cl : Command line error D8021 : invalid numeric argument '/Wcpp'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe' failed with exit status 2

opened by whmwhm123 0
Unable to install eif 2.0.2
Dear Team, I am getting the error below while trying to install eif 2.0.2. Methods tried:

pip install eif

Downloaded the eif tar file from pypi.org and tried installing it

Downloaded the code from GitHub and tried installing it

Edited setup.py as suggested in one of the issues (removed the extra_compile line) and ran the install again

All of the above methods failed. Below is the error:

ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-wheel-wjzxwp64' --python-tag cp37:
ERROR: running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
copying eif_old.py -> build\lib.win-amd64-3.7
copying version.py -> build\lib.win-amd64-3.7
running egg_info
writing eif.egg-info\PKG-INFO
writing dependency_links to eif.egg-info\dependency_links.txt
writing requirements to eif.egg-info\requires.txt
writing top-level names to eif.egg-info\top_level.txt
reading manifest file 'eif.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'eif.egg-info\SOURCES.txt'
running build_ext
cythoning _eif.pyx to _eif.cpp
building 'eif' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
cl : Command line error D8021 : invalid numeric argument '/Wcpp'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2

ERROR: Failed building wheel for eif
Running setup.py clean for eif
Failed to build eif
Installing collected packages: eif
Running setup.py install for eif ... error
ERROR: Complete output from command 'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile:
ERROR: running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
copying eif_old.py -> build\lib.win-amd64-3.7
copying version.py -> build\lib.win-amd64-3.7
running egg_info
writing eif.egg-info\PKG-INFO
writing dependency_links to eif.egg-info\dependency_links.txt
writing requirements to eif.egg-info\requires.txt
writing top-level names to eif.egg-info\top_level.txt
reading manifest file 'eif.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'eif.egg-info\SOURCES.txt'
running build_ext
skipping '_eif.cpp' Cython extension (up-to-date)
building 'eif' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Anaconda3\lib\site-packages\numpy\core\include -IC:\Anaconda3\include -IC:\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\8.1\include\shared" "-IC:\Program Files (x86)\Windows Kits\8.1\include\um" "-IC:\Program Files (x86)\Windows Kits\8.1\include\winrt" /EHsc /Tp_eif.cpp /Fobuild\temp.win-amd64-3.7\Release_eif.obj -Wcpp
cl : Command line error D8021 : invalid numeric argument '/Wcpp'
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe' failed with exit status 2
----------------------------------------
ERROR: Command "'C:\Anaconda3\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\XSVIJA~1\AppData\Local\Temp\pip-record-f8wv7_fl\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\XSVIJA~1\AppData\Local\Temp\pip-req-build-rqacf45o\

Please help.
opened by botlavijaykumar 1
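
Both logs above fail at the same point: the flag -Wcpp is handed to Microsoft's cl.exe, which reads it as a /W warning-level option and rejects the non-numeric value (error D8021). One possible way to handle this in a build script is to choose compiler flags per platform. The following is a minimal sketch, assuming the flag comes from an extra_compile_args entry in setup.py as the issue suggests; the helper name extra_compile_args_for is hypothetical and this is not the project's actual setup.py.

```python
import sys


def extra_compile_args_for(platform=sys.platform):
    """Pick compiler flags by platform (hypothetical helper, illustration only)."""
    if platform.startswith("win"):
        # The Windows builds in the logs above use cl.exe, which fails on
        # -Wcpp with "invalid numeric argument '/Wcpp'", so pass no extra flags.
        return []
    # -Wcpp is the flag visible in the logs; it is a GCC-style warning option.
    return ["-Wcpp"]


print(extra_compile_args_for("win32"))
print(extra_compile_args_for("linux"))
```

The returned list could then be passed as the extra_compile_args argument of a setuptools Extension. Note that the second log reports "skipping '_eif.cpp' Cython extension (up-to-date)" while still passing -Wcpp, so it may be worth confirming that the edited setup.py is the one actually being built.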
Effect of feature scaling
Hi, thanks for the great package (and example notebooks!). My issue is summarised in two points:

It appears that feature scale influences the orientation of the hyperplane splits in the trees, resulting in a poor anomaly score map.

Is this expected behaviour? If so, can anyone offer an explanation as to how this comes about? From the paper it seems that the orientation of all hyperplanes is random.

The following illustrates this further:

I have noticed that the extended forest shows odd results when applied to features with very different scales. For example, if I draw 2D points from two normal distributions with variances 1 and 1000 and plot contour maps comparing the regular iForest with the extended one, the contours become horizontal and the heat map in general is worse than that of the regular iForest.

It seems as though the choice of hyperplane gets biased towards horizontal lines. This is also noticeable in the examples given in the paper (figure 9), where three plots of tree splits are shown. In the first two examples (a and b) the x and y values of the data lie on the same scale and the splits look randomly oriented. However, in (c) the x scale of the data is much larger than the y scale, and most splits look more vertical. As a result we see areas of higher anomaly score above and below the point cloud in the resulting heat map.

This issue is easily fixed by simply scaling all features before using the forest. However, I was wondering: if the splits are done on a hyperplane of random orientation, why and how does feature scale influence the orientation of the splits in each tree?

Apologies if I am missing something obvious. Any insight would be useful, thanks!
opened by felixcaz 0
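
One possible explanation, offered here as a sketch rather than a confirmed answer from the authors: a split direction can be uniformly random in the raw feature space and still look strongly aligned once the data is viewed at its own scale. A hyperplane n·x = c, rewritten in rescaled coordinates z = x / scale, has normal n * scale, so the normal tilts toward the large-scale feature and the split behaves almost like a threshold on that feature alone. The code below assumes the normal components are drawn independently from a standard normal distribution (an assumption about the split rule); the function split_angles is hypothetical.

```python
import numpy as np


def split_angles(scales, n_splits=20_000, seed=0):
    """Angle in degrees between each split normal and the first feature axis,
    measured in coordinates where every feature is divided by its scale.

    Hypothetical helper; assumes normals are drawn i.i.d. from N(0, 1).
    """
    rng = np.random.default_rng(seed)
    scales = np.asarray(scales, dtype=float)
    normals = rng.standard_normal((n_splits, scales.size))
    # n . x = c with x = scales * z becomes (n * scales) . z = c
    scaled = normals * scales
    cos = np.abs(scaled[:, 0]) / np.linalg.norm(scaled, axis=1)
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))


# Equal scales: angles are spread evenly between 0 and 90 degrees.
equal = split_angles([1.0, 1.0])
# Variances 1000 and 1 (standard deviations ~31.6 and 1), as in the issue.
skewed = split_angles([np.sqrt(1000.0), 1.0])

print("median angle, equal scales: ", np.median(equal))
print("median angle, skewed scales:", np.median(skewed))
```

With equal scales the median angle sits near 45 degrees; with the skewed scales it collapses to a few degrees, meaning nearly every split is perpendicular to the large-variance axis. Under this reading, scaling the features first restores the isotropy the splits were meant to have, which matches the fix described in the issue.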

Releases (v2.0.2)

v2.0.2 (Nov 14, 2019)

C++ implementation, much faster, same results

v1.0.2 (Oct 1, 2018)

v1.0.1 (Aug 8, 2018)

v1.0.0 (Jul 15, 2018)

Extended Isolation Forest

Initial Release

Owner

Sahand Hariri

GitHub Repository

CobraML: Completely Customizable A python ML library designed to give the end user full control

CobraML: Completely Customizable What is it? CobraML is a python library built on both numpy and numba. Unlike other ML libraries CobraML gives the us

14 Dec 19, 2021

A data preprocessing package for time series data. Design for machine learning and deep learning.

152 Jan 07, 2023

Real-time stream processing for python

Streamz Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelin

1.1k Dec 28, 2022

High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

3k Jan 08, 2023

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Spark Python Notebooks This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, fro

1.5k Jan 02, 2023

This jupyter notebook project was completed by me and my friend using the dataset from Kaggle

ARM This jupyter notebook project was completed by me and my friend using the dataset from Kaggle. The world Happiness 2017, which ranks 155 countries

1 Jan 23, 2022

Formulae is a Python library that implements Wilkinson's formulas for mixed-effects models.

formulae formulae is a Python library that implements Wilkinson's formulas for mixed-effects models. The main difference with other implementations li

34 Dec 21, 2022

A collection of neat and practical data science and machine learning projects

Data Science A collection of neat and practical data science and machine learning projects Explore the docs » Report Bug · Request Feature Table of Co

2 Dec 10, 2021

A machine learning model for Covid case prediction

CovidcasePrediction A machine learning model for Covid case prediction Problem Statement Using regression algorithms we can able to track the active c

1 Feb 02, 2022

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

Intel(R) Extension for Scikit-learn* Installation | Documentation | Examples | Support | FAQ With Intel(R) Extension for Scikit-learn you can accelera

858 Dec 25, 2022

A collection of machine learning examples and tutorials.

machine_learning_examples A collection of machine learning examples and tutorials.

7.1k Jan 01, 2023

Forecast dynamically at scale with this unique package. pip install scalecast

🌄 Scalecast: Dynamic Forecasting at Scale About This package uses a scaleable forecasting approach in Python with common scikit-learn and statsmodels

158 Jan 03, 2023

A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al.

pyUpSet A pure-python implementation of the UpSet suite of visualisation methods by Lex, Gehlenborg et al. Contents Purpose How to install How it work

288 Jan 04, 2023

Simulate & classify transient absorption spectroscopy (TAS) spectral features for bulk semiconducting materials (Post-DFT)

PyTASER PyTASER is a Python (3.9+) library and set of command-line tools for classifying spectral features in bulk materials, post-DFT. The goal of th

4 Dec 27, 2022

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

366 Jan 03, 2023

Decision tree is the most powerful and popular tool for classification and prediction

Diabetes Prediction Using Decision Tree Introduction Decision tree is the most powerful and popular tool for classification and prediction. A Decision

1 Jan 23, 2022

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

8.9k Jan 09, 2023

This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

Crypto-Currency-Predictor This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you

6 Dec 04, 2022

Machine Learning for RC Cars

Suiron Machine Learning for RC Cars Prediction visualization (green = actual, blue = prediction) Click the video below to see it in action! Dependenci

706 Jan 02, 2023

LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022
