(Py)TOD: GPU-accelerated Outlier Detection via Tensor Operations

Deployment & Documentation & Stats & License

Background: Outlier detection (OD) is a key data mining task for identifying abnormal objects from general samples with numerous high-stake applications including fraud detection and intrusion detection.

We propose TOD, a system for efficient and scalable outlier detection (OD) on distributed multi-GPU machines. A key idea behind TOD is decomposing OD applications into basic tensor algebra operations for GPU acceleration.

Citing TOD: Check out the design paper. If you use TOD in a scientific publication, we would appreciate citations to the following paper:

@article{zhao2021tod,
  title={TOD: GPU-accelerated Outlier Detection via Tensor Operations},
  author={Zhao, Yue and Chen, George H and Jia, Zhihao},
  journal={arXiv preprint arXiv:2110.14007},
  year={2021}
}

or:

Zhao, Y., Chen, G.H. and Jia, Z., 2021. TOD: GPU-accelerated Outlier Detection via Tensor Operations. arXiv preprint arXiv:2110.14007.

One Reason to Use It:

On average, TOD is 11 times faster than PyOD on a diverse group of OD algorithms!

If you need another reason: it can handle much larger datasets---more than a million sample OD within an hour!

GPU-accelerated Outlier Detection with 5 Lines of Code:

# train the COPOD detector
from pytod.models.knn import KNN
clf = KNN() # default GPU device is used
clf.fit(X_train)

# get outlier scores
y_train_scores = clf.decision_scores_  # raw outlier scores on the train data
y_test_scores = clf.decision_function(X_test)  # predict raw outlier scores on test

TOD is featured for:

Unified APIs, detailed documentation, and examples for the easy use (under construction)
More than 5 different OD algorithms and more are being added
The support of multi-GPU acceleration
Advanced techniques including provable quantization and automatic batching

Table of Contents:

Installation
Implemented Algorithms
A Motivating Example PyOD vs. PyTOD
Paper Reproducibility
Programming Model Interface
End-to-end Performance Comparison with PyOD

Installation

It is recommended to use pip for installation. Please make sure the latest version is installed, as PyTOD is updated frequently:

pip install pytod            # normal install
pip install --upgrade pytod  # or update if needed

Alternatively, you could clone and run setup.py file:

git clone https://github.com/yzhao062/pytod.git
cd pyod
pip install .

Required Dependencies:

Python 3.6+
mpmath
numpy>=1.13
torch>=1.7 (it is safer if you install by yourself)
scipy>=0.19.1
scikit_learn>=0.21
pyod>=1.0.4 (for comparison)

Implemented Algorithms

PyTOD toolkit consists of three major functional groups (to be cleaned up):

(i) Individual Detection Algorithms :

Type	Abbr	Algorithm	Year	Ref
Linear Model	PCA	Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes)	2003	¹
Proximity-Based	LOF	Local Outlier Factor	2000	²
Proximity-Based	COF	Connectivity-Based Outlier Factor	2002	³
Proximity-Based	HBOS	Histogram-based Outlier Score	2012	⁴
Proximity-Based	kNN	k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)	2000	⁵
Proximity-Based	AvgKNN	Average kNN (use the average distance to k nearest neighbors as the outlier score)	2002	⁶
Proximity-Based	MedKNN	Median kNN (use the median distance to k nearest neighbors as the outlier score)	2002	⁷
Probabilistic	ABOD	Angle-Based Outlier Detection	2008	⁸
Probabilistic	COPOD	COPOD: Copula-Based Outlier Detection	2020	⁹
Probabilistic	FastABOD	Fast Angle-Based Outlier Detection using approximation	2008	¹⁰

Code is being released. Watch and star for the latest news!

A Motivating Example PyOD vs. PyTOD!

kNN example shows that how fast and how easy PyTOD is. Take the famous kNN outlier detection as an example:

Initialize a kNN detector, fit the model, and make the prediction.

from pytod.models.knn import KNN   # kNN detector

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# if GPU is not available, use CPU instead
clf = KNN(device='cpu')
clf.fit(X_train)

Get the prediction results

# get the prediction label and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

On a simple laptop, let us see how fast it is in comparison to PyOD for 30,000 samples with 20 features

KNN-PyOD ROC:1.0, precision @ rank n:1.0
Execution time 11.26 seconds

KNN-PyTOD-GPU ROC:1.0, precision @ rank n:1.0
Execution time 2.82 seconds

KNN-PyTOD-CPU ROC:1.0, precision @ rank n:1.0
Execution time 3.36 seconds

It is easy to see, PyTOD shows both better efficiency than PyOD.

Paper Reproducibility

Datasets: OD benchmark datasets are available at datasets folder.

Scripts for reproducibility is available in reproducibility folder.

Cleanup is on the way!

Programming Model Interface

Complex OD algorithms can be abstracted into common tensor operators.

For instance, ABOD and COPOD can be assembled by the basic tensor operators.

End-to-end Performance Comparison with PyOD

Overall, it is much (on avg. 11 times) faster than PyOD takes way less run time.

Reference

Shyu, M.L., Chen, S.C., Sarinnapakorn, K. and Chang, L., 2003. A novel anomaly detection scheme based on principal component classifier. MIAMI UNIV CORAL GABLES FL DEPT OF ELECTRICAL AND COMPUTER ENGINEERING.↩
Breunig, M.M., Kriegel, H.P., Ng, R.T. and Sander, J., 2000, May. LOF: identifying density-based local outliers. ACM Sigmod Record, 29(2), pp. 93-104.↩
Tang, J., Chen, Z., Fu, A.W.C. and Cheung, D.W., 2002, May. Enhancing effectiveness of outlier detections for low density patterns. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 535-548. Springer, Berlin, Heidelberg.↩
Goldstein, M. and Dengel, A., 2012. Histogram-based outlier score (hbos): A fast unsupervised anomaly detection algorithm. In KI-2012: Poster and Demo Track, pp.59-63.↩
Ramaswamy, S., Rastogi, R. and Shim, K., 2000, May. Efficient algorithms for mining outliers from large data sets. ACM Sigmod Record, 29(2), pp. 427-438.↩
Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.↩
Angiulli, F. and Pizzuti, C., 2002, August. Fast outlier detection in high dimensional spaces. In European Conference on Principles of Data Mining and Knowledge Discovery pp. 15-27.↩
Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.↩
Li, Z., Zhao, Y., Botta, N., Ionescu, C. and Hu, X. COPOD: Copula-Based Outlier Detection. IEEE International Conference on Data Mining (ICDM), 2020.↩
Kriegel, H.P. and Zimek, A., 2008, August. Angle-based outlier detection in high-dimensional data. In KDD '08, pp. 444-452. ACM.↩

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.github/workflows		.github/workflows
examples		examples
figs		figs
pytod		pytod
reproducibility		reproducibility
.gitignore		.gitignore
CHANGES.txt		CHANGES.txt
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.rst		README.rst
__init__.py		__init__.py
pypi_build_commands.txt		pypi_build_commands.txt
requirements.txt		requirements.txt
setup.py		setup.py

License

yzhao062/pytod

Folders and files

Latest commit

History

Repository files navigation

(Py)TOD: GPU-accelerated Outlier Detection via Tensor Operations

One Reason to Use It:

Installation

Implemented Algorithms

A Motivating Example PyOD vs. PyTOD!

Paper Reproducibility

Programming Model Interface

End-to-end Performance Comparison with PyOD

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

Languages