A PyTorch implementation of Deep SAD, a deep Semi-supervised Anomaly Detection method.

Overview

Deep SAD: A Method for Deep Semi-Supervised Anomaly Detection

This repository provides a PyTorch implementation of the Deep SAD method presented in our ICLR 2020 paper ”Deep Semi-Supervised Anomaly Detection”.

Citation and Contact

You find a PDF of the Deep Semi-Supervised Anomaly Detection ICLR 2020 paper on arXiv https://arxiv.org/abs/1906.02694.

If you find our work useful, please also cite the paper:

@InProceedings{ruff2020deep,
  title     = {Deep Semi-Supervised Anomaly Detection},
  author    = {Ruff, Lukas and Vandermeulen, Robert A. and G{\"o}rnitz, Nico and Binder, Alexander and M{\"u}ller, Emmanuel and M{\"u}ller, Klaus-Robert and Kloft, Marius},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://openreview.net/forum?id=HkgH0TEYwH}
}

If you would like get in touch, just drop us an email to [email protected].

Abstract

Deep approaches to anomaly detection have recently shown promising results over shallow methods on large and complex datasets. Typically anomaly detection is treated as an unsupervised learning problem. In practice however, one may have---in addition to a large set of unlabeled samples---access to a small pool of labeled samples, e.g. a subset verified by some domain expert as being normal or anomalous. Semi-supervised approaches to anomaly detection aim to utilize such labeled samples, but most proposed methods are limited to merely including labeled normal samples. Only a few methods take advantage of labeled anomalies, with existing deep approaches being domain-specific. In this work we present Deep SAD, an end-to-end deep methodology for general semi-supervised anomaly detection. We further introduce an information-theoretic framework for deep anomaly detection based on the idea that the entropy of the latent distribution for normal data should be lower than the entropy of the anomalous distribution, which can serve as a theoretical interpretation for our method. In extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10, along with other anomaly detection benchmark datasets, we demonstrate that our method is on par or outperforms shallow, hybrid, and deep competitors, yielding appreciable performance improvements even when provided with only little labeled data.

The need for semi-supervised anomaly detection

fig1

Installation

This code is written in Python 3.7 and requires the packages listed in requirements.txt.

Clone the repository to your machine and directory of choice:

git clone https://github.com/lukasruff/Deep-SAD-PyTorch.git

To run the code, we recommend setting up a virtual environment, e.g. using virtualenv or conda:

virtualenv

# pip install virtualenv
cd <path-to-Deep-SAD-PyTorch-directory>
virtualenv myenv
source myenv/bin/activate
pip install -r requirements.txt

conda

cd <path-to-Deep-SAD-PyTorch-directory>
conda create --name myenv
source activate myenv
while read requirement; do conda install -n myenv --yes $requirement; done < requirements.txt

Running experiments

We have implemented the MNIST, Fashion-MNIST, and CIFAR-10 datasets as well as the classic anomaly detection benchmark datasets arrhythmia, cardio, satellite, satimage-2, shuttle, and thyroid from the Outlier Detection DataSets (ODDS) repository (http://odds.cs.stonybrook.edu/) as reported in the paper.

The implemented network architectures are as reported in the appendix of the paper.

Deep SAD

You can run Deep SAD experiments using the main.py script.

Here's an example on MNIST with 0 considered to be the normal class and having 1% labeled (known) training samples from anomaly class 1 with a pollution ratio of 10% of the unlabeled training data (with unknown anomalies from all anomaly classes 1-9):

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folders for experimental output
mkdir log/DeepSAD
mkdir log/DeepSAD/mnist_test

# change to source directory
cd src

# run experiment
python main.py mnist mnist_LeNet ../log/DeepSAD/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --lr 0.0001 --n_epochs 150 --lr_milestone 50 --batch_size 128 --weight_decay 0.5e-6 --pretrain True --ae_lr 0.0001 --ae_n_epochs 150 --ae_batch_size 128 --ae_weight_decay 0.5e-3 --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

Have a look into main.py for all possible arguments and options.

Baselines

We also provide an implementation of the following baselines via the respective baseline_<method_name>.py scripts: OC-SVM (ocsvm), Isolation Forest (isoforest), Kernel Density Estimation (kde), kernel Semi-Supervised Anomaly Detection (ssad), and Semi-Supervised Deep Generative Model (SemiDGM).

Here's how to run SSAD for example on the same experimental setup as above:

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folder for experimental output
mkdir log/ssad
mkdir log/ssad/mnist_test

# change to source directory
cd src

# run experiment
python baseline_ssad.py mnist ../log/ssad/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --kernel rbf --kappa 1.0 --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

The autoencoder is provided through Deep SAD pre-training using --pretrain True with main.py. To then run a hybrid approach using one of the classic methods on top of autoencoder features, simply point to the saved autoencoder model using --load_ae ../log/DeepSAD/mnist_test/model.tar and set --hybrid True.

To run hybrid SSAD for example on the same experimental setup as above:

cd <path-to-Deep-SAD-PyTorch-directory>

# activate virtual environment
source myenv/bin/activate  # or 'source activate myenv' for conda

# create folder for experimental output
mkdir log/hybrid_ssad
mkdir log/hybrid_ssad/mnist_test

# change to source directory
cd src

# run experiment
python baseline_ssad.py mnist ../log/hybrid_ssad/mnist_test ../data --ratio_known_outlier 0.01 --ratio_pollution 0.1 --kernel rbf --kappa 1.0 --hybrid True --load_ae ../log/DeepSAD/mnist_test/model.tar --normal_class 0 --known_outlier_class 1 --n_known_outlier_classes 1;

License

MIT

Owner
Lukas Ruff
PhD student in the ML group at TU Berlin.
Lukas Ruff
Feature Store for Machine Learning

Overview Feast is an open source feature store for machine learning. Feast is the fastest path to productionizing analytic data for model training and

Feast 3.8k Dec 30, 2022
A Sublime Text plugin to select a default syntax dialect

Default Syntax Chooser This Sublime Text 4 plugin provides the set_default_syntax_dialect command. This command manipulates a syntax file (e.g.: SQL.s

3 Jan 14, 2022
Explain yourself! Interrogate a codebase for docstring coverage.

interrogate: explain yourself Interrogate a codebase for docstring coverage. Why Do I Need This? interrogate checks your code base for missing docstri

Lynn Root 435 Dec 29, 2022
A comprehensive and FREE Online Python Development tutorial going step-by-step into the world of Python.

FREE Reverse Engineering Self-Study Course HERE Fundamental Python The book and code repo for the FREE Fundamental Python book by Kevin Thomas. FREE B

Kevin Thomas 7 Mar 19, 2022
A Python Package To Generate Strong Passwords For You in Your Projects.

shPassGenerator Version 1.0.6 Ready To Use Developed by Shervin Badanara (shervinbdndev) on Github Language and technologies used in This Project Work

Shervin 11 Dec 19, 2022
The mitosheet package, trymito.io, and other public Mito code.

Mito Monorepo Mito is a spreadsheet that lives inside your JupyterLab notebooks. It allows you to edit Pandas dataframes like an Excel file, and gener

Mito 1.4k Dec 31, 2022
API spec validator and OpenAPI document generator for Python web frameworks.

API spec validator and OpenAPI document generator for Python web frameworks.

1001001 249 Dec 22, 2022
Pyoccur - Python package to operate on occurrences (duplicates) of elements in lists

pyoccur Python Occurrence Operations on Lists About Package A simple python package with 3 functions has_dup() get_dup() remove_dup() Currently the du

Ahamed Musthafa 6 Jan 07, 2023
A simple document management REST based API for collaboratively interacting with documents

documan_api A simple document management REST based API for collaboratively interacting with documents.

Shahid Yousuf 1 Jan 22, 2022
PyPresent - create slide presentations from notes

PyPresent Create slide presentations from notes Add some formatting to text file

1 Jan 06, 2022
A simple tutorial to get you started with Discord and it's Python API

Hello there Feel free to fork and star, open issues if there are typos or you have a doubt. I decided to make this post because as a newbie I never fo

Sachit 1 Nov 01, 2021
:blue_book: Automatic documentation from sources, for MkDocs.

mkdocstrings Automatic documentation from sources, for MkDocs. Features - Python handler - Requirements - Installation - Quick usage Features Language

1.1k Jan 04, 2023
Plugins for MkDocs.

Plugins for MkDocs and Python Markdown pip install neoteroi-mkdocs This package includes the following plugins and extensions: Name Description Type m

35 Dec 23, 2022
DeltaPy - Tabular Data Augmentation (by @firmai)

DeltaPy⁠⁠ — Tabular Data Augmentation & Feature Engineering Finance Quant Machine Learning ML-Quant.com - Automated Research Repository Introduction T

Derek Snow 470 Dec 28, 2022
My solutions to the Advent of Code 2021 problems in Go and Python 🎄

🎄 Advent of Code 2021 🎄 Summary Advent of Code is an annual Advent calendar of programming puzzles. This year I am doing it in Go and Python. Runnin

Orfeas Antoniou 16 Jun 16, 2022
Mayan EDMS is a document management system.

Mayan EDMS is a document management system. Its main purpose is to store, introspect, and categorize files, with a strong emphasis on preserving the contextual and business information of documents.

3 Oct 02, 2021
A next-generation curated knowledge sharing platform for data scientists and other technical professions.

Knowledge Repo The Knowledge Repo project is focused on facilitating the sharing of knowledge between data scientists and other technical roles using

Airbnb 5.2k Dec 27, 2022
Projeto em Python colaborativo para o Bootcamp de Dados do Itaú em parceria com a Lets Code

🧾 lets-code-todo-list por Henrique V. Domingues e Josué Montalvão Projeto em Python colaborativo para o Bootcamp de Dados do Itaú em parceria com a L

Henrique V. Domingues 1 Jan 11, 2022
A Python library that simplifies the extraction of datasets from XML content.

xmldataset: simple xml parsing 🗃️ XML Dataset: simple xml parsing Documentation: https://xmldataset.readthedocs.io A Python library that simplifies t

James Spurin 75 Dec 30, 2022
This program has been coded to allow the user to rename all the files in the entered folder.

Bulk_File_Renamer This program has been coded to allow the user to rename all the files in the entered folder. The only required package is "termcolor

1 Jan 06, 2022