One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

Related tags

Text Data & NLPOSAS
Overview

One Stop Anomaly Shop (OSAS)

Quick start guide

Step 1: Get/build the docker image

Option 1: Use precompiled image (might not reflect latest changes):

docker pull tiberiu44/osas:latest
docker image tag tiberiu44/osas:latest osas:latest

Option 2: Build the image locally

git clone https://github.com/adobe/OSAS.git
cd OSAS
docker build . -f docker/osas-elastic/Dockerfile -t osas:latest

Step 2: After building the docker image you can start OSAS by typing:

docker run -p 8888:8888/tcp -p 5601:5601/tcp -v <ABSOLUTE PATH TO DATA FOLDER>:/app osas

IMPORTANT NOTE: Please modify the above command by adding the absolute path to your datafolder in the appropiate location

After OSAS has started (it might take 1-2 minutes) you can use your browser to access some standard endpoints:

For Debug (in case you need to):

docker run -p 8888:8888/tcp -p 5601:5601/tcp -v <ABSOLUTE PATH TO DATA FOLDER>:/app -ti osas /bin/bash

Building the test pipeline

This guide will take you through all the necessary steps to configure, train and run your own pipeline on your own dataset.

Prerequisite: Add you own CSV dataset into your data-folder (the one provided in the docker run command)

Once you started your docker image, use the OSAS console to gain CLI access to all the tools.

In what follows, we assume that your dataset is called dataset.csv. Please update the commands as necessary in case you use a different name/location.

Be sure you are running scripts in the root folder of OSAS:

cd /osas

Step 1: Build a custom pipeline configuration file - this can be done fully manually on by bootstraping using our conf autogenerator script:

python3 osas/main/autoconfig.py --input-file=/app/dataset.csv --output-file=/app/dataset.conf

The above command will generate a custom configuration file for your dataset. It will try guess field types and optimal combinations between fields. You can edit the generated file (which should be available in the shared data-folder), using your favourite editor.

Standard templates for label generator types are:

[LG_MULTINOMIAL]
generator_type = MultinomialField
field_name = <FIELD_NAME>
absolute_threshold = 10
relative_threshold = 0.1

[LG_TEXT]
generator_type = TextField
field_name = <FIELD_NAME>
lm_mode = char
ngram_range = (3, 5)

[LG_NUMERIC]
generator_type = NumericField
field_name = <FIELD_NAME>

[LG_MUTLINOMIAL_COMBINER]
generator_type = MultinomialFieldCombiner
field_names = ['<FIELD_1>', '<FIELD_2>', ...]
absolute_threshold = 10
relative_threshold = 0.1

[LG_KEYWORD]
generator_type = KeywordBased
field_name = <FIELD_NAME>
keyword_list = ['<KEYWORD_1>', '<KEYWORD_2>', '<KEYWORD_3>', ...]

[LG_REGEX]
generator_type = KnowledgeBased
field_name = <FIELD_NAME>
rules_and_labels_tuple_list = [('<REGEX_1>','<LABEL_1>'), ('<REGEX_2>','<LABEL_2>'), ...]

You can use the above templates to add as many label generators you want. Just make sure that the header IDs are unique in the configuration file.

Step 2: Train the pipeline

python3 osas/main/train_pipeline --conf-file=/app/dataset.conf --input-file=/app/dataset.csv --model-file=/app/dataset.json

The above command will generate a pretrained pipeline using the previously created configuration file and the dataset

Step 3: Run the pipeline on a dataset

python3 osas/main/run_pipeline --conf-file=/app/dataset.conf --model-file=/app/dataset.json --input-file=/app/dataset.csv --output-file=/app/dataset-out.csv

The above command will run the pretrained pipeline on any compatible dataset. In the example we run the pipeline on the training data, but you can use previously unseen data. It will generate an output file with labels and anomaly scores and it will also import your data into Elasticsearch/Kibana. To view the result just use the the web interface.

Pipeline explained

The pipeline sequentially applies all label generators on the raw data, collects the labels and uses an anomaly scoring algorithm to generate anomaly scores. There are two main component classes: LabelGenerator and ScoringAlgorithm.

Label generators

NumericField

  • This type of LabelGenerator handles numerical fields. It computes the mean and standard deviation and generates labels according to the distance between the current value and the mean value (value<=sigma NORMAL, sigma<value<=2sigma BORDERLINE, 2sigma<value OUTLIER)

Params:

  • field_name: what field to look for in the data object

TextField

  • This type of LabelGenerator handles text fields. It builds a n-gram based language model and computes the perplexity of newly observed data. It also holds statistics over the training data (mean and stdev). (perplexity<=sigma NORMAL, sigma<preplexity<=2sigma BORDERLINE, 2perplexity<value OUTLIER)

Params:

  • field_name: What field to look for
  • lm_mode: Type of LM to build: char or token
  • ngram_range: N-gram range to use for computation

MultinomialField

  • This type of LabelGenerator handles fields with discreet value sets. It computes the probability of seeing a specific value and alerts based on relative and absolute thresholds.

Params

  • field_name: What field to use
  • absolute_threshold: Minimum absolute value for occurrences to trigger alert for
  • relative_threshold: Minimum relative value for occurrences to trigger alert for

MultinomialFieldCombiner

  • This type of LabelGenerator handles fields with discreet value sets and build advanced features by combining values across the same dataset entry. It computes the probability of seeing a specific value and alerts based on relative and absolute thresholds.

Params

  • field_names: What fields to combine
  • absolute_threshold: Minimum absolute value for occurrences to trigger alert for
  • relative_threshold: Minimum relative value for occurrences to trigger alert for

KeywordBased

  • This is a rule-based label generators. It applies a simple tokenization procedure on input text, by dropping special characters and numbers and splitting on white-space. It then looks for a specific set of keywords and generates labels accordingly

Params:

  • field_name: What field to use
  • keyword_list: The list of keywords to look for

OSAS has four unsupervised anomaly detection algorithms:

  • IFAnomaly: n-hot encoding, singular value decomposition, isolation forest (IF)

  • LOFAnomaly: n-hot encoding, singular value decomposition, local outlier factor (LOF)

  • SVDAnomaly: n-hot encoding, singular value decomposition, inverted transform, input reconstruction error

  • StatisticalNGramAnomaly: compute label n-gram probabilities, compute anomaly score as a sum of negative log likelihood

Owner
Adobe, Inc.
Open source from Adobe
Adobe, Inc.
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
Course project of [email protected]

NaiveMT Prepare Clone this repository git clone [email protected]:Poeroz/NaiveMT.git

Poeroz 2 Apr 24, 2022
KoBART model on huggingface transformers

KoBART-Transformers SKT에서 공개한 KoBART를 편리하게 사용할 수 있게 transformers로 포팅하였습니다. Install (Optional) BartModel과 PreTrainedTokenizerFast를 이용하면 설치하실 필요 없습니다. p

Hyunwoong Ko 58 Dec 07, 2022
GPT-2 Model for Leetcode Questions in python

Leetcode using AI 🤖 GPT-2 Model for Leetcode Questions in python New demo here: https://huggingface.co/spaces/gagan3012/project-code-py Note: the Ans

Gagan Bhatia 100 Dec 12, 2022
AllenNLP integration for Shiba: Japanese CANINE model

Allennlp Integration for Shiba allennlp-shiab-model is a Python library that provides AllenNLP integration for shiba-model. SHIBA is an approximate re

Shunsuke KITADA 12 Feb 16, 2022
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

Maksim Terpilowski 49 Dec 30, 2022
Code for using and evaluating SpanBERT.

SpanBERT This repository contains code and models for the paper: SpanBERT: Improving Pre-training by Representing and Predicting Spans. If you prefer

Meta Research 798 Dec 30, 2022
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
Generate vector graphics from a textual caption

VectorAscent: Generate vector graphics from a textual description Example "a painting of an evergreen tree" python text_to_painting.py --prompt "a pai

Ajay Jain 97 Dec 15, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022
Autoregressive Entity Retrieval

The GENRE (Generative ENtity REtrieval) system as presented in Autoregressive Entity Retrieval implemented in pytorch. @inproceedings{decao2020autoreg

Meta Research 611 Dec 16, 2022
Subtitle Workshop (subshop): tools to download and synchronize subtitles

SUBSHOP Tools to download, remove ads, and synchronize subtitles. SUBSHOP Purpose Limitations Required Web Credentials Installation, Configuration, an

Joe D 4 Feb 13, 2022
A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework.

Unpacker Karton Service A modular Karton Framework service that unpacks common packers like UPX and others using the Qiling Framework. This project is

c3rb3ru5 45 Jan 05, 2023
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

This repository contains the resources in our paper "Revisiting Pre-trained Models for Chinese Natural Language Processing", which will be published i

Yiming Cui 463 Dec 30, 2022
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022
Beyond Accuracy: Behavioral Testing of NLP models with CheckList

CheckList This repository contains code for testing NLP Models as described in the following paper: Beyond Accuracy: Behavioral Testing of NLP models

Marco Tulio Correia Ribeiro 1.8k Dec 28, 2022
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Bethge Lab 61 Dec 21, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022