This tool parses log data and allows you to define analysis pipelines for anomaly detection.

Overview

logdata-anomaly-miner

This tool parses log data and allows you to define analysis pipelines for anomaly detection. It was designed to run the analysis with limited resources and the lowest possible permissions, making it suitable for production server use.

AECID Demo – Anomaly Detection with aminer and Reporting to IBM QRadar

Requirements

Installing logdata-anomaly-miner requires a Linux system with Python >= 3.6. Debian-based distributions are currently recommended.

See requirements.txt for further module dependencies.

Installation

Debian

There are Debian packages for logdata-anomaly-miner in the official Debian/Ubuntu repositories.

apt-get update && apt-get install logdata-anomaly-miner

From source

The following commands install the latest stable release:

cd $HOME
wget https://raw.githubusercontent.com/ait-aecid/logdata-anomaly-miner/main/scripts/aminer_install.sh
chmod +x aminer_install.sh
./aminer_install.sh

Docker

For installation with Docker see: Deployment with Docker

Getting started

Here are some resources to read in order to get started with configurations:

Publications

Publications and talks:

A complete list of publications can be found at https://aecid.ait.ac.at/further-information/.

Contribution

We happily accept patches and other contributions. Please see the following links for how to get started:

Bugs

If you encounter any bugs, please create an issue on GitHub.

Security

If you discover any security-related issues, read SECURITY.md first and then report them.

License

GPL-3.0

Comments
  • Multiline support

    Multiline support

    Since issue 372 was closed, I am opening a new issue for multiline support. See https://github.com/ait-aecid/logdata-anomaly-miner/issues/372

    As I mentioned in the issue, it would be good to have an optional EOL parameter in the config to support simple multiline logs that are clearly separable, e.g., by \n\n that otherwise does not occur. We could also think about supporting more advanced multiline logs, in particular, json formatted logs where each json object spans over several lines rather than a single line. This could be solved by counting brackets, i.e., the ByteStreamAtomizer increases a counter (initially set to 0) for every "{" and decreases it for every "}" (or any other user-defined characters), and passes a log_atom to the parser every time this counter reaches 0.
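    A minimal standalone sketch of this bracket-counting idea (plain Python, not the actual ByteStreamLineAtomizer API; the function name and delimiters are made up for illustration, and braces inside strings are not handled):

    # Hypothetical sketch of the proposed bracket-counting atomization:
    # accumulate bytes until the opening and closing delimiters balance,
    # then emit the accumulated chunk as one log atom.
    def split_json_atoms(stream_bytes, open_char=b"{", close_char=b"}"):
        atoms = []
        depth = 0
        current = bytearray()
        for i in range(len(stream_bytes)):
            byte = stream_bytes[i:i + 1]
            current += byte
            if byte == open_char:
                depth += 1
            elif byte == close_char:
                depth -= 1
                if depth == 0:
                    atoms.append(bytes(current).strip())
                    current = bytearray()
        return atoms

    # A JSON object spanning several lines becomes a single atom:
    data = b'{\n  "a": 1,\n  "b": {"c": 2}\n}\n{"d": 3}\n'
    print(split_json_atoms(data))  # [b'{\n  "a": 1,\n  "b": {"c": 2}\n}', b'{"d": 3}']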

    enhancement 
    opened by landauermax 15
  • Allowlist and blocklist for detector path lists

    Allowlist and blocklist for detector path lists

    allowlisted_paths in ECD should be named blocklisted_paths, since these paths are not considered for detection.

    allowlisted_paths should also exist, but do the opposite: analysis should only be carried out when one of the paths in the log atom's match dictionary contains one of the allowlisted_paths.

    The attribute paths should overrule these lists.

    This feature should be available for all detectors that may be analyzing all available parser matches, such as the VTD.
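    A small sketch of the intended precedence (plain Python; the attribute names mirror the ones proposed above, but the helper itself is hypothetical and not part of the aminer code base):

    # Hypothetical helper: explicit `paths` overrule both lists; otherwise
    # blocklisted paths are dropped from detection, and if an allowlist is
    # given, the atom is only analyzed when at least one allowlisted path occurs.
    def select_paths(match_paths, paths=None, allowlisted_paths=None, blocklisted_paths=None):
        if paths:  # the attribute paths overrules these lists
            return [p for p in match_paths if p in paths]
        selected = [p for p in match_paths if not (blocklisted_paths and p in blocklisted_paths)]
        if allowlisted_paths and not any(p in allowlisted_paths for p in selected):
            return []
        return selected

    # Example with paths taken from a log atom's match dictionary:
    print(select_paths(["/model/user", "/model/ip"], blocklisted_paths=["/model/ip"]))  # ['/model/user']
    print(select_paths(["/model/user"], allowlisted_paths=["/model/other"]))            # []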

    enhancement 
    opened by landauermax 15
  • Fix import warnings

    Fix import warnings

    /usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning: can't resolve package from spec or package, falling back on name and path

    return f(*args, **kwds)

    should not occur when running the aminer.

    bug 
    opened by 4cti0nfi9ure 15
  • %z makes parsing way too slow

    %z makes parsing way too slow

    When using %z in the parsing model (see slow.txt), I get around 50 lines per second; without it, I get around 1000 lines per second (see fast.txt). Something is wrong with parsing %z in the DateTimeModelElement.

    fast.txt slow.txt train.log config.py.txt

    bug high 
    opened by landauermax 12
  • added nullable functionality to JsonModelElements.

    added nullable functionality to JsonModelElements.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1061 Fixes #1074

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 11
  • Create backups of persistency

    Create backups of persistency

    There should be a command line parameter that backs up the persistency at regular intervals. Also, there should be a remote control command that saves the persistency when executed.

    The persistency should be copied into a directory /var/lib/aminer/backup/yyyy-mm-dd-hh-mm-ss/...

    There should also be the possibility to restore configs, config settings, etc. by remote control.
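    A sketch of the timestamped copy (plain Python standard library; the directories follow the paths mentioned above, everything else is illustrative):

    import shutil
    from datetime import datetime
    from pathlib import Path

    def backup_persistency(src="/var/lib/aminer", dst_root="/var/lib/aminer/backup"):
        # Copy the persistency directory into a yyyy-mm-dd-hh-mm-ss backup folder,
        # excluding the backup directory itself so backups are not copied recursively.
        timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        target = Path(dst_root) / timestamp
        shutil.copytree(src, target, ignore=shutil.ignore_patterns("backup"))
        return target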

    enhancement 
    opened by landauermax 11
  • Tabs in logs

    Tabs in logs

    My log file contains tabulators (e.g. System name:\tTESTNAME). However, the byte strings in the parsing models cannot interpret these tabulators (\t): FixedDataModelElement('fixed1', b'System name:\t'),

    How can I make it possible for the tabs to be interpreted correctly?
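    As a plain-Python sanity check (independent of aminer itself), \t inside a bytes literal already is a real tab byte, so a fixed byte string compares against raw log bytes that contain tabs:

    # Plain Python check, not specific to the aminer parsing model classes:
    line = b"System name:\tTESTNAME"
    prefix = b"System name:\t"
    assert prefix.endswith(b"\x09")  # \t in a bytes literal is the tab byte 0x09
    assert line.startswith(prefix)   # so the fixed prefix matches log bytes containing a tab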

    opened by tschohanna 10
  • Add overall output for aminer

    Add overall output for aminer

    There should be a way to write everything that the AMiner outputs to a file. For example, at the beginning of the config, a parameter StandardOutput: "/etc/aminer/output.txt" could be set, and all output (anomalies, errors, etc.) would be written there in addition to the usual output components. By default, it should be None and not write anything.

    enhancement 
    opened by landauermax 10
  • Warning if two detectors persist on same file

    Warning if two detectors persist on same file

    It is possible to define two detectors of the same type that will end up persisting to the same file; this can especially happen by accident when the "Default" name is used. We should not prevent it completely, but at least print a warning when two or more detectors persist to the same file.
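    A hedged sketch of such a check (generic Python; the detector representation and the assumption that the pair of type and persistence_id determines the persistence file are illustrative, not the actual aminer internals):

    from collections import Counter

    def warn_on_shared_persistence(detectors):
        # Warn if two or more detectors would persist to the same file; each detector
        # is represented here as a dict with its type and an optional persistence_id.
        keys = [(d["type"], d.get("persistence_id", "Default")) for d in detectors]
        for (dtype, pid), count in Counter(keys).items():
            if count > 1:
                print(f"WARNING: {count} detectors persist to the same file: {dtype}/{pid}")

    # Example: two detectors of the same type both falling back to the "Default" id.
    warn_on_shared_persistence([
        {"type": "NewMatchPathValueDetector"},
        {"type": "NewMatchPathValueDetector"},
    ])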

    enhancement 
    opened by landauermax 9
  • AtomFilterMatchAction YAML support

    AtomFilterMatchAction YAML support

    There should be a way to use a MatchRule so that only logs that match are forwarded to a specific detector, using the AtomFilterMatchAction. This can be done in python configs, but not in yaml configs. Also, tests and documentation are missing.

    enhancement high 
    opened by landauermax 8
  • Paths to JSON list elements

    Paths to JSON list elements

    I have this sample data:

    [email protected]:/home/ubuntu# cat file3.log 
    {"a": ["success", "a.png"]}
    {"a": ["success", "b.png"]}
    {"a": ["fail", "c.png"]}
    {"a": ["success", "c.png"]}
    

    The values in the list should be detected with a value detector. They should not be mixed, i.e., the first and second element in the list are independent.

    I use the following config to parse the file:

    LearnMode: True
    
    LogResourceList:
      - "file:///home/ubuntu/file3.log"
    
    Parser:  
           - id: x
             type: VariableByteDataModelElement
             name: 'x'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - x
    
    Input:
            timestamp_paths: None
            verbose: True
            json_format: True
    
    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/x'
              learn_mode: true
              persistence_id: test
    
    EventHandlers:
            - id: stpe
              json: true
              type: StreamPrinterEventHandler
    

    Note that I use a value detector on the list. The result is as follows:

    [email protected]:/home/ubuntu# cat /var/lib/aminer/NewMatchPathValueDetector/test 
    ["bytes:a.png", "bytes:c.png", "bytes:b.png"]
    

    Only the last element of each list has been learned, but I also want to learn the first element in the array.

    I propose to model all elements of the lists as their own elements, so that the parser looks like this:

    Parser:
           - id: y
             type: FixedWordlistDataModelElement
             name: 'y'
             args:
               - 'success'
               - 'fail'
                 
           - id: x
             type: VariableByteDataModelElement
             name: 'x'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - y
                 - x
    

    and the analysis could look like this, where each element can be addressed individually by an analysis component:

    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/x'
              learn_mode: true
              persistence_id: test
    
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/y'
              learn_mode: true
              persistence_id: test
    

    The current implementation uses a single element to model all elements of the list. This can also be convenient and should be possible by introducing a new element called ListOfElements. It should parse any number of elements in the list with the specified parsing model element. For example, the list of elements here is a list of variable byte elements:

    Parser:
           - id: loe
             type: ListOfElements
             name: 'loe'
             args: z
                 
           - id: z
             type: VariableByteDataModelElement
             name: 'z'
             args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'
    
           - id: json
             start: True
             type: JsonModelElement
             name: 'model'
             key_parser_dict:
               "a": 
                 - loe
    

    The ListOfElements element should then append the index of the element in the JSON list to the end of the path. For example, the following paths can be used in the analysis section:

    Analysis:
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/loe/0'
              learn_mode: true
              persistence_id: test
    
            - id: vd
              type: NewMatchPathValueDetector
              paths:
                  - '/model/loe/1'
              learn_mode: true
              persistence_id: test
    
    enhancement medium 
    opened by landauermax 8
  • extended FrequencyDetector wiki tests.

    extended FrequencyDetector wiki tests.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1008 Fixes #1009

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 0
  • fixed test26 so no fix definition number has to be added.

    fixed test26 so no fix definition number has to be added.

    Make sure these boxes are signed before submitting your Pull Request -- thank you.

    Must haves

    • [x] I have read and followed the contributing guide lines at https://github.com/ait-aecid/logdata-anomaly-miner/wiki/Git-development-workflow
    • [x] Issues exist for this PR
    • [x] I added related issues using the "Fixes #"-notations
    • [x] This Pull-Requests merges into the "development"-branch

    Fixes #1181

    Submission specific

    • [ ] This PR introduces breaking changes
    • [ ] My change requires a change to the documentation
    • [ ] I have updated the documentation accordingly
    • [ ] I have added tests to cover my changes
    • [ ] All new and existing tests passed

    Describe changes:

    opened by ernstleierzopf 0
  • Random test fails when new detector is added

    Random test fails when new detector is added

    When adding a new detector and running the tests, they usually fail at test26_filter_config_errors in YamlConfigTest.py because there is an integer that needs to be incremented. For example, see PR #1180, where this had to be fixed when adding a new detector. It is hard to spot why this test fails, as it has nothing to do with the added detector and is not an indicator of something that needs to be fixed. I therefore suggest modifying this test case so that it passes no matter what integer comes after the "definition" keyword. Adding new detectors in the future should then not require updating this test.
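    One hedged way to make the comparison robust (an illustrative snippet, not the actual YamlConfigTest code) is to normalize the integer that follows the "definition" keyword before asserting:

    import re

    def normalize_definition_numbers(message):
        # Replace the integer after the "definition" keyword so expected and actual
        # error messages compare equal regardless of how many definitions exist.
        return re.sub(r"(definition\D*)\d+", r"\1<N>", message)

    # Adding a detector changes the number but no longer breaks the comparison:
    assert normalize_definition_numbers("error in definition 42") == \
           normalize_definition_numbers("error in definition 43")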

    test medium 
    opened by landauermax 0
  • Add possibility to run some LogResources as json input and some as normal text input.

    Add possibility to run some LogResources as json input and some as normal text input.

    LogResourceList:
    
       - url: "file:///var/log/apache2/access.log"
       - url: "unix:///var/lib/akafka/aminer.sock"
         type: json  # Configures the ByteStream
         parser_id: kafka_audit_logs  # Configures the associated parser
    
    
    Parser:
       - id: kafka_audit_logs
         type: AuditDingsParser
    
       - id: ApacheAccessModel
         start: true
    
    opened by ernstleierzopf 0
  • Shorten the build-time for docker builds

    Shorten the build-time for docker builds

    Currently the complete docker image is built at once. This takes a lot of time for each build. We could shorten the build time by inheriting from a pre-built image.

    enhancement 
    opened by whotwagner 0
Releases(V2.5.1)
  • V2.5.1(May 17, 2022)

    Bugfixes:

    • EFD: Fixed problem that appears with empty windows
    • Fixed index out of range if matches are empty in JsonModelElement array.
    • EFD: Enabled immediate detection without training, if both limits are set
    • EFD: Fixed bug related to auto_include_flag
    • Remove spaces in aminer logo
    • ParserCounter: Fixed do_timer
    • Fixed code to allow the usage of AtomFilterMatchAction in yaml configs
    • Fixed JsonModelElement when json object is null
    • Fix incorrect message of charset detector
    • Fix match list handling for json objects

    Changes:

    • Added nullable functionality to JsonModelElements
    • Added include-directive to supervisord.conf
    • ETD: Output warning when count first exceeds range
    • EFD: Added option to output anomaly when the count first exceeds the range
    • VTD: Added variable type 'range'
    • EFD: Added the function reset_counter
    • EFD: Added option to set the lower and upper limit of the range interval
    • Enhance EFD to consider multiple time windows
    • VTD: Changed the value of the parameter num_updates_until_var_reduction that tracks all variables from False to 0.
    • PAD: Used the binom_test of the scipy package to test whether the model should be reinitialized if fewer anomalies occur than expected
    • Add ParsedLogAtom to aminer parser to ensure compatibility with lower versions
    • Added script to add build-id to the version-string
    • Support for installations from source in install-script
    • Fixed and standardized the persistence time of various detectors
    • Refactoring
    • Improve performance
    • Improve output handling
    • Improved testing
    Source code(tar.gz)
    Source code(zip)
  • V2.5.0(Dec 6, 2021)

    Bugfixes:

    • Fixed bug in YamlConfig

    Changes:

    • Added supervisord to docker
    • Moved unparsed atom handlers to analysis(yamlconfig)
    • Moved new_match_path_detector to analysis(yamlconfig)
    • Refactor: merged all UnparsedHandlers into one python-file
    • Added remotecontrol-command for reopening eventhandlers
    • Added config-parameters for logrotation
    • Improved testing
    Source code(tar.gz)
    Source code(zip)
  • V2.4.2(Nov 24, 2021)

    Bugfixes:

    • PVTID: Fixed output format of previously appeared times
    • VTD: Fixed bugs (static -> discrete)
    • VTD: Fixed persistency-bugs
    • Fixed %z performance issues
    • Fixed error where optional keys with an array type are not parsed when being null
    • Fixed issues with JsonModelElement
    • Fixed persistence handling for ValueRangeDetector
    • PTSAD: Fixed a bug, which occurs, when the ETD stops saving the values of one analyzed path
    • ETD: Fixed the problem when entries of the match_dictionary are not of type MatchElement
    • Fixed error where json data instead of array was parsed successfully.

    Changes:

    • Added multiple parameters to VariableCorrelationDetector
    • Improved VTD
    • PVTID: Renamed parameter time_window_length to time_period_length
    • PVTID: Added check if atom time is None
    • Enhanced output of MTTD and PVTID
    • Improved docker-compose-configuration
    • Improved testing
    • Enhanced PathArimaDetector
    • Improved documentation
    • Improved KernelMsgParsingModel
    • Added pretty print for json output
    • Added the PathArimaDetector
    • TSA: Added functionality to discard arima models with too few log lines per time step
    • TSA: improved confidence calculation
    • TSA: Added the option to force the period length
    • TSA: Automatic selection of the pause area of the ACF
    • Extended EximGenericParsingModel
    • Extended AudispdParsingModel
    Source code(tar.gz)
    Source code(zip)
  • V2.4.1(Jul 23, 2021)

    Bugfixes:

    • Fixed issues with array of arrays in JsonParser
    • Fixed problems with invalid json-output
    • Fixed ValueError in DTME
    • Fixed error with parsing floats in scientific notation with the JsonModelElement.
    • Fixed issue with paths in JsonModelElement
    • Fixed error with \x encoded json
    • Fixed error where EMPTY_ARRAY and EMPTY_OBJECT could not be parsed from the yaml config
    • Fixed a bug in the TSA when encountering a new event type
    • Fixed systemd script
    • Fixed encoding errors when reading yaml configs

    Changes:

    • Add entropy detector
    • Add charset detector
    • Add value range detector
    • Improved ApacheAccessModel, AudispdParsingModel
    • Refactoring
    • Improved documentation
    • Improved testing
    • Improved schema for yaml-config
    • Added EMPTY_STRING option to the JsonModelElement
    • Implemented check to report unparsed atom if ALLOW_ALL is used with data with a type other than list or dict
    Source code(tar.gz)
    Source code(zip)
  • V2.4.0(Jun 10, 2021)

    Bugfixes:

    • Fixed error in JsonModelElement
    • Fixed problems with umlauts in JsonParser
    • Fixed problems with the start element of the ElementValueBranchModelElement
    • Fixed issues with the stat and debug command line parameters
    • Fixed issues if posix acl are not supported by the filesystem
    • Fixed issues with output for non ascii characters
    • Modified kafka-version

    Changes:

    • Improved command-line options of the install-script
    • Added documentation
    • Improved VTD CM-Test
    • Improved unit-tests
    • Refactoring
    • Added TSAArimaDetector
    • Improved ParserCount
    • Added the PathValueTimeIntervalDetector
    • Implemented offline mode
    • Added PCA detector
    • Added timeout-parameter to ESD
    Source code(tar.gz)
    Source code(zip)
  • V2.3.1(Apr 8, 2021)

  • V2.3.0(Mar 31, 2021)

    Bugfixes:

    • Changed pyyaml-version to 5.4
    • NewMatchIdValueComboDetector: Fix to allow multiple values per id path
    • ByteStreamLineAtomizer: fixed encoding error
    • Fixed too many open directory-handles
    • Added close() function to LogStream

    Changes:

    • Added EventFrequencyDetector
    • Added EventSequenceDetector
    • Added JsonModelElement
    • Added tests for Json-Handling
    • Added command line parameter for update checks
    • Improved testing
    • Split yaml-schemas into multiple files
    • Improved support for yaml-config
    • YamlConfig: set verbose default to true
    • Various refactoring
    Source code(tar.gz)
    Source code(zip)
  • V2.2.3(Feb 5, 2021)

  • V2.2.2(Jan 29, 2021)

  • V2.2.1(Jan 26, 2021)

    Bugfixes:

    • Fixed warnings due to files in Persistency-Directory
    • Fixed ACL-problems in dockerfile and autocreate /var/lib/aminer/log

    Changes:

    • Added simple test for dockercontainer
    • Negate result of the timeout-command. 1 is okay. 0 must be an error
    • Added bullseye-tests
    • Make tmp-dir in debian-bullseye-test and debian-buster-test unique
    Source code(tar.gz)
    Source code(zip)
  • V2.2.0(Dec 23, 2020)

    Changes:

    • Added Dockerfile
    • Added checks for acl of persistency directory
    • Added VariableCorrelationDetector
    • Added tool for managing multiple persistency files
    • Added suppress-list for output
    • Added suspend-mode to remote-control
    • Added requirements.txt
    • Extended documentation
    • Extended yaml-configuration-support
    • Standardize command line parameters
    • Removed --Forground cli parameter
    • Fixed Security warnings by removing functions that allow race-condition
    • Refactoring
    • Ethically correct naming of variables
    • Enhanced testing
    • Added statistic outputs
    • Enhanced status info output
    • Changed global learn_mode behavior
    • Added RemoteControlSocket to yaml-config
    • Reimplemented the default mailnotificationhandler

    Bugfixes:

    • Fixed typos in documentation
    • Fixed issue with the AtomFilter in the yaml-config
    • Fixed order of ETD in yaml-config
    • Fixed various issues in persistency
    Source code(tar.gz)
    Source code(zip)
  • V2.1.0(Nov 5, 2020)

    • Changes:
      • Added VariableTypeDetector, EventTypeDetector and EventCorrelationDetector
      • Added support for unclean format strings in the DateTimeModelElement
      • Added timezones to the DateTimeModelElement
      • Enhanced ApacheAccessModel
      • Yamlconfig: added support for kafka stream
      • Removed cpu limit configuration
      • Various refactoring
      • Yamlconfig: added support for more detectors
      • Added new command-line-parameters
      • Renamed executables to aminer.py and aminerremotecontrol.py
      • Run aminer in foreground-mode per default
      • Added various unit-tests
      • Improved yamlconfig and checks
      • Added start-config for parser to yamlconfig
      • Renamed config templates
      • Removed imports from __init__.py for better modularity
      • Created AnalysisComponentsPerformanceTests for the EventTypeDetector
      • Extended demo-config
      • Renamed whitelist to allowlist
      • Added warnings for non-existent resources
      • Changed default of auto_include_flag to false
    • Bugfixes:
      • Fixed some exit() in forks
      • Fixed debian files
      • Fixed JSON output of the AffectedLogAtomValues in all detectors
      • Fixed normal output of the NewMatchPathValueDetector
      • Fixed reoccurring alerting in MissingMatchPathValueDetector
    Source code(tar.gz)
    Source code(zip)
  • V2.0.2(Jul 17, 2020)

    • Changes:
      • Added help parameters
      • Added help-screen
      • Added version parameter
      • Added path and value filter
      • Change time model of ApacheAccessModel for arbitrary time zones
      • Update link to documentation
      • Added SECURITY.md
      • Refactoring
      • Updated man-page
      • Added unit-tests for loadYamlconfig
    • Bugfixes:
      • Fixed header comment type in schema file
      • Fix debian files
    Source code(tar.gz)
    Source code(zip)
  • V2.0.1(Jun 24, 2020)

    • Changes:
      • Updated documentation
      • Updated testcases
      • Updated demos
      • Updated debian files
      • Added copyright headers
      • Added executable bit to AMiner
    Source code(tar.gz)
    Source code(zip)
  • V2.0.0(May 29, 2020)

    • Changes:
      • Updated documentation
      • Added functions getNameByComponent and getIdByComponent to AnalysisChild.py
      • Update DefaultMailNotificationEventHandler.py to python3
      • Extended AMinerRemoteControl
      • Added support for configuration in yaml format
      • Refactoring
      • Added KafkaEventHandler
      • Added JsonConverterHandler
      • Added NewMatchIdValueComboDetector
      • Enabled multiple default timestamp paths
      • Added debug feature ParserCount
      • Added unit and integration tests
      • Added installer script
      • Added VerboseUnparsedHandler
    • Bugfixes including:
      • Fixed dependencies in Debian packaging
      • Fixed typo in various analysis components
      • Fixed import of ModelElementInterface in various parsing components
      • Fixed issues with byte/string comparison
      • Fixed issue in DecimalIntegerValueModelElement, when parsing integer including sign and padding character
      • Fixed unnecessary long blocking time in SimpleMultisourceAtomSync
      • Changed minimum matchLen in DelimitedDataModelElement to 1 byte
      • Fixed timezone offset in ModuloTimeMatchRule
      • Minor bugfixes
    Source code(tar.gz)
    Source code(zip)
Owner
AECID
Automatic Event Correlation for Incident Detection