50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

Overview

[Uni, work and hell breaking loose in my life took up my time. Now that things have calmed down a bit, I will continue committing!] [By the way, I'm still looking for new contributors! Please help make HyperLearn no. 1!]


HyperLearn is what drives Umbra's AI engines. It is open source to everyone, everywhere, and we hope humanity can rise to the stars.

[Notice - I will be updating the package monthly or bi-weekly due to other commitments]


Documentation: https://hyperlearn.readthedocs.io/en/latest/index.html

Faster, Leaner GPU Sklearn, Statsmodels written in PyTorch


50%+ faster, 50%+ less RAM usage: a GPU-enabled rewrite of Sklearn and Statsmodels combined, with novel algorithms.

HyperLearn is written completely in PyTorch, NoGil Numba, Numpy, Pandas, Scipy & LAPACK, and mirrors (mostly) Scikit Learn. HyperLearn also has statistical inference measures embedded, which can be accessed with Scikit Learn style syntax (model.confidence_interval_). Ongoing documentation: https://hyperlearn.readthedocs.io/en/latest/index.html

I'm also writing a mini book! A sneak peek:

[Figure: sneak peek of the mini book]

Comparison of Speed / Memory

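| Algorithm        | n       | p   | Time(s) Sklearn | Time(s) HyperLearn | RAM(MB) Sklearn | RAM(MB) HyperLearn | Notes                    |
|------------------|---------|-----|-----------------|--------------------|-----------------|--------------------|--------------------------|
| QDA (Quad Dis A) | 1000000 | 100 | 54.2            | 22.25              | 2,700           | 1,200              | Now parallelized         |
| LinearRegression | 1000000 | 100 | 5.81            | 0.381              | 700             | 10                 | Guaranteed stable & fast |

Time(s) is Fit + Predict. RAM(MB) = max( RAM(Fit), RAM(Predict) )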
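
The two definitions above (time = fit + predict, RAM = the larger of the two peaks) can be sketched as a small harness. This is only an illustration of the bookkeeping, not the script used to produce the table: `benchmark` and `timed` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator (NumPy arrays included), so the numbers will differ from process-level RAM readings.

```python
import time
import tracemalloc

import numpy as np


def timed(fn, *args):
    """Run fn(*args); return (result, seconds, peak MB seen by tracemalloc)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    out = fn(*args)
    seconds = time.perf_counter() - t0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()  # resets the trace, so each step gets its own peak
    return out, seconds, peak_bytes / 1e6


def benchmark(fit, predict):
    """Time(s) = fit + predict; RAM(MB) = max(RAM(fit), RAM(predict))."""
    model, t_fit, ram_fit = timed(fit)
    _, t_pred, ram_pred = timed(predict, model)
    return t_fit + t_pred, max(ram_fit, ram_pred)


# Toy data, far smaller than the n = 1000000 rows used in the table.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20)

seconds, peak_mb = benchmark(
    fit=lambda: np.linalg.lstsq(X, y, rcond=None)[0],
    predict=lambda beta: X @ beta,
)
print(f"time: {seconds:.4f} s, peak RAM: {peak_mb:.2f} MB")
```

Swapping the two lambdas for any other fit / predict pair gives comparable numbers for that model.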

I've also added some preliminary results for N = 5000, P = 6000. [Figure: preliminary timing results]

Since timings are not good, I have submitted 2 bug reports to Scipy + PyTorch:

  1. EIGH very very slow --> suggesting an easy fix #9212 https://github.com/scipy/scipy/issues/9212
  2. SVD very very slow and GELS gives nans, -inf #11174 https://github.com/pytorch/pytorch/issues/11174

Help is really needed! Message me!


Key Methodologies and Aims

1. Embarrassingly Parallel For Loops

2. 50%+ Faster, 50%+ Leaner

3. Why is Statsmodels sometimes unbearably slow?

4. Deep Learning Drop In Modules with PyTorch

5. 20%+ Less Code, Cleaner Clearer Code

6. Accessing Old and Exciting New Algorithms


1. Embarrassingly Parallel For Loops

  • Including Memory Sharing, Memory Management
  • CUDA Parallelism through PyTorch & Numba
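
As a rough sketch of what an embarrassingly parallel loop looks like with Numba: each column below is processed independently, so the outer loop can be split across threads with `prange`. The function name `column_norms` is made up for this example and is not part of HyperLearn. The `try/except` is there only so the sketch still runs (serially) when Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:  # assumption: fall back to plain Python if Numba is missing
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True, nogil=True)
def column_norms(X):
    """Euclidean norm of every column; columns are independent of each other."""
    n, p = X.shape
    out = np.empty(p)
    for j in prange(p):  # iterations share no state, so they can run in parallel
        total = 0.0
        for i in range(n):
            total += X[i, j] * X[i, j]
        out[j] = np.sqrt(total)
    return out


X = np.random.default_rng(0).standard_normal((1000, 8))
norms = column_norms(X)
print(norms)
```

The result should match `np.linalg.norm(X, axis=0)`. The point is only the loop structure: no iteration reads what another iteration writes.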

2. 50%+ Faster, 50%+ Leaner

3. Why is Statsmodels sometimes unbearably slow?

  • Confidence, Prediction Intervals, Hypothesis Tests & Goodness of Fit tests for linear models are optimized.
  • Using Einstein Notation & Hadamard Products where possible.
  • Computing only what is necessary to compute (Diagonal of matrix and not entire matrix).
  • Fixing the flaws of Statsmodels on notation, speed, memory issues and storage of variables.
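
To illustrate the "diagonal only" and Einstein notation points, here is a minimal NumPy sketch for the leverage scores of a linear model, which are the diagonal of the hat matrix X (XᵀX)⁻¹ Xᵀ. The full hat matrix is n × n, but only its n diagonal entries are needed. The variable names are chosen for this example, and this is not taken from HyperLearn's source.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.standard_normal((n, p))

XtX_inv = np.linalg.inv(X.T @ X)  # p x p, cheap when p is small

# Wasteful: build the whole n x n hat matrix, then keep only its diagonal.
leverage_full = np.diag(X @ XtX_inv @ X.T)

# Leaner: row-wise dot products via einsum, never forming the n x n matrix.
# Equivalent to the Hadamard form ((X @ XtX_inv) * X).sum(axis=1).
leverage = np.einsum("ij,ij->i", X @ XtX_inv, X)

print(np.allclose(leverage, leverage_full))
```

The einsum version needs O(n·p) memory instead of O(n²), which is where most of the RAM saving comes from when n is large.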

4. Deep Learning Drop In Modules with PyTorch

  • Using PyTorch to create Scikit-Learn like drop in replacements.

5. 20%+ Less Code, Cleaner Clearer Code

  • Using Decorators & Functions where possible.
  • Intuitive Middle Level Function names like (isTensor, isIterable).
  • Handles Parallelism easily through hyperlearn.multiprocessing

6. Accessing Old and Exciting New Algorithms

  • Matrix Completion algorithms - Non Negative Least Squares, NNMF
  • Batch Similarity Latent Dirichlet Allocation (BS-LDA)
  • Correlation Regression
  • Feasible Generalized Least Squares FGLS
  • Outlier Tolerant Regression
  • Multidimensional Spline Regression
  • Generalized MICE (any model drop in replacement)
  • Using Uber's Pyro for Bayesian Deep Learning

Goals & Development Schedule

Will Focus on & why:

1. Singular Value Decomposition & QR Decomposition

* SVD/QR is the backbone for many algorithms including:
    * Linear & Ridge Regression (Regression)
    * Statistical Inference for Regression methods (Inference)
    * Principal Component Analysis (Dimensionality Reduction)
    * Linear & Quadratic Discriminant Analysis (Classification & Dimensionality Reduction)
    * Pseudoinverse, Truncated SVD (Linear Algebra)
    * Latent Semantic Indexing LSI (NLP)
    * (new methods) Correlation Regression, FGLS, Outlier Tolerant Regression, Generalized MICE, Splines (Regression)
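
To show why the SVD sits underneath several of these methods, here is a minimal NumPy sketch of ridge regression solved through a thin SVD. `ridge_svd` is a hypothetical helper written for this README, not HyperLearn's implementation. With X = U S Vᵀ, the ridge solution is V diag(s / (s² + alpha)) Uᵀ y, and alpha = 0 gives the pseudoinverse (ordinary least squares) solution when all singular values are nonzero.

```python
import numpy as np


def ridge_svd(X, y, alpha=1.0):
    """Ridge coefficients from the thin SVD of X (no intercept term)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = s / (s**2 + alpha)  # alpha = 0 reduces this to 1 / s
    return Vt.T @ (shrink * (U.T @ y))


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)

beta_ridge = ridge_svd(X, y, alpha=2.0)
beta_ols = ridge_svd(X, y, alpha=0.0)

# Cross-check against the normal equations and against lstsq.
beta_normal = np.linalg.solve(X.T @ X + 2.0 * np.eye(10), X.T @ y)
print(np.allclose(beta_ridge, beta_normal))
print(np.allclose(beta_ols, np.linalg.lstsq(X, y, rcond=None)[0]))
```

One SVD can be reused for many values of alpha, since only the `shrink` vector changes.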

On Licensing: HyperLearn is under a GNU v3 License. This means:

  1. Commercial use is restricted. Only software with 0 cost can be released, i.e. no closed source versions are allowed.
  2. Using HyperLearn must entail all of the code being available to everyone who uses your public software.
  3. HyperLearn is intended for academic, research and personal purposes. Any explicit commercialisation of the algorithms and anything inside HyperLearn is strictly prohibited.

HyperLearn promotes a free and just world. Hence, it is free to everyone, except for those who wish to commercialise on top of HyperLearn. Ongoing documentation: https://hyperlearn.readthedocs.io/en/latest/index.html

[As of 2020, HyperLearn's license has been changed to BSD 3]

Owner
Daniel Han-Chen
Fast energy efficient machine learning algorithms