Machine Learning Collection

Libraries, tools, recipes, sample code, and workshop content contributed by Microsoft for machine learning and deep learning.

LightGBM - A fast, distributed, high performance gradient boosting framework.
LightGBM benchmarking suite - Benchmark tools for LightGBM.
Explainable Boosting Machines - an interpretable model developed at Microsoft Research that uses bagging, gradient boosting, and automatic interaction detection to estimate generalized additive models.
Cyclic Boosting Machines - An explainable supervised machine learning algorithm specifically for predicting count-data, such as sales and demand.

AutoML

Neural Network Intelligence - An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
Archai - Reproducible Rapid Research for Neural Architecture Search (NAS).
FLAML - A fast and lightweight AutoML library.
Azure Automated Machine Learning - Automated Machine Learning for Tabular data (regression, classification and forecasting) by Azure Machine Learning
Cream - A collection of Microsoft NAS and Vision Transformer work.

Neural Network

PyMarlin - Lightweight Deep Learning Model Training library based on PyTorch.
bayesianize - A Bayesian neural network wrapper in PyTorch.
O-CNN - Octree-based convolutional neural networks for 3D shape analysis.
ResNet - deep residual network.
CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit.
InfiniBatch - Efficient, check-pointed data loading for deep learning with massive data sets.
Models under Hugging Face - Microsoft shares transformer models at Hugging Face. 51 pretrained models (as of June 28, 2021).
Muzic - Music Understanding and Generation with Artificial Intelligence.

Graph & Network

graspologic - utilities and algorithms designed for the processing and analysis of graphs with specialized graph statistical algorithms.
TF Graph Neural Network Samples - TensorFlow implementations of graph neural networks.
ptgnn - PyTorch Graph Neural Network Library
StemGNN - spectral temporal graph neural network (StemGNN) for multivariate time-series forecasting.
SPTAG - a distributed approximate nearest neighbor search (ANN) library.
DiskANN - Scalable graph based indices for approximate nearest neighbor search.

Vision

Microsoft Vision Model ResNet50 - a large pretrained vision ResNet-50 model trained on a search engine's web-scale image data.
Oscar - Object-Semantics Aligned Pre-training for Vision-Language Tasks.
TorchGeo - a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.
Swin Transformer - an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Time Series

luminol - anomaly detection and correlation library.
SR-CNN - Spectral Residual based anomaly detection algorithm, SR-CNN implementation.
Greykite - flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.
Temporal Cluster Matching for Change Detection of Structures from Satellite Imagery - An implementation of the temporal cluster matching method for detecting change in structure footprints from time series of remotely sensed imagery.
Microsoft Finance Time Series Forecasting Framework - a forecasting package that utilizes cutting-edge time series forecasting and parallelization on the cloud to produce accurate forecasts for financial data.
FOST - an easy-to-use tool for spatial-temporal forecasting.

NLP

T-ULRv2 - Turing multilingual language model.
Turing-NLG - Turing Natural Language Generation, 17 billion-parameter language model.
DeBERTa - Decoding-enhanced BERT with Disentangled Attention
UniLM - Unified Language Model Pre-training / Pre-training for NLP and Beyond
Unicoder - Unicoder model for understanding and generation.
NeuronBlocks - build your NLP DNN models like playing with Lego.
Multilingual Model Transfer - new deep learning models for bootstrapping language understanding models for languages with no labeled data using labeled data from other languages.
MT-DNN - multi-task deep neural networks for natural language understanding.
inmt - interactive neural machine translation-lite
OpenKP - automatically extracting keyphrases that are salient to the document meanings is an essential step in semantic document understanding.
DeText - a deep neural text understanding framework for ranking and classification tasks.
Genalog - an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.
FastFormers - highly efficient transformer models for NLU.
VERSEAGILITY - a Python-based toolkit to ramp up your custom natural language processing (NLP) task, allowing you to bring your own data and bring models into production. It is a central component of the Microsoft Data Science Toolkit.
DPU Utilities - Utilities used by the Deep Program Understanding team.
KEAR - Official code for achieving human parity on CommonsenseQA with External Attention.
Prompt Engine - A utility library for creating and maintaining prompts for Large Language Models.

Online Machine Learning

Vowpal Wabbit - fast, efficient, and flexible online machine learning techniques for reinforcement learning, supervised learning, and more.

Recommendation

Recommenders - examples and best practices for building recommendation systems (A2SVD, DKN, xDeepFM, LightGBM, LSTUR, NAML, NPA, NRMS, RLRMC, SAR, Vowpal Wabbit are invented/contributed by Microsoft).
GDMIX - A deep ranking personalization framework
rankerEval - A fast numpy-based implementation of ranking metrics for information retrieval and recommendation.

Distributed

DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
MMLSpark - machine learning library on Apache Spark.
photon-ml - a scalable machine learning library on Apache Spark.
TonY - framework to natively run deep learning frameworks on Apache Hadoop.
isolation-forest - A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.

Causal Inference

EconML - Python package for estimating heterogeneous treatment effects from observational data via machine learning.
DoWhy - Python library for causal inference that supports explicit modeling and testing of causal assumptions.

Responsible AI

InterpretML - a toolkit to help understand models and enable responsible machine learning.
- Interpret Community - extends interpret repo with additional interpretability techniques and utility functions.
- DiCE - diverse counterfactual explanations.
- Interpret-Text - state-of-the-art explainers for text-based ML models, with a visualization dashboard.
fairlearn - python package to assess and improve fairness of machine learning models.
LiFT - linkedin fairness toolkit.
RobustDG - Toolkit for building machine learning models that generalize to unseen domains and are robust to privacy and other attacks.
SHAP - a game theoretic approach to explain the output of any machine learning model (Scott Lundberg, Microsoft Research).
LIME - explaining the predictions of any machine learning classifier (Marco, Microsoft Research).
BackwardCompatibilityML - Project for open sourcing research efforts on Backward Compatibility in Machine Learning
confidential-ml-utils - Python utilities for training and deploying ML models against data you can't see.
presidio - context aware, pluggable and customizable data protection and anonymization service for text and images.
- Presidio-research - This package features data-science related tasks for developing new recognizers for Presidio.
Confidential ONNX Inference Server - An Open Enclave port of the ONNX inference server with data encryption and attestation capabilities to enable confidential inference on Azure Confidential Computing.
Responsible-AI-Widgets - responsible AI user interfaces for Fairlearn, interpret-community, and Error Analysis, as well as foundational building blocks that they rely on.
Error Analysis - A toolkit to help analyze and improve model accuracy.
Secure Data Sandbox - A toolkit for conducting machine learning trials against confidential data.
shrike - Python utilities to aid "compliant experiment" in Azure Machine Learning - training ML models without seeing the training data.
HAX Toolkit - The Human-AI eXperience (HAX) Toolkit is a set of practical tools for creating human-AI experiences with people in mind from the beginning.
GAM Changer - Edit machine learning models to reflect human knowledge and values.
AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.

Optimization

ONNXRuntime - cross-platform, high performance ML inference and training accelerator.
- ONNX Runtime for PyTorch - Accelerate PyTorch models with ONNX Runtime.
- ONNX Runtime Training Examples - examples for using onnx runtime for model training.
- ONNX Runtime Inference Examples - examples for using onnx runtime for model inference.
- ONNX Converter - common utilities for onnx converters.
- ONNX.js - run onnx models using javascript.
- ONNX.js Demo - demos for ONNX.js.
- Olive - a sequence of docker images that automates the process of ONNX model shipping.
Hummingbird - compiles trained ML models into tensor computations for faster inference.
EdgeML - provides code for machine learning algorithms for edge devices developed at Microsoft Research India.
DirectML - high-performance, hardware-accelerated DirectX 12 library for machine learning.
MMdnn - MMdnn is a set of tools to help users inter-operate among different deep learning frameworks. E.g. model conversion and visualization.
infinibatch - Efficient, check-pointed data loading for deep learning with massive data sets.
InferenceSchema - Schema decoration for inference code
nnfusion - flexible and efficient deep neural network compiler.
Accera - Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research.

Reinforcement Learning

AirSim - open source simulator for autonomous vehicles built on Unreal Engine / Unity, from Microsoft Research.
TextWorld - TextWorld is a sandbox learning environment for the training and evaluation of reinforcement learning (RL) agents on text-based games.
Moab - Project Moab, a new open-source balancing robot to help engineers and developers learn how to build real-world autonomous control systems with Project Bonsai.
MARO - multi-agent resource optimization (MARO) platform.
Training Data-Driven or Surrogate Simulators - build simulation from data for use in RL and Bonsai platform for machine teaching.
Bonsai - low code industrial machine teaching platform.
- Bonsai Python SDK - A python library for integrating data sources with Bonsai BRAIN.
SEGAR - Sandbox environment for generalizable agent research.

Security

counterfit - a CLI that provides a generic automation layer for assessing the security of ML models.
Federated Learning Simulation Framework - a flexible framework for running experiments with PyTorch models in a simulated Federated Learning (FL) environment.
FLUTE - a platform for conducting high-performance federated learning simulations.

Windows

Windows Machine Learning - Machine Learning on Windows.

Datasets

COCO Dataset - COCO is a large-scale object detection, segmentation, and captioning dataset.
MS MARCO - collection of datasets focused on deep learning in search.
InnerEye CreateDataset - InnerEye dataset creation tool for InnerEye-DeepLearning library. Transforms DICOM data into mask for training Deep Learning models.
Sepsis Cohort from MIMIC III - Sepsis cohort from MIMIC dataset.
MIND : Microsoft News Dataset - a large-scale dataset for news recommendation research.
Dataset for AI for Earth - AIForEarthDataSets is a collection of datasets for AI research.
ORBIT - a collection of videos of objects in clean and cluttered scenes recorded by people who are blind/low-vision on a mobile phone.
EcoFlows - Community-representation to collaborate on labelled AI data for ecological and agricultural scenarios in APAC, updated monthly.

Debug & Benchmark

tensorwatch - debugging, monitoring and visualization for Python machine learning and data science.
PYRIGHT - static type checker for Python.
Bench ML - Python library to benchmark popular pre-built cloud AI APIs.
debugpy - An implementation of the Debug Adapter Protocol for Python
kineto - A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters contributed by Azure AI Platform team.
SuperBenchmark - a benchmarking and diagnosis tool for AI infrastructure (software & hardware).
tempeh - tempeh is a framework to TEst Machine learning PErformance exHaustively which includes tracking memory usage and run time.

Pipeline

GitHub Actions - Automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub.
Azure Pipelines - Automate your builds and deployments with Pipelines so you spend less time with the nuts and bolts and more time being creative.
Dagli - framework for defining machine learning models, including feature generation and transformations, as a DAG.

Platform

AI for Earth API Platform - distributed infrastructure that provides secure, scalable, and customizable API hosting, designed to handle the needs of long-running/asynchronous machine learning model inference.
Open Platform for AI (OpenPAI) - resource scheduling and cluster management for AI.
- OpenPAI Runtime - Runtime for deep learning workload.
- OpenPAI Protocol - OpenPAI protocol enables job sharing and portability.
- Openpaimarketplace - A marketplace which stores examples and job templates of OpenPAI.
- OpenPAI FrameworkController - built to orchestrate all kinds of applications on Kubernetes by a single controller.
- HiveDScheduler - Kubernetes Scheduler for Deep Learning.
- OpenPAI JS SDK - The JavaScript SDK is designed to help OpenPAI developers offer a user-friendly experience.
- OpenPAI VS Code Client - Extension to connect OpenPAI clusters, submit AI jobs, simulate jobs locally, manage files, and so on.
MLOS - Data Science powered infrastructure and methodology to democratize and automate Performance Engineering.
Platform for Situated Intelligence - an open-source framework for multimodal, integrative AI.
Qlib - an AI-oriented quantitative investment platform.

Feature Engineering

Feast on Azure - Azure plugins for Feast (FEAture STore).
Feathr - An Enterprise-Grade, High Performance Feature Store.

Tagging

TagAnomaly - Anomaly detection analysis and labeling tool, specifically for multiple time series (one time series per category)
VoTT - Visual object tagging tool
Satellite imagery annotation tool - A lightweight web-interface for creating and sharing vector annotations over satellite/aerial imagery scenes.

Developer tool

Visual Studio Code - Code editor redefined and optimized for building and debugging modern web and cloud applications.
Gather - adds gather functionality in the Python language to the Jupyter Extension.
Pylance - an extension that works alongside Python in Visual Studio Code to provide performant language support.
Azure ML Snippets - VSCode snippets for Azure Machine Learning

Sample Code

Microsoft AI for Earth
- Shared utilities - A collection of utilities for working with Azure Machine Learning.
- acoustic-bird-detection - Tutorial: Accurate Bioacoustic Species Detection from Small Numbers of Training Clips Using the Biophony Model
- beluasound - Using machine learning to detect beluga whale calls in hydrophone recordings.
- arcticseals - detect & classify arctic seals in aerial imagery to understand how they’re adapting to a changing world.
- AIDE: Annotation Interface for Data-driven Ecology - Detecting and classifying wildlife in aerial imagery.
- Camera Trap Tool - tools for training and running detectors and classifiers for wildlife images collected from motion-triggered cameras.
- Land cover mapping the Orinoquía region - A tool for predicting land cover in the Orinoquía region of Colombia.
- Planetary Computer Hub - a development environment that makes our data and APIs accessible through familiar, open-source tools, and allows users to easily scale their analyses.
- Poultry barn mapping - code for detecting poultry barns from high-resolution aerial imagery and an accompanying dataset of predicted barns over the United States.
- Planetary Computer SDK for Python - A Python SDK for the Planetary Computer Hub.
- Species Classification - A tool for classifying species in images.
News Threads - The News Threads project analyzes news articles to help find similarities between news articles and trace news provenance across time.
InnerEye DeepLearning - Medical Imaging Deep Learning library to train and deploy models on Azure Machine Learning and Azure Stack
Deep Seismic - Deep Learning for Seismic Imaging and Interpretation
Multi-species bioacoustic classification - Multi-species bioacoustic classification using deep learning algorithms.
Nestle Acne Assessment - deep learning models to assess the acne severity level based on selfie images.
Visual Analogies - exploring the connections between artworks with deep "Visual Analogies".
Forecasting Best Practices - time series forecasting best practices & examples.
Computer Vision Recipes - best practices, code samples, and documentation for Computer Vision.
DeepSpeed Examples - Example models using DeepSpeed
A TALE OF THREE CITIES - Analyzing the safety (311) dataset published by Azure Open Datasets for Chicago, Boston and New York City using SparkR, SparkSQL, and Azure Databricks, with visualization using ggplot2 and leaflet.
Microsoft Health Intelligence Machine Learning Toolbox - Microsoft Health Intelligence Azure Machine Learning Toolbox.
Azure Machine Learning examples - official community-driven Azure Machine Learning examples, tested with GitHub Actions.
Azure Machine Learning R Template - patterns and examples for running R code with Azure Machine Learning.
Azure Machine Learning Python SDK notebooks - python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK.
Azure Machine Learning Gallery - this repo enables our growing community of developers and data scientists to share their machine learning pipelines, components, etc. to accelerate productivity in the machine learning lifecycle.
Azure Machine Learning previews - samples for preview features in Azure Machine Learning.
AzureML Designer Sample - samples of Azure Machine Learning designer.
Ship Detector - train an object detection model that can detect and locate ships in a satellite image.
Microsoft Solution Accelerators
- MLOps Solution Accelerator - this repository helps ML teams to accelerate their model deployment to production leveraging Azure.
- Anomaly Detection Solution Accelerator - implement Anomaly Detection which is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations.
- Classification Solution Accelerator - This is a classification solution accelerator to help you build and deploy a binary classification project.
Medical Imaging with Azure Machine Learning Demos - medical imaging demo repository.
Federated Learning in Azure ML - Examples and recipes around federated learning in Azure ML.
Responsible AI Workshop - a series of tutorials & walkthroughs to illustrate how to put responsible AI into practice.
GlobalMLBuildingFootprints - Worldwide building footprints derived from satellite imagery.
Genomics Data Analysis with Jupyter Notebooks on Azure - Jupyter Notebooks on Azure for Genomics Data Analysis.

Community

AI@Edge Community - find the resources you need to create solutions using intelligence at the edge through combinations of hardware, machine learning (ML), artificial intelligence (AI) and Microsoft Azure services.
Global AI Community - empowers developers who are passionate about AI to share knowledge through events and meetups.
Deep Learning Lab (Japan) - provides information on development cases and the latest technology trends related to deep learning.

Workshop

Workshop Wiki

Competition

2020 Machine Learning Security Evasion Competition - code samples for the 2020 Machine Learning Security Evasion Competition.

Book

PRML - Pattern Recognition and Machine Learning by Christopher Bishop
Foundations of Data Science - Basic Theory for Data Science.
Mastering Azure Machine Learning: Perform large-scale end-to-end advanced machine learning in the cloud with Microsoft Azure Machine Learning
Practical Automated Machine Learning on Azure: Using Azure Machine Learning to Quickly Build AI Solutions

Learning

Microsoft Learn - Learning content for Microsoft technologies
- Data Scientist, AI Engineer
Data Science for Manager - Generalization, Utility, and Experimentation: ML Concepts for Making Better Business Decisions
GitHub Learning Lab - learning content for GitHub technology.
Getting started with Python - Sample code for Channel 9 Python for Beginners course.
Get started with PyTorch - learn the fundamentals of deep learning with PyTorch.
Dev Intro to Data Science - In this 28-video series, you will learn important concepts and technologies to build your end-to-end machine learning applications on Azure.
Machine Learning for Beginners - A Curriculum - 12 weeks, 24 lessons, classic Machine Learning for all
Data Science for Beginners - A Curriculum - 10 Weeks, 20 Lessons, Data Science for All!
Artificial Intelligence for Beginners - 12 Weeks, 24 Lessons, AI for All!
Microsoft Cloud Workshop - Wide World Importers (WWI) delivers innovative solutions for manufacturers.

Blog, News & Webinar

channel9 - AI Show - videos for developers from people building Microsoft products and services.
Microsoft Open Source Blog - blog about Microsoft open source technology.
Microsoft Research Event, Conference & Webinars - Events, Conferences & Webinars by Microsoft Research.
Microsoft Innovation Tech Hub - AI project in Microsoft.
LinkedIn Engineering Blog - Blog by LinkedIn Engineering Team
AI System - system for AI Education Resource (Chinese).
AI Edu - AI education materials for Chinese students, teachers and IT professionals (Chinese).

---

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

