MCML

MCML is a toolkit for semi-supervised dimensionality reduction and quantitative analysis of Multi-Class, Multi-Label data. We demonstrate its use on single-cell datasets, though the method accepts any matrix as input.

MCML modules include the MCML and bMCML algorithms for dimensionality reduction. MCML tools include functions for quantitative analysis of inter- and intra-label distances between labeled groups, and for nearest-neighbor metrics, in the latent or ambient space. The modules are autoencoder-based neural networks with label-aware cost functions for weight optimization.

Briefly, MCML adapts the Neighborhood Component Analysis (NCA) algorithm to use multiple classes of labels for each observation (cell), so that observations sharing the same labels are embedded close to each other. This essentially optimizes the latent space for k-Nearest Neighbors (KNN) classification.
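
As a rough illustration of the idea (not MCML's actual implementation), the standard NCA objective gives each point i a softmax probability of picking point j as its neighbor, based on negative squared distance in the embedding, and rewards probability mass that lands on same-label neighbors. The sketch below computes that quantity for a single label class in plain Python. The function name nca_same_label_prob and the toy data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nca_same_label_prob(points, labels):
    """For each point, the NCA probability of picking a same-label neighbor.

    p_ij = exp(-||x_i - x_j||^2) / sum_{k != i} exp(-||x_i - x_k||^2)
    The returned value for point i is the sum of p_ij over all j != i
    with labels[j] == labels[i].
    """
    import math

    n = len(points)
    probs = []
    for i in range(n):
        # Unnormalized softmax weights over all other points (self excluded)
        weights = [
            0.0 if j == i else math.exp(-squared_distance(points[i], points[j]))
            for j in range(n)
        ]
        total = sum(weights)
        same = sum(w for j, w in enumerate(weights) if labels[j] == labels[i])
        probs.append(same / total)
    return probs

# Toy 2-D embedding: two tight, well-separated groups (hypothetical data)
points = [(0.0, 0.0), (0.1, 0.0), (3.0, 3.0), (3.1, 3.0)]
labels = ["A", "A", "B", "B"]
probs = nca_same_label_prob(points, labels)
```

Values close to 1 mean a point's soft neighbors almost all share its label. With several label classes per observation, one such term per class would be combined. How MCML weights them is not shown here.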

bMCML demonstrates targeted reconstruction error, which optimizes for recapitulation of intra-label distances (the pairwise distances between cells within the same label).
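
To make "intra-label distances" concrete, the sketch below collects the pairwise Euclidean distances between observations that share a label. It is a plain-Python illustration on hypothetical toy data, not the bMCML loss itself, and intra_label_distances is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def intra_label_distances(points, labels):
    """Map each label to the pairwise distances among its own members."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: [euclidean(p, q) for p, q in combinations(members, 2)]
        for label, members in groups.items()
    }

# Hypothetical toy data: two observations labeled "F", three labeled "M"
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (1.0, 2.0), (1.0, 4.0)]
labels = ["F", "F", "M", "M", "M"]
dists = intra_label_distances(points, labels)
```

An objective that recapitulates these distances would compare them between the input space and the latent space, for example by penalizing their differences.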

The tools module includes functions for inter- and intra-label distance calculations, as well as metrics on the labels of the k nearest neighbors of each observation. These can be computed on any latent or ambient space (matrix) input.
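
As one example of a neighbor-label metric, the sketch below computes, for each observation, the fraction of its k nearest neighbors that carry the same label. It is a dependency-free illustration using Euclidean distance on hypothetical toy data. The function name is made up, and the functions in tools may use a different distance metric or return format.

```python
def same_label_neighbor_fraction(points, labels, k):
    """Fraction of each point's k nearest neighbors that share its label."""
    fractions = []
    for i, p in enumerate(points):
        # Rank all other points by squared Euclidean distance to p
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])),
        )
        nearest = others[:k]
        matches = sum(labels[j] == labels[i] for j in nearest)
        fractions.append(matches / k)
    return fractions

# Hypothetical 1-D data: three "A" points close together, two "B" points far away
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels = ["A", "A", "A", "B", "B"]
fractions = same_label_neighbor_fraction(points, labels, k=2)
```

Here each "B" point has only one other "B" point available, so at most half of its two nearest neighbors can match.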

Requirements

You need Python 3.6 or later to run MCML. You can have multiple Python versions (2.x and 3.x) installed on the same system without problems.

In Ubuntu, Mint and Debian you can install Python 3 like this:

$ sudo apt-get install python3 python3-pip

For other Linux distributions, macOS and Windows, packages are available at

https://www.python.org/getit/

Quick start

MCML can be installed using pip:

$ python3 -m pip install -U MCML

If you want to run the latest version of the code, you can install from git:

$ python3 -m pip install -U git+https://github.com/pachterlab/MCML.git

Examples

Example data download:

$ wget --quiet https://caltech.box.com/shared/static/i66kelel9ouep3yw8bn2duudkqey190j
$ mv i66kelel9ouep3yw8bn2duudkqey190j mat.mtx
$ wget --quiet https://caltech.box.com/shared/static/dcmr36vmsxgcwneh0attqt0z6qm6vpg6
$ mv dcmr36vmsxgcwneh0attqt0z6qm6vpg6 metadata.csv

Extract matrix (obs x features) and labels for each obs:

>>> import pandas as pd
>>> import scipy.io as sio
>>> import numpy as np

>>> mat = sio.mmread('mat.mtx') #Is a centered and scaled matrix (scaling input is optional)
>>> mat.shape
(3850, 1999)

>>> meta = pd.read_csv('metadata.csv')
>>> meta.head()
 Unnamed: 0          sample_name  smartseq_cluster_id  smartseq_cluster  ... n_genes percent_mito pass_count_filter  pass_mito_filter
0  SM-GE4R2_S062_E1-50  SM-GE4R2_S062_E1-50                   46   Nr5a1_9|11 Rorb  ...    9772          0.0              True              True
1  SM-GE4SI_S356_E1-50  SM-GE4SI_S356_E1-50                   46   Nr5a1_9|11 Rorb  ...    8253          0.0              True              True
2  SM-GE4SI_S172_E1-50  SM-GE4SI_S172_E1-50                   46   Nr5a1_9|11 Rorb  ...    9394          0.0              True              True
3   LS-15034_S07_E1-50   LS-15034_S07_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10643          0.0              True              True
4   LS-15034_S28_E1-50   LS-15034_S28_E1-50                   42  Nr5a1_4|7 Glipr1  ...   10550          0.0              True              True

>>> cellTypes = list(meta.smartseq_cluster)
>>> sexLabels = list(meta.sex_label)
>>> len(sexLabels)
3850



To run the MCML algorithm for dimensionality reduction (Python 3):

>>> from MCML.modules import MCML, bMCML

>>> mcml = MCML(n_latent = 50, epochs = 100) #Initialize MCML class

>>> latentMCML = mcml.fit(mat, np.array([cellTypes,sexLabels]) , fracNCA = 0.8 , silent = True) #Run MCML
>>> latentMCML.shape
(3850, 50)

This incorporates both the cell type and sex labels into the latent space construction. Use plotLosses() to view the loss function components over the training epochs.

>>> mcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs



To run the bMCML algorithm for dimensionality reduction (Python 3):

>>> bmcml = bMCML(n_latent = 50, epochs = 100) #Initialize bMCML class


>>> latentbMCML = bmcml.fit(mat, np.array(cellTypes), np.array(sexLabels), silent=True) #Run bMCML
>>> latentbMCML.shape
(3850, 50)

>>> bmcml.plotLosses(figsize=(10,3),axisFontSize=10,tickFontSize=8) #Plot loss over epochs

Here bMCML optimizes for the intra-label distances of the sex labels, i.e., the pairwise distances between cells of the same sex within each cell type.

For both bMCML and MCML objects, fit() can be replaced with trainTest() to train the algorithm on a subset of the full data and apply the learned weights to the remaining test data. This offers a way to assess overfitting.
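
The held-out evaluation idea can be sketched independently of MCML: shuffle the observation indices, train on one part, and evaluate on the rest. The helper below is hypothetical and does not reflect the signature or internals of trainTest(). Use help(MCML) to see the actual arguments.

```python
import random

def split_indices(n_obs, train_frac=0.8, seed=0):
    """Return (train, test) index lists covering range(n_obs) without overlap."""
    rng = random.Random(seed)  # Seeded so the split is reproducible
    indices = list(range(n_obs))
    rng.shuffle(indices)
    n_train = int(round(train_frac * n_obs))
    return sorted(indices[:n_train]), sorted(indices[n_train:])

# 3850 matches the number of cells in the example data above
train_idx, test_idx = split_indices(3850, train_frac=0.8)
```

A large gap between the training and test losses, or between neighbor metrics computed on the two subsets, suggests overfitting.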



To use the metrics available in tools:

>>> from MCML import tools as tl

#Pairwise distances between centroids of cells in each label
>>> cDists = tl.getCentroidDists(mat, np.array(cellTypes)) 
>>> len(cDists)
784

#Avg pairwise distances between cells of *both* sexes, for each cell type
>>> interDists = tl.getInterVar(mat, np.array(cellTypes), np.array(sexLabels))  
>>> len(interDists)
27

#Avg pairwise distances between cells of the *same* sex, for each cell type
>>> intraDists = tl.getIntraVar(mat, np.array(cellTypes), np.array(sexLabels)) 
>>> len(intraDists)
53

#Fraction of neighbors for each cell with same label as cell itself (also returns which labels neighbors have)
>>> neighbor_fracs, which_labels = tl.frac_unique_neighbors(mat, np.array(cellTypes), metric = 1,neighbors = 30)

#Get nearest neighbors for any embedding
>>> orig_neigh = tl.getNeighbors(mat, n_neigh = 15, p=1)
>>> latent_neigh = tl.getNeighbors(latentMCML, n_neigh = 15, p=1)

#Get Jaccard distance between latent and ambient nearest neighbors
>>> jac_dists = tl.getJaccard(orig_neigh, latent_neigh)
>>> len(jac_dists)
3850
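
For intuition, the Jaccard distance between two neighbor sets A and B is 1 - |A ∩ B| / |A ∪ B|, so 0 means identical neighbors and 1 means none are shared. The sketch below computes it per observation from two lists of neighbor indices. It is a plain-Python illustration with made-up indices, and tl.getJaccard may differ in details such as its input format.

```python
def jaccard_distances(neigh_a, neigh_b):
    """Per-observation Jaccard distance between two lists of neighbor indices."""
    dists = []
    for a, b in zip(neigh_a, neigh_b):
        set_a, set_b = set(a), set(b)
        union = set_a | set_b
        if not union:
            # Two empty neighbor sets are treated as identical
            dists.append(0.0)
        else:
            dists.append(1.0 - len(set_a & set_b) / len(union))
    return dists

# Hypothetical neighbor indices for three observations in two spaces
ambient = [[1, 2, 3], [0, 2, 3], [4, 5, 6]]
latent = [[1, 2, 3], [0, 2, 4], [7, 8, 9]]
jac = jaccard_distances(ambient, latent)
```

Low distances across observations indicate that the embedding preserves local neighborhoods from the ambient space.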



To see further details of all inputs and outputs for all functions use:

>>> help(MCML)
>>> help(bMCML)
>>> help(tl)

License

MCML is licensed under the terms of the BSD License (see the file LICENSE).

Owner
Pachter Lab