Tandem Mass Spectrum Prediction with Graph Transformers

Overview

MassFormer

This is the original implementation of MassFormer, a graph transformer for small molecule MS/MS prediction. Check out the preprint on arxiv.

Setting Up Environment

We recommend using conda. Three conda yml files are provided in the env/ directory (cpu.yml, cu101.yml, cu102.yml), providing different pytorch installation options (CPU-only, CUDA 10.1, CUDA 10.2). They can be trivially modified to support other versions of CUDA.

To set up an environment, run the command conda env create -f ${CONDA_YAML}, where ${CONDA_YAML} is the path to the desired yaml file.

Downloading NIST Data

Note: this step requires a Windows System or Virtual Machine

The NIST 2020 LC-MS/MS dataset can be purchased from an authorized distributor. The spectra and associated compounds can be exported to MSP/MOL format using the included lib2nist software. There is a single MSP file which contains all of the mass spectra, and multiple MOL files which include the molecular structure information for each spectrum (linked by ID). We've included a screenshot describing the lib2nist export settings.

Alt text

There is a minor bug in the export software that sometimes results in errors when parsing the MOL files. To fix this bug, run the script python mol_fix.py ${MOL_DIR}, where ${MOL_DIR} is a path to the NIST export directory with MOL files.

Downloading Massbank Data

The MassBank of North America (MB-NA) data is in MSP format, with the chemical information provided in the form of a SMILES string (as opposed to a MOL file). It can be downloaded from the MassBank website, under the tab "LS-MS/MS Spectra".

Exporting and Preparing Data

We recommend creating a directory called data/ and placing the downloaded and uncompressed data into a folder data/raw/.

To parse both of the datasets, run parse_and_export.py. Then, to prepare the data for model training, run prepare_data.py. By default the processed data will end up in data/proc/.

Setting Up Weights and Biases

Our implementation uses Weights and Biases (W&B) for logging and visualization. For full functionality, you must set up a free W&B account.

Training Models

A default config file is provided in "config/template.yml". This trains a MassFormer model on the NIST HCD spectra. Our experiments used systems with 32GB RAM, 1 Nvidia RTX 2080 (11GB VRAM), and 6 CPU cores.

The config/ directory has a template config file template.yml and 8 files corresponding to the experiments from the paper. The template config can be modified to train models of your choosing.

To train a template model without W&B with only CPU, run python runner.py -w False -d -1

To train a template model with W&B on CUDA device 0, run python runner.py -w True -d 0

Reproducing Tables

To reproduce a model from one of the experiments in Table 2 or Table 3 from the paper, run python runner.py -w True -d 0 -c ${CONFIG_YAML} -n 5 -i ${RUN_ID}, where ${CONFIG_YAML} refers to a specific yaml file in the config/ directory and ${RUN_ID} refers to an arbitrary but unique integer ID.

Reproducing Visualizations

The explain.py script can be used to reproduce the visualizations in the paper, but requires a trained model saved on W&B (i.e. by running a script from the previous section).

To reproduce a visualization from Figures 2,3,4,5, run python explain.py ${WANDB_RUN_ID} --wandb_mode=online, where ${WANDB_RUN_ID} is the unique W&B run id of the desired model's completed training script. The figues will be uploaded as PNG files to W&B.

Reproducing Sweeps

The W&B sweep config files that were used to select model hyperparameters can be found in the sweeps/ directory. They can be initialized using wandb sweep ${PATH_TO_SWEEP}.

Owner
Röst Lab
Röst lab at U of T -- join us at https://gitter.im/Roestlab/Lobby
Röst Lab
Automatically visualize your pandas dataframe via a single print! 📊 💡

A Python API for Intelligent Visual Discovery Lux is a Python library that facilitate fast and easy data exploration by automating the visualization a

Lux 4.3k Dec 28, 2022
Python+Numpy+OpenGL: fast, scalable and beautiful scientific visualization

Python+Numpy+OpenGL: fast, scalable and beautiful scientific visualization

Glumpy 1.1k Jan 05, 2023
Dimensionality reduction in very large datasets using Siamese Networks

ivis Implementation of the ivis algorithm as described in the paper Structure-preserving visualisation of high dimensional single-cell datasets. Ivis

beringresearch 284 Jan 01, 2023
This is a super simple visualization toolbox (script) for transformer attention visualization ✌

Trans_attention_vis This is a super simple visualization toolbox (script) for transformer attention visualization ✌ 1. How to prepare your attention m

Mingyu Wang 3 Jul 09, 2022
GD-UltraHack - A Mod Menu for Geometry Dash. Specifically a MegahackV5 clone in Python. Only for Windows

GD UltraHack: The Mod Menu that Nobody asked for. This is a mod menu for the gam

zeo 1 Jan 05, 2022
A concise grammar of interactive graphics, built on Vega.

Vega-Lite Vega-Lite provides a higher-level grammar for visual analysis that generates complete Vega specifications. You can find more details, docume

Vega 4k Jan 08, 2023
A python script editor for napari based on PyQode.

napari-script-editor A python script editor for napari based on PyQode. This napari plugin was generated with Cookiecutter using with @napari's cookie

Robert Haase 9 Sep 20, 2022
Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.

physt P(i/y)thon h(i/y)stograms. Inspired (and based on) numpy.histogram, but designed for humans(TM) on steroids(TM). The goal is to unify different

Jan Pipek 120 Dec 08, 2022
An open-source plotting library for statistical data.

Lets-Plot Lets-Plot is an open-source plotting library for statistical data. It is implemented using the Kotlin programming language. The design of Le

JetBrains 820 Jan 06, 2023
GUI for visualization and interactive editing of SMPL-family body models ie. SMPL, SMPL-X, MANO, FLAME.

Body Model Visualizer Introduction This is a simple Open3D-based GUI for SMPL-family body models. This GUI lets you play with the shape, expression, a

Muhammed Kocabas 207 Jan 01, 2023
mysql relation charts

sqlcharts 自动生成数据库关联关系图 复制settings.py.example 重命名为settings.py 将数据库配置信息填入settings.DATABASE,目前支持mysql和postgresql 执行 python build.py -b,-b是读取数据库表结构,如果只更新匹

6 Aug 22, 2022
SummVis is an interactive visualization tool for text summarization.

SummVis is an interactive visualization tool for analyzing abstractive summarization model outputs and datasets.

Robustness Gym 246 Dec 08, 2022
Yata is a fast, simple and easy Data Visulaization tool, running on python dash

Yata is a fast, simple and easy Data Visulaization tool, running on python dash. The main goal of Yata is to provide a easy way for persons with little programming knowledge to visualize their data e

Cybercreek 3 Jun 28, 2021
Easily configurable, chart dashboards from any arbitrary API endpoint. JSON config only

Flask JSONDash Easily configurable, chart dashboards from any arbitrary API endpoint. JSON config only. Ready to go. This project is a flask blueprint

Chris Tabor 3.3k Dec 31, 2022
View part of your screen in grayscale or simulated color vision deficiency.

monolens View part of your screen in grayscale or filtered to simulate color vision deficiency. Watch the demo on YouTube. Install with pip install mo

Hans Dembinski 31 Oct 11, 2022
Plotting library for IPython/Jupyter notebooks

bqplot 2-D plotting library for Project Jupyter Introduction bqplot is a 2-D visualization system for Jupyter, based on the constructs of the Grammar

3.4k Dec 30, 2022
Functions for easily making publication-quality figures with matplotlib.

Data-viz utils 📈 Functions for data visualization in matplotlib 📚 API Can be installed using pip install dvu and then imported with import dvu. You

Chandan Singh 16 Sep 15, 2022
A tool for automatically generating 3D printable STLs from freely available lidar scan data.

mini-map-maker A tool for automatically generating 3D printable STLs from freely available lidar scan data. Screenshots Tutorial To use this script, g

Mike Abbott 51 Nov 06, 2022
Python & Julia port of codes in excellent R books

X4DS This repo is a collection of Python & Julia port of codes in the following excellent R books: An Introduction to Statistical Learning (ISLR) Stat

Gitony 5 Jun 21, 2022
Learning Convolutional Neural Networks with Interactive Visualization.

CNN Explainer An interactive visualization system designed to help non-experts learn about Convolutional Neural Networks (CNNs) For more information,

Polo Club of Data Science 6.3k Jan 01, 2023