Kaggle | 9th place (part of) solution for the Bristol-Myers Squibb – Molecular Translation challenge

Last update: Nov 30, 2022

Overview

This repository contains part of the 9th place solution for the Bristol-Myers Squibb – Molecular Translation challenge, in which images of chemical structures are translated into InChI (International Chemical Identifier) strings.

This repo is partially based on the following resources:

Y.Nakama's tokenization
Heng's transformer decoder
Sam Stainsby's external images creation updated by ZFTurbo

Requirements

install and activate the conda environment
download and extract the data into /data/bms/
extract and move sample_submission_with_length.csv.gz into /data/bms/
tokenize training inputs: python datasets/prepocess2.py
if you want to use pseudo labeling, execute: python datasets/pseudo_prepocess2.py your_submission_file.csv
if you want to use external images, you can create them with the following commands:

python r09_create_images_from_allowed_inchi.py
python datasets/extra_prepocess2.py

Finally, install apex, which is needed for the mixed-precision options (--mixed-precision, --amp-level) used below.
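The tokenization step (python datasets/prepocess2.py) prepares the InChI strings as token sequences for the decoder. The following is only a minimal sketch of that idea, not the script's actual code: the regular expression, the names TOKEN_RE, tokenize_inchi and build_vocab, and the choice to strip the "InChI=1S/" prefix are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Hypothetical token pattern (an assumption, not the repo's rule):
# layer prefixes such as "/c" or "/h", element symbols, runs of digits,
# and finally any other single non-space character.
TOKEN_RE = re.compile(r"/[a-z]|[A-Z][a-z]?|\d+|\S")

INCHI_PREFIX = "InChI=1S/"


def tokenize_inchi(inchi: str) -> List[str]:
    """Split an InChI string into tokens after dropping the standard prefix."""
    body = inchi[len(INCHI_PREFIX):] if inchi.startswith(INCHI_PREFIX) else inchi
    return TOKEN_RE.findall(body)


def build_vocab(inchis: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for inchi in inchis:
        for token in tokenize_inchi(inchi):
            vocab.setdefault(token, len(vocab))
    return vocab


if __name__ == "__main__":
    example = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    print(tokenize_inchi(example))
    print(build_vocab([example]))
```

Because the last alternative in the pattern matches any remaining non-space character, joining the tokens reproduces the InChI body exactly, which makes it easy to check that no character is lost.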

Training

This repo supports training any ViT/Swin/CaiT transformer model from timm as the encoder, together with the fairseq transformer decoder.

Here is an example configuration that trains a Swin swin_base_patch4_window12_384 encoder with a 12-layer, 16-head fairseq decoder:

python -m torch.distributed.launch --nproc_per_node=N train.py --logdir=logdir/ \
    --pipeline --train-batch-size=50 --valid-batch-size=128 --dataload-workers-nums=10 --mixed-precision --amp-level=O2  \
    --aug-rotate90-p=0.5 --aug-crop-p=0.5 --aug-noise-p=0.9 --label-smoothing=0.1 \
    --encoder-lr=1e-3 --decoder-lr=1e-3 --lr-step-ratio=0.3 --lr-policy=step --optim=adam --lr-warmup-steps=1000 --max-epochs=20 --weight-decay=0 --clip-grad-norm=1 \
    --verbose --image-size=384 --model=swin_base_patch4_window12_384 --loss=ce --embed-dim=1024 --num-head=16 --num-layer=12 \
    --fold=0 --train-dataset-size=0 --valid-dataset-size=65536 --valid-dataset-non-sorted
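The command above combines --loss=ce with --label-smoothing=0.1. As a rough illustration of what label smoothing does to a cross-entropy loss, here is a pure-Python sketch of the common formulation that mixes the target's negative log-likelihood with a uniform term over all classes. Which exact variant train.py uses is not stated here, so treat the formula and the function name label_smoothed_ce as assumptions.

```python
import math
from typing import List


def label_smoothed_ce(logits: List[float], target: int, smoothing: float = 0.1) -> float:
    """Cross-entropy for one prediction with uniform label smoothing.

    loss = (1 - smoothing) * NLL(target) + smoothing * mean_k(-log p_k)
    """
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]
    uniform_term = -sum(log_probs) / len(log_probs)
    return (1.0 - smoothing) * nll + smoothing * uniform_term


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]
    print(label_smoothed_ce(logits, target=0, smoothing=0.0))
    print(label_smoothed_ce(logits, target=0, smoothing=0.1))
```

With smoothing set to 0 this reduces to ordinary cross-entropy; a positive value penalizes over-confident predictions on the target token.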

For pseudo labeling, use --pseudo=pseudo.pkl. To subsample the pseudo dataset, add --pseudo-dataset-size=448000. To use external images, add --extra (optionally with --extra-dataset-size=448000 to subsample them).

After training, you can also apply Stochastic Weight Averaging (SWA) over the last few checkpoints, which gives a boost of around 0.02:

python swa.py --image-size=384 --input logdir/epoch-17.pth,logdir/epoch-18.pth,logdir/epoch-19.pth,logdir/epoch-20.pth
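Conceptually, averaging several checkpoints means taking the element-wise mean of every parameter across them. The sketch below shows that step with plain Python lists standing in for weight tensors. It is an illustration under that assumption only: the function name average_checkpoints is hypothetical, and swa.py may do more than this (the --image-size option suggests it does).

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of each named parameter across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged


if __name__ == "__main__":
    epoch_19 = {"decoder.weight": [1.0, 3.0], "decoder.bias": [0.0]}
    epoch_20 = {"decoder.weight": [3.0, 5.0], "decoder.bias": [1.0]}
    print(average_checkpoints([epoch_19, epoch_20]))
```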

Inference

Evaluation:

python -m torch.distributed.launch --nproc_per_node=N eval.py --mixed-precision --batch-size=128 swa_model.pth
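The competition scored submissions by the mean Levenshtein (edit) distance between predicted and ground-truth InChI strings. As a reference for interpreting validation numbers, here is a small self-contained sketch of that metric. Whether eval.py reports exactly this quantity is an assumption, and the function names are illustrative.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def mean_levenshtein(predictions: List[str], targets: List[str]) -> float:
    """Average edit distance over paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one pair")
    return sum(levenshtein(p, t) for p, t in zip(predictions, targets)) / len(targets)


if __name__ == "__main__":
    predicted = ["InChI=1S/CH4/h1H4", "InChI=1S/C2H6/c1-2/h1-2H3"]
    truth = ["InChI=1S/CH4/h1H4", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"]
    print(mean_levenshtein(predicted, truth))
```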

Inference:

python -m torch.distributed.launch --nproc_per_node=N inference.py --mixed-precision --batch-size=128 swa_model.pth

Normalization with RDKit:

./normalize_inchis.sh submission.csv

Owner

Erdene-Ochir Tuguldur
