A benchmark dataset for emulating atmospheric radiative transfer in weather and climate models with machine learning (NeurIPS 2021 Datasets and Benchmarks Track)

Overview

ClimART - A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models

Python PyTorch CC BY 4.0

Official PyTorch Implementation

Using deep learning to optimise radiative transfer calculations.

Preliminary paper to appear at NeurIPS 2021 Datasets Track: https://openreview.net/forum?id=FZBtIpEAb5J

Abstract: Numerical simulations of Earth's weather and climate require substantial amounts of computation. This has led to a growing interest in replacing subroutines that explicitly compute physical processes with approximate machine learning (ML) methods that are fast at inference time. Within weather and climate models, atmospheric radiative transfer (RT) calculations are especially expensive. This has made them a popular target for neural network-based emulators. However, prior work is hard to compare due to the lack of a comprehensive dataset and standardized best practices for ML benchmarking. To fill this gap, we build a large dataset, ClimART, with more than 10 million samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. ClimART poses several methodological challenges for the ML community, such as multiple out-of-distribution test sets, underlying domain physics, and a trade-off between accuracy and inference speed. We also present several novel baselines that indicate shortcomings of datasets and network architectures used in prior work.

Contact: Venkatesh Ramesh (venka97 at gmail) or Salva Rühling Cachay (salvaruehling at gmail).

Overview:

  • climart/: Package with the main code, baselines and ML training logic.
  • notebooks/: Notebooks for visualization of data.
  • analysis/: Scripts to create visualization of the results (requires logging).
  • scripts/: Scripts to train and evaluate models, and to download the whole ClimART dataset.

Getting Started

Requirements

  • Linux and Windows are supported, but we recommend Linux for performance and compatibility reasons.
  • NVIDIA GPUs with at least 8 GB of memory and system with 12 GB RAM (More RAM is required if training with --load_train_into_mem option which allows for faster training). We have done all testing and development using NVIDIA V100 GPUs.
  • 64-bit Python >=3.7 and PyTorch >=1.8.1. See https://pytorch.org/ for PyTorch install instructions.
  • Python libraries mentioned in ``env.yml`` file, see Getting Started (Need to have miniconda/conda installed).

Downloading the ClimART Dataset

By default, only a subset of CLimART is downloaded. To download the train/val/test years you want, please change the loop in ``data_download.sh.`` appropriately. To download the whole ClimART dataset, you can simply run

bash scripts/download_climart_full.sh 

conda env create -f env.yml   # create new environment will all dependencies
conda activate climart  # activate the environment called 'climart'
bash data_download.sh  # download the dataset (or a subset of it, see above)
# For one of {CNN, GraphNet, GCN, MLP}, run the model with its lowercase name with the following commmand:
bash scripts/train_<model-name>.sh

Dataset Structure

To avoid storage redundancy, we store one single input array for both pristine- and clear-sky conditions. The dimensions of ClimART’s input arrays are:

  • layers: (N, 49, D-lay)
  • levels: (N, 50, 4)
  • globals: (N, 82)

where N is the data dimension (i.e. the number of examples of a specific year, or, during training, of a batch), 49 and 50 are the number of layers and levels in a column respectively. Dlay, 4, 82 is the number of features/channels for layers, levels, globals respectively.

For pristine-sky Dlay = 14, while for clear-sky Dlay = 45, since it contains extra aerosol related variables. The array for pristine-sky conditions can be easily accessed by slicing the first 14 features out of the stored array, e.g.: pristine_array = layers_array[:, :, : 14]

The complete list of variables in the dataset is as follows:

Variables List

Training Options

--exp_type: "pristine" or "clear_sky" for training on the respective atmospheric conditions.
--target_type: "longwave" (thermal) or "shortwave" (solar) for training on the respective radiation type targets.
--target_variable: "Fluxes" or "Heating-rate" for training on profiles of fluxes or heating rates.
--model: ML model architecture to select for training (MLP, GCN, GN, CNN)
--workers: The number of workers to use for dataloading/multi-processing.
--device: "cuda" or "cpu" to use GPUs or not.
--load_train_into_mem: Whether to load the training data into memory (can speed up training)
--load_val_into_mem: Whether to load the validation data into memory (can speed up training)
--lr: The learning rate to use for training.
--epochs: Number of epochs to train the model for.
--optim: The choice of optimizer to use (e.g. Adam)
--scheduler: The learning rate scheduler used for training (expdecay, reducelronplateau, steplr, cosine).
--weight_decay: Weight decay to use for the optimization process.
--batch_size: Batch size for training.
--act: Activation function (e.g. ReLU, GeLU, ...).
--hidden_dims: The hidden dimensionalities to use for the model (e.g. 128 128).
--dropout: Dropout rate to use for parameters.
--loss: Loss function to train the model with (MSE recommended).
--in_normalize: Select how to normalize the data (Z, min_max, None). Z-scaling is recommended.
--net_norm: Normalization scheme to use in the model (batch_norm, layer_norm, instance_norm)
--gradient_clipping: If "norm", the L2-norm of the parameters is clipped the value of --clip. Otherwise no clipping.
--clip: Value to clip the gradient to while training.
--val_metric: Which metric to use for saving the 'best' model based on validation set. Default: "RMSE"
--gap: Use global average pooling in-place of MLP to get output (CNN only).
--learn_edge_structure: If --model=='GCN': Whether to use a L-GCN (if set) with learnable adjacency matrix, or a GCN.
--train_years: The years to select for training the data. (Either individual years 1997+1991 or range 1991-1996)
--validation_years: The years to select for validating the data. Recommended: "2005" or "2005-06" 
--test_ood_1991: Whether to load and test on OOD data from 1991 (Mt. Pinatubo; especially challenging for clear-sky conditions)
--test_ood_historic: Whether to load and test on historic/pre-industrial OOD data from 1850-52.
--test_ood_future: Whether to load and test on future OOD data from 2097-99 (under a changing climate/radiative forcing)
--wandb_model: If "online", Weights&Biases logging. If "disabled" no logging.
--expID: A unique ID for the experiment if using logging.

Reproducing our Baselines

To reproduce our paper results (for seed = 7) you may run the following commands in a shell.

CNN

python main.py --model "CNN" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "none" --dropout 0.0 --act "GELU" --epochs 100 \
  --gap --gradient_clipping "norm" --clip 1.0 \
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled

MLP

python main.py --model "MLP" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "layer_norm" --dropout 0.0 --act "GELU" --epochs 100 \
  --gradient_clipping "norm" --clip 1.0 --hidden_dims 512 256 256 \
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled

GCN

python main.py --model "GCN+Readout" --exp_type "pristine" --target_type "shortwave" --workers 6 --seed 7 \
  --batch_size 128 --lr 2e-4 --optim Adam --weight_decay 1e-6 --scheduler "expdecay" \
  --in_normalize "Z" --net_norm "layer_norm" --dropout 0.0 --act "GELU" --epochs 100 \
  --preprocessing "mlp_projection" --projector_net_normalization "layer_norm" --graph_pooling "mean"\
  --residual --improved_self_loops \
  --gradient_clipping "norm" --clip 1.0 --hidden_dims 128 128 128 \  
  --train_years "1990+1999+2003" --validation_years "2005" \
  --wandb_mode disabled

Logging

Currently, logging is disabled by default. However, the user may use wandb to log the experiments by passing the argument --wandb_mode=online

Notebooks

There are some jupyter notebooks in the notebooks folder which we used for plotting, benchmarking etc. You may go through them to visualize the results/benchmark the models.

License:

This work is made available under Attribution 4.0 International (CC BY 4.0) license. CC BY 4.0

Development

This repository is currently under active development and you may encounter bugs with some functionality. Any feedback, extensions & suggestions are welcome!

Citation

If you find ClimART or this repository helpful, feel free to cite our publication:

@inproceedings{cachay2021climart,
    title={{ClimART}: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models},
    author={Salva R{\"u}hling Cachay and Venkatesh Ramesh and Jason N. S. Cole and Howard Barker and David Rolnick},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
    year={2021},
    url={https://openreview.net/forum?id=FZBtIpEAb5J}
}
Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper.

Fully Adaptive Bayesian Algorithm for Data Analysis FABADA FABADA is a novel non-parametric noise reduction technique which arise from the point of vi

18 Oct 20, 2022
TorchOk - The toolkit for fast Deep Learning experiments in Computer Vision

TorchOk - The toolkit for fast Deep Learning experiments in Computer Vision

52 Dec 23, 2022
Code and models for "Rethinking Deep Image Prior for Denoising" (ICCV 2021)

DIP-denosing This is a code repo for Rethinking Deep Image Prior for Denoising (ICCV 2021). Addressing the relationship between Deep image prior and e

Computer Vision Lab. @ GIST 36 Dec 29, 2022
Edison AT is software Depression Assistant personal.

Edison AT Edison AT is software / program Depression Assistant personal. Feature: Analyze emotional real-time from face. Audio Edison(Comingsoon relea

Ananda Rauf 2 Apr 24, 2022
Realtime YOLO Monster Detection With Non Maximum Supression

Realtime-YOLO-Monster-Detection-With-Non-Maximum-Supression Table of Contents In

5 Oct 07, 2022
In the AI for TSP competition we try to solve optimization problems using machine learning.

AI for TSP Competition Goal In the AI for TSP competition we try to solve optimization problems using machine learning. The competition will be hosted

Paulo da Costa 11 Nov 27, 2022
The LaTeX and Python code for generating the paper, experiments' results and visualizations reported in each paper is available (whenever possible) in the paper's directory

This repository contains the software implementation of most algorithms used or developed in my research. The LaTeX and Python code for generating the

João Fonseca 3 Jan 03, 2023
Hand tracking demo for DIY Smart Glasses with a remote computer doing the work

CameraStream This is a demonstration that streams the image from smartglasses to a pc, does the hand recognition on the remote pc and streams the proc

Teemu Laurila 20 Oct 13, 2022
a baseline to practice

ccks2021_track3_baseline a baseline to practice 路径可能会有问题,自己改改 torch==1.7.1 pyhton==3.7.1 transformers==4.7.0 cuda==11.0 this is a baseline, you can fi

45 Nov 23, 2022
Canonical Capsules: Unsupervised Capsules in Canonical Pose (NeurIPS 2021)

Canonical Capsules: Unsupervised Capsules in Canonical Pose (NeurIPS 2021) Introduction This is the official repository for the PyTorch implementation

165 Dec 07, 2022
BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training

BigDetection: A Large-scale Benchmark for Improved Object Detector Pre-training By Likun Cai, Zhi Zhang, Yi Zhu, Li Zhang, Mu Li, Xiangyang Xue. This

290 Dec 29, 2022
Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Coming soon!

ToxiChat Code and data for the EMNLP 2021 paper "Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts". Install depen

Ashutosh Baheti 11 Jan 01, 2023
FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks

FinGAT: A Financial Graph Attention Networkto Recommend Top-K Profitable Stocks This is our implementation for the paper: FinGAT: A Financial Graph At

Yu-Che Tsai 64 Dec 13, 2022
Generative Models for Graph-Based Protein Design

Graph-Based Protein Design This repo contains code for Generative Models for Graph-Based Protein Design by John Ingraham, Vikas Garg, Regina Barzilay

John Ingraham 159 Dec 15, 2022
PyTorch implementation of Algorithm 1 of "On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models"

Code for On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models This repository will reproduce the main results from our pape

Mitch Hill 32 Nov 25, 2022
Lightweight Face Image Quality Assessment

LightQNet This is a demo code of training and testing [LightQNet] using Tensorflow. Uncertainty Losses: IDQ loss PCNet loss Uncertainty Networks: Mobi

Kaen 5 Nov 18, 2022
Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line Segment Detection"

M-LSD: Towards Light-weight and Real-time Line Segment Detection Official Tensorflow implementation of "M-LSD: Towards Light-weight and Real-time Line

NAVER/LINE Vision 357 Jan 04, 2023
Augmentation for Single-Image-Super-Resolution

SRAugmentation Augmentation for Single-Image-Super-Resolution Implimentation CutBlur Cutout CutMix Cutup CutMixup Blend RGBPermutation Identity OneOf

Yubo 6 Jun 27, 2022
GLODISMO: Gradient-Based Learning of Discrete Structured Measurement Operators for Signal Recovery

GLODISMO: Gradient-Based Learning of Discrete Structured Measurement Operators for Signal Recovery This is the code to the paper: Gradient-Based Learn

3 Feb 15, 2022
Learning-Augmented Dynamic Power Management

Learning-Augmented Dynamic Power Management This repository contains source code accompanying paper Learning-Augmented Dynamic Power Management with M

Adam 0 Feb 22, 2022