Galaxy images labelled by morphology (shape). Aimed at ML development and teaching

Last update: Nov 28, 2022

Overview

GalaxyMNIST

Galaxy images labelled by morphology (shape). Aimed at ML debugging and teaching.

Contains 10,000 images of galaxies (3x64x64), confidently labelled by Galaxy Zoo volunteers as belonging to one of four morphology classes.

Installation

git clone https://github.com/mwalmsley/galaxy_mnist
pip install -e galaxy_mnist

The only declared dependencies are pandas, scikit-learn, and h5py (for .hdf5 support). PyTorch (torch) is also required but is deliberately not listed as a dependency, because you likely already have it installed and may need a very specific build (e.g. from conda, AWS-optimised, etc.).

Use

Use it just as you would MNIST:

from galaxy_mnist import GalaxyMNIST

dataset = GalaxyMNIST(
    root='/some/download/folder',
    download=True,
    train=True  # by default, or set False for test set
)

Access the images and labels - in a fixed "canonical" 80/20 train/test division - like so:

images, labels = dataset.data, dataset.targets

You can also divide the data according to your own preferences with load_custom_data:

(custom_train_images, custom_train_labels), (custom_test_images, custom_test_labels) = dataset.load_custom_data(test_size=0.8, stratify=True)

See load_in_pytorch.py for a working example.

Dataset Details

GalaxyMNIST has four classes: smooth and round, smooth and cigar-shaped, edge-on-disk, and unbarred spiral (you can retrieve this as a list with GalaxyMNIST.classes).
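As a small illustration, integer labels can be turned back into readable names by indexing into the class list. The list below is a hard-coded stand-in that follows the order of the sentence above; the actual order and spelling in GalaxyMNIST.classes may differ, so check it before relying on specific indices. The helper name label_to_name is hypothetical.

```python
# Stand-in for GalaxyMNIST.classes; the order and spelling here are assumptions.
ASSUMED_CLASSES = [
    "smooth and round",
    "smooth and cigar-shaped",
    "edge-on-disk",
    "unbarred spiral",
]


def label_to_name(label, classes=ASSUMED_CLASSES):
    """Return the class name for an integer label (hypothetical helper)."""
    index = int(label)  # int() also accepts 0-d tensors and numpy integers
    if not 0 <= index < len(classes):
        raise ValueError(f"label {index} is outside 0..{len(classes) - 1}")
    return classes[index]


print([label_to_name(i) for i in (0, 3)])
```

With the real dataset, passing classes=GalaxyMNIST.classes instead of the stand-in list should give the names the package actually uses.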

The galaxies are selected from Galaxy Zoo DECaLS Campaign A (GZD-A), which classified images taken by DECaLS and released in DR1 and DR2. The images are as shown to volunteers on Galaxy Zoo, except for a 75% crop followed by a resize to 64x64 pixels.
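To make the crop step concrete, here is a minimal sketch of how a crop box could be computed. It assumes the crop is centred and that "75%" refers to the side length rather than the area; neither detail is stated above, and the input size used in the example is made up. The function name center_crop_box is hypothetical.

```python
def center_crop_box(width, height, fraction=0.75):
    """Return (left, top, right, bottom) for a centred crop.

    `fraction` is the share of each side that is kept (an assumption about
    what "75% crop" means).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    crop_w = round(width * fraction)
    crop_h = round(height * fraction)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


# Hypothetical 400x400 input; the original image size is not stated above.
box = center_crop_box(400, 400)
print(box)  # (50, 50, 350, 350)
```

A box in this form could then be passed to an image library to crop the image before resizing it to 64x64. The scripts in the reproduce folder remain the authoritative record of what was actually done.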

For a galaxy to receive a label, at least 17 volunteers must have been asked the necessary questions, and at least half of them must have answered with the given class. The class labels are therefore much more confident than those produced by, for example, simply labelling with the most common answer to some question.
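The threshold rule can be sketched for a single question as follows. This is a simplification: the real labels depend on answers to several questions in the Galaxy Zoo decision tree, and the notebook in the reproduce folder is the authoritative source. The function and parameter names are hypothetical.

```python
def is_confident_label(votes_for_class, people_asked, min_asked=17, min_fraction=0.5):
    """Check the two conditions described above for one question.

    Returns True only if enough people were asked AND at least
    `min_fraction` of them gave the answer matching the class.
    """
    if not 0 <= votes_for_class <= people_asked:
        raise ValueError("votes_for_class must be between 0 and people_asked")
    if people_asked < min_asked:
        return False  # too few volunteers saw the question
    return votes_for_class / people_asked >= min_fraction


print(is_confident_label(9, 17))   # True: 17 asked, 9/17 is over half
print(is_confident_label(8, 17))   # False: 8/17 is under half
print(is_confident_label(12, 16))  # False: only 16 people were asked
```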

The classes are balanced exactly equally across the whole dataset (2500 galaxies per class), but only approximately equally (by random sampling) in the canonical train/test split. For a split with exactly equal classes on both sides, use load_custom_data with stratify=True.
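The difference between a random and a stratified split can be illustrated with a short standard-library sketch. It is not how load_custom_data is implemented; it only shows the idea of splitting each class separately so that both sides keep exactly the same class proportions. The toy labels and the helper name stratified_split are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy labels mirroring the dataset's balance: 2500 examples for each of 4 classes.
labels = [c for c in range(4) for _ in range(2500)]
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in test_idx))
```

Because each class is split on its own, every class contributes exactly 500 test and 2000 train examples here, whereas a plain random split would only match those counts approximately.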

You can see the exact choices made to select the galaxies and labels under the reproduce folder. This includes the notebook exploring and selecting choices for pruning the decision tree, and the script for saving the final dataset(s).

Citations and Further Reading

If you use this dataset, please cite Galaxy Zoo DECaLS, the data release paper from which the labels are drawn. Please also acknowledge the DECaLS survey (see the linked paper for an example).

You can find the original volunteer votes (and images) on Zenodo here.

Owner

Mike Walmsley
