Re-implementation of 'Grokking: Generalization beyond overfitting on small algorithmic datasets'

Last update: Aug 09, 2022

Overview

Re-implementation of the paper 'Grokking: Generalization beyond overfitting on small algorithmic datasets'

Paper

Original paper can be found here

Datasets

I am not sure how the paper defines division, so I use integer division:

- $$x \circ y = (x // y) \bmod p$$, for some prime $$p$$ and $$0 \leq x, y \leq p$$
- $$x \circ y = (x // y) \bmod p$$ if $$y$$ is odd, else $$(x - y) \bmod p$$, for some prime $$p$$ and $$0 \leq x, y \leq p$$
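The two definitions above can be sketched in Python as follows. This is a minimal illustration, not the repository's code: the function names `div_op` and `div_odd_op` are hypothetical, and the table is built for y >= 1 only, on the assumption that y = 0 has to be excluded because integer division by zero is undefined.

```python
def div_op(x: int, y: int, p: int) -> int:
    """Integer-division variant: (x // y) mod p. Assumes y != 0."""
    return (x // y) % p


def div_odd_op(x: int, y: int, p: int) -> int:
    """(x // y) mod p when y is odd, otherwise (x - y) mod p."""
    if y % 2 == 1:
        return (x // y) % p
    # Python's % returns a non-negative result for a positive modulus,
    # so negative differences wrap around as expected.
    return (x - y) % p


p = 97
# Full operation table for y >= 1 (assumption: y = 0 is skipped).
table = {(x, y): div_op(x, y, p) for x in range(p) for y in range(1, p)}
```

Because `div_odd_op` only divides when y is odd, it never divides by zero, so it could be tabulated over the full range including y = 0.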

Hyperparameters

The default hyperparameters are taken from the paper; they can be overridden on the command line when running train.py.

Running experiments

To run with the default settings, run python train.py. The first time you train on a given dataset, you must also pass --force_data so that the dataset is created.

Arguments:

optimizer args

"--lr", type=float, default=1e-3
"--weight_decay", type=float, default=1
"--beta1", type=float, default=0.9
"--beta2", type=float, default=0.98

model args

"--num_heads", type=int, default=4
"--layers", type=int, default=2
"--width", type=int, default=128

data args

"--data_name", type=str, default="perm", choices=[
- "perm_xy", # permutation composition x * y
- "perm_xyx1", # permutation composition x * y * x^-1
- "perm_xyx", # permutation composition x * y * x
- "plus", # x + y
- "minus", # x - y
- "div", # x / y
- "div_odd", # x / y if y is odd else x - y
- "x2y2", # x^2 + y^2
- "x2xyy2", # x^2 + y^2 + xy
- "x2xyy2x", # x^2 + y^2 + xy + x
- "x3xy", # x^3 + y
- "x3xy2y" # x^3 + xy^2 + y
]
"--num_elements", type=int, default=5 (choose 5 for permutation data, 97 for arithmetic data)
"--data_dir", type=str, default="./data"
"--force_data", action="store_true", help="Whether to force dataset creation."
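To make the three permutation datasets concrete, here is a rough sketch of how such tables could be generated. It is not taken from the repository: the helper names (`compose`, `inverse`, `perm_table`) are hypothetical, it assumes --num_elements is the number of items being permuted (so 5 gives the 120 elements of the symmetric group on 5 items), and it assumes "x * y" means "apply y first, then x".

```python
from itertools import permutations


def compose(a, b):
    """Return the permutation that applies b first, then a."""
    return tuple(a[i] for i in b)


def inverse(a):
    """Return the inverse permutation of a."""
    inv = [0] * len(a)
    for i, ai in enumerate(a):
        inv[ai] = i
    return tuple(inv)


# Hypothetical mapping from dataset name to binary operation.
ops = {
    "perm_xy": lambda x, y: compose(x, y),
    "perm_xyx1": lambda x, y: compose(compose(x, y), inverse(x)),
    "perm_xyx": lambda x, y: compose(compose(x, y), x),
}


def perm_table(num_elements, op):
    """Map (index of x, index of y) to the index of op(x, y)."""
    elems = list(permutations(range(num_elements)))
    index = {e: i for i, e in enumerate(elems)}
    return {(index[x], index[y]): index[op(x, y)] for x in elems for y in elems}


table = perm_table(5, ops["perm_xy"])  # 120 * 120 input pairs
```

Each permutation is replaced by an integer index so that inputs and labels can be treated as tokens, in the same way as the residues in the arithmetic datasets.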

training args

"--batch_size", type=int, default=512
"--steps", type=int, default=10**5
"--train_ratio", type=float, default=0.5
"--seed", type=int, default=42
"--verbose", action="store_true"
"--log_freq", type=int, default=10
"--num_workers", type=int, default=4
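As an illustration of how --train_ratio and --seed might interact, the sketch below shuffles every example of a dataset with a fixed seed and cuts the list at the requested ratio. The function `split_examples` is hypothetical and the repository may split its data differently; the "plus" dataset with 97 elements is used only as an example input.

```python
import random


def split_examples(examples, train_ratio=0.5, seed=42):
    """Shuffle with a fixed seed, then cut the list at train_ratio."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


p = 97
# Every (x, y, label) triple for the "plus" dataset: (x + y) mod p.
examples = [(x, y, (x + y) % p) for x in range(p) for y in range(p)]
train, val = split_examples(examples, train_ratio=0.5, seed=42)
```

With a fixed seed the same split is reproduced on every run, which matters here because the validation set is simply the part of the operation table the model never sees during training.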

Owner

Tom Lieberum
