Code for the paper "There is no Double-Descent in Random Forests"

This repository contains the code to run the experiments for our paper "There is no Double-Descent in Random Forests". The paper highlights experiments on five datasets (adult, bank, eeg, magic, nomao), but this implementation supports more datasets out of the box. Most of the code is commented and should be self-explanatory given the three caveats below. To run the experiments, clone this repository and follow the steps below.

(Optional) Build the conda environment and activate it:

conda env create -f environment.yml --force
conda activate rfdd

Run the experiments on the adult dataset with M = 256 trees, 5-fold cross-validation, a range of max_nodes values, and 96 threads:

./run.py -x 5 -M 256 --n_jobs 96 --max_nodes 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 -d adult
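The max_nodes values in the command above are the powers of two from 2 to 16384. As a minimal sketch, the same command line can be assembled in Python instead of being typed by hand. The variable names are illustrative and not part of run.py.

```python
import shlex

# Powers of two from 2^1 to 2^14, matching the values in the command above.
max_nodes = [2 ** k for k in range(1, 15)]

# Assemble the argument list; only flags shown in this README are used.
cmd = [
    "./run.py",
    "-x", "5",
    "-M", "256",
    "--n_jobs", "96",
    "--max_nodes", *[str(n) for n in max_nodes],
    "-d", "adult",
]

# Print the command as a single shell-ready string.
print(shlex.join(cmd))
```

Changing the exponent range is then enough to widen or narrow the sweep.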

Important 1: This runs all experiments with 96 threads. The experiments are executed in a multiprocessing.Pool, which means that the entire dataset is copied for each cross-validation run. Hence this may take a substantial amount of memory (up to 200 GB) and some time.
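To get a feeling for the memory cost of the copies alone, a rough lower bound can be computed as sketched below. The function name, the dataset shape, and the assumption of dense 8-byte values are all hypothetical. The estimate ignores the memory of the fitted forests, which may dominate in practice.

```python
def estimate_copy_memory_gb(n_rows, n_features, n_workers, bytes_per_value=8):
    """Rough lower bound (in GB) for n_workers dense copies of the dataset.

    Assumes a dense array with bytes_per_value bytes per entry and counts
    only the data copies, not the fitted models or interpreter overhead.
    """
    bytes_per_copy = n_rows * n_features * bytes_per_value
    return n_workers * bytes_per_copy / 1e9


# Hypothetical dataset shape, chosen only for illustration.
estimate = estimate_copy_memory_gb(n_rows=100_000, n_features=100, n_workers=96)
print(f"{estimate:.2f} GB")
```

If the estimate is already close to the available RAM, reducing --n_jobs is the simplest way to lower the footprint.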

Important 2: The command-line argument n_jobs only determines the number of threads in the processing pool, not the total number of threads used by this script. We currently pass n_jobs = n_jobs_per_forest = None to scikit-learn's RandomForestClassifier when fitting the (initial) RF. Hence, scikit-learn itself decides how many jobs are used for fitting the RF (according to its documentation, None means 1 unless a joblib.parallel_backend context is active). If required, you can set n_jobs_per_forest manually in the script (line 132).
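The following sketch shows how a per-forest job count reaches scikit-learn. It is not the code from run.py: the synthetic dataset and the value of n_jobs_per_forest are made up, and using max_leaf_nodes as the counterpart of one --max_nodes value is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, used only to keep the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical value; the script currently passes None here.
n_jobs_per_forest = 2

forest = RandomForestClassifier(
    n_estimators=8,            # the experiments above use M = 256
    max_leaf_nodes=16,         # assumption: stands in for one --max_nodes value
    n_jobs=n_jobs_per_forest,  # jobs used by scikit-learn for this one forest
    random_state=0,
)
forest.fit(X, y)

# Each tree respects the leaf limit.
leaves = [tree.get_n_leaves() for tree in forest.estimators_]
print(max(leaves))
```

With 96 pool workers and n_jobs_per_forest = 2, up to 192 threads could be active at once, which is why the two settings should be chosen together.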

Important 3: Datasets that are not found in the temp folder (given by tempfile.gettempdir(), which likely points to /tmp on Linux systems) are downloaded automatically. If you have already downloaded the datasets, or you simply do not like the temp folder, you can choose a different folder via --tmpdir ${your_new_tmp_dir}.
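The lookup described in this caveat can be sketched as follows. The helper names and the file name are hypothetical and the actual logic in datasets.py may differ. The sketch only decides whether a download would be needed and does not access the network.

```python
import os
import tempfile


def resolve_data_dir(tmpdir=None):
    """Return the user-supplied folder, or the system temp folder by default."""
    return tmpdir if tmpdir is not None else tempfile.gettempdir()


def needs_download(filename, tmpdir=None):
    """True if the dataset file is not yet present in the data folder."""
    return not os.path.exists(os.path.join(resolve_data_dir(tmpdir), filename))


# Hypothetical file name, used only for illustration.
print(resolve_data_dir())
print(needs_download("adult.csv"))
```

Passing a persistent folder instead of the system temp folder avoids repeated downloads on systems that clear /tmp on reboot.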

Plot the results for the adult dataset and store them in the current folder:

./plot.py -d adult -o .

Alternatively, plot.py is divided into execution cells that you can run in an interactive interpreter (e.g. VSCode or a Jupyter Notebook).
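For readers unfamiliar with execution cells, the sketch below shows the common convention of separating cells with "# %%" comments, which VSCode recognizes in plain Python files. Whether plot.py uses exactly this marker is an assumption, and the values are placeholders, not results from the paper.

```python
# %% Cell 1: placeholder x values (not results from the paper)
xs = [2 ** k for k in range(1, 6)]

# %% Cell 2: derive a second series; a cell can be re-run without re-running Cell 1
ys = [x / max(xs) for x in xs]

# %% Cell 3: inspect the values; the real script would create the plots here
pairs = list(zip(xs, ys))
print(pairs)
```

Run as a normal script, the markers are plain comments and the file executes from top to bottom.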

Repository: sbuschjaeger/rf-double-descent
