Full-featured Decision Trees and Random Forests learner.

Last update: Aug 15, 2022

Overview

CID3

This is a full-featured Decision Trees and Random Forests learner. It can save trees or forests to disk for later use. It can also query saved trees and Random Forests, and fill in an unlabeled file with the predicted classes. Documentation is not yet available, but the program options can be shown with the command:

% java -jar cid3.jar -h

usage: java -jar cid3.jar
 -a,--analysis <name>    show causal analysis report
 -c,--criteria <name>    input criteria: c[Certainty], e[Entropy], g[Gini]
 -f,--file <name>        input file
 -h,--help               print this message
 -o,--output <name>      output file
 -p,--partition          partition train/test data
 -q,--query <type>       query model, enter: t[Tree] or r[Random forest]
 -r,--forest <amount>    create random forest, enter # of trees
 -s,--save               save tree/random forest
 -t,--threads <amount>   maximum number of threads (default is 500)
 -v,--validation         create 10-fold cross-validation
 -ver,--version          version

List of features

It uses a new Certainty formula as the splitting criterion.
Provides a causal analysis report, which shows how some attribute values cause a particular classification.
Creates full trees, showing error rates for train and test data, attribute importance, causes and false positives/negatives.
If no test data is provided, it can split the training dataset into 80% for training and 20% for testing.
Creates random forests, showing error rates for train and test data, attribute importance, causes and false positives/negatives. Random forests are built in parallel, which speeds up creation.
Creates 10-fold cross-validation for trees and random forests, showing error rates, mean and standard error, and false positives/negatives. Cross-validation folds are created in parallel.
Saves trees and random forests to disk in a compressed file (e.g. model.tree, model.forest).
Queries trees and random forests from saved files. Queries can contain missing values: just enter the character “?”.
Makes predictions and fills out cases files with those predictions, either from single trees or random forests.
Missing-value imputation for train and test data is implemented. Continuous attributes are imputed with the mean value. Discrete attributes are imputed with the mode, that is, the most frequent value.
Attributes can be ignored: in the .names file, just set the attribute type to ignore.
Three different splitting criteria can be used: Certainty, Entropy and Gini. If no criterion is specified, Certainty is used.
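
The Entropy and Gini criteria named above are standard impurity measures. The following is a minimal sketch of how they are usually computed from class counts. It is not CID3's own code: the class and method names are hypothetical, and the Certainty formula is left out because this document does not give it. The sample counts are the class sizes reported in the Titanic example run.

```java
import java.util.Arrays;

/** Illustrative impurity measures; a sketch, not CID3's implementation. */
final class ImpuritySketch {

    /** Shannon entropy, in bits, of a vector of class counts. */
    static double entropy(int[] classCounts) {
        int total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0; // an empty node is treated as pure
        }
        double h = 0.0;
        for (int count : classCounts) {
            if (count == 0) {
                continue; // the term 0 * log(0) is taken as 0
            }
            double p = (double) count / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Gini impurity: 1 minus the sum of squared class proportions. */
    static double gini(int[] classCounts) {
        int total = Arrays.stream(classCounts).sum();
        if (total == 0) {
            return 0.0;
        }
        double sumOfSquares = 0.0;
        for (int count : classCounts) {
            double p = (double) count / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    public static void main(String[] args) {
        // Class sizes of the Titanic training data: 549 cases of class 0, 342 of class 1.
        int[] titanicClasses = {549, 342};
        System.out.printf("entropy = %.4f bits%n", entropy(titanicClasses));
        System.out.printf("gini    = %.4f%n", gini(titanicClasses));
    }
}
```

Both measures are 0 for a node that contains a single class and reach their maximum when the classes are evenly mixed, which is why a tree learner prefers splits that lower them.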

Example run with titanic dataset

% java -jar cid3.jar -f titanic

CID3 [Version 1.1]              Saturday October 30, 2021 06:34:11 AM
------------------
[ ✓ ] Read data: 891 cases for training. (10 attributes)
[ ✓ ] Decision tree created.

Rules: 276
Nodes: 514

Importance Cause   Attribute Name
---------- -----   --------------
      0.57   yes ············ Sex
      0.36   yes ········· Pclass
      0.30   yes ··········· Fare
      0.28   yes ······· Embarked
      0.27   yes ·········· SibSp
      0.26   yes ·········· Parch
      0.23    no ············ Age


[==== TRAIN DATA ====] 

Correct guesses:  875
Incorrect guesses: 16 (1.8%)

# Of Cases  False Pos  False Neg   Class
----------  ---------  ---------   -----
       549         14          2 ····· 0
       342          2         14 ····· 1

Time: 0:00:00
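
The figures in the TRAIN DATA summary can be cross-checked by hand. The sketch below assumes that, for each class, "False Pos" counts cases of the other class that were predicted as that class, and "False Neg" counts cases of that class predicted as the other. Under that assumption the table fully determines the confusion matrix used here. The class and method names are hypothetical and are not part of CID3.

```java
import java.util.Locale;

/** Cross-checks the TRAIN DATA summary; an illustrative sketch, not CID3 code. */
final class ErrorSummarySketch {

    /** Correct guesses: the diagonal of confusion[actual][predicted]. */
    static int correct(int[][] confusion) {
        int sum = 0;
        for (int i = 0; i < confusion.length; i++) {
            sum += confusion[i][i];
        }
        return sum;
    }

    /** Total number of cases in the matrix. */
    static int total(int[][] confusion) {
        int sum = 0;
        for (int[] row : confusion) {
            for (int cell : row) {
                sum += cell;
            }
        }
        return sum;
    }

    /** Cases of other classes predicted as cls (assumed meaning of "False Pos"). */
    static int falsePositives(int[][] confusion, int cls) {
        int sum = 0;
        for (int actual = 0; actual < confusion.length; actual++) {
            if (actual != cls) {
                sum += confusion[actual][cls];
            }
        }
        return sum;
    }

    /** Cases of cls predicted as another class (assumed meaning of "False Neg"). */
    static int falseNegatives(int[][] confusion, int cls) {
        int sum = 0;
        for (int predicted = 0; predicted < confusion[cls].length; predicted++) {
            if (predicted != cls) {
                sum += confusion[cls][predicted];
            }
        }
        return sum;
    }

    /** Fraction of cases that were misclassified. */
    static double errorRate(int[][] confusion) {
        int n = total(confusion);
        return n == 0 ? 0.0 : (double) (n - correct(confusion)) / n;
    }

    public static void main(String[] args) {
        // Derived from the table under the stated assumption:
        // class 0 has 549 cases and 2 false negatives, so 547 are correct;
        // class 1 has 342 cases and 14 false negatives, so 328 are correct.
        int[][] confusion = {
            {547, 2},
            {14, 328}
        };
        int right = correct(confusion);
        System.out.println("Correct guesses:   " + right);
        System.out.println("Incorrect guesses: " + (total(confusion) - right));
        System.out.printf(Locale.ROOT, "Error rate: %.1f%%%n", 100 * errorRate(confusion));
    }
}
```

With these numbers the sketch reproduces the 875 correct and 16 incorrect guesses, and 16/891 rounds to the reported 1.8%.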

Requirements

CID3 requires JDK 15 or higher.

The data format is similar to that of C4.5 and C5.0. The data file is in CSV format, and it can be split into two separate files, such as titanic.data and titanic.test. The class attribute must be the last column of the file. The other required file is the "names" file, which should be named like titanic.names and contains the names and types of the attributes. Its first line lists the possible values of the class attribute; this line can be left empty with just a dot (.). Below is an example of the titanic.names file:

0,1.  
PassengerId: ignore.  
Pclass: 1,2,3.  
Sex : male,female.  
Age: continuous.  
SibSp: discrete.  
Parch: discrete.  
Ticket: ignore.  
Fare: continuous.  
Cabin: ignore.  
Embarked: discrete.
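
As an illustration of the format only, the following sketch reads lines like the ones above. It is not CID3's parser: the class and method names, and the label "enumerated" for attributes declared with an explicit list of values, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Reads lines of a C4.5-style names file; a sketch, not CID3's parser. */
final class NamesFileSketch {

    /** One attribute declaration: its name, its type and any listed values. */
    static final class Attribute {
        final String name;
        final String type;
        final List<String> values;

        Attribute(String name, String type, List<String> values) {
            this.name = name;
            this.type = type;
            this.values = values;
        }
    }

    /** Removes surrounding whitespace and the dot that ends each line. */
    static String stripTerminator(String line) {
        String t = line.trim();
        return t.endsWith(".") ? t.substring(0, t.length() - 1).trim() : t;
    }

    /** Splits "a,b,c" into trimmed tokens; empty text gives an empty list. */
    static List<String> splitValues(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split(",")) {
            String t = token.trim();
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Parses the first line, e.g. "0,1." (a lone "." yields no values). */
    static List<String> parseClassLine(String line) {
        return splitValues(stripTerminator(line));
    }

    /** Parses a line such as "Age: continuous." or "Pclass: 1,2,3." */
    static Attribute parseAttribute(String line) {
        String body = stripTerminator(line);
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing ':' in: " + line);
        }
        String name = body.substring(0, colon).trim();
        String spec = body.substring(colon + 1).trim();
        if (spec.equals("ignore") || spec.equals("continuous") || spec.equals("discrete")) {
            return new Attribute(name, spec, new ArrayList<>());
        }
        // Anything else is read as an explicit, comma-separated list of values.
        return new Attribute(name, "enumerated", splitValues(spec));
    }

    public static void main(String[] args) {
        System.out.println("classes = " + parseClassLine("0,1."));
        String[] lines = {"PassengerId: ignore.", "Pclass: 1,2,3.", "Sex : male,female.", "Age: continuous."};
        for (String line : lines) {
            Attribute a = parseAttribute(line);
            System.out.println(a.name + " -> " + a.type + " " + a.values);
        }
    }
}
```

The sketch trims whitespace around the colon, so a line written as "Sex : male,female." is read the same way as "Sex: male,female.".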

Example of causal analysis

% java -jar cid3.jar -f adult -a education

From this example we can see that the attribute "education" is a cause, which is determined using the certainty-raising inequality. Once we know that it is a cause, we compare the causal certainties of its values. When its value is "Doctorate", it causes the earnings to be greater than $50,000, with a probability of 0.73. A paper will soon be published with all the formulas used to calculate the Certainty for splitting the nodes and the certainty-raising inequality used for causal analysis.

Importance Cause   Attribute Name
---------- -----   --------------
      0.56   yes ······ education

Report of causal certainties
----------------------------

[ Attribute: education ]

    1st-4th --> <=50K  (0.97)

    5th-6th --> <=50K  (0.95)

    7th-8th --> <=50K  (0.94)

    9th --> <=50K  (0.95)

    10th --> <=50K  (0.94)

    11th --> <=50K  (0.95)

    12th --> <=50K  (0.93)

    Assoc-acdm --> <=50K  (0.74)

    Assoc-voc --> <=50K  (0.75)

    Bachelors --> Non cause.

    Doctorate --> >50K  (0.73)

    HS-grad --> <=50K  (0.84)

    Masters --> >50K  (0.55)

    Preschool --> <=50K  (0.99)

    Prof-school --> >50K  (0.74)

    Some-college --> <=50K  (0.81)

Releases (v1.2.4)

v1.2.4 (Apr 28, 2022)

Fixed a bug when entering an attribute name for the causal analysis report.

v1.2.3 (Mar 10, 2022)

Implemented progress animation when option -s is invoked.

v1.2.2 (Mar 2, 2022)

Added progress animation to the analysis report.

v1.2.1 (Jan 21, 2022)

Replaced a problematic character.

v1.2 (Nov 9, 2021)

This version includes the correct calculation of causal certainties and the certainty-raising inequality. Also, the analysis report is sorted by attribute values.

v1.1.5 (Nov 7, 2021)

Correctly implemented the causal analysis, using the certainty-raising inequality and the causal certainties.

v1.1.3 (Nov 7, 2021)

Implemented causes for specific attribute values.

v1.1.2 (Nov 6, 2021)

Minor patch.

v1.1.1 (Oct 31, 2021)

This is a hurried patch to fix a problem in the causal analysis report. Now the report works as intended.

v1.1 (Oct 30, 2021)

Release v1.1 contains many new features and fixes. Implemented the report of causal certainties, which makes it possible to see how certain attribute values cause a particular classification.

v1.0.7 (Oct 28, 2021)

Code cleanup and new features implemented. When querying a tree, the program now checks for invalid input and asks for correct input. This will be the last patch until version v1.1.

v1.0.6 (Oct 28, 2021)

Correctly aligned text on console.

v1.0.5 (Oct 27, 2021)

Reintroduced attribute importance for Entropy and Gini criteria.

v1.0.4 (Oct 27, 2021)

Removed causal analysis from Entropy and Gini criteria. It only makes sense with Certainty.

v1.0.3 (Oct 23, 2021)

Rolled back the parallel tests of Random Forests. It is much faster now.

v1.0.2 (Oct 23, 2021)

Minor changes.

v1.0.1 (Oct 23, 2021)

Now testing Random Forests is done in parallel.

v1.0 (Oct 18, 2021)

Releasing version v1.0

Owner

Alejandro Penate-Diaz
