Decision Tree Regression algorithm implemented in Python from scratch.

Last update: Dec 22, 2021

Overview

Decision_Tree_Regression

I implemented the decision tree regression algorithm in Python. Unlike ordinary linear regression, which fits a single straight line, this algorithm is meant for datasets that follow a curve. It builds a decision tree recursively: in each iteration the training dataset is split into two parts and a regression line is fit to each part. The split is made at the point that minimizes the Mean Squared Error (MSE) of the resulting fits.
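The split search described above can be sketched as follows. This is a minimal illustration and not the repository's actual code: the function names `fit_line`, `sse` and `best_split`, the `min_samples` argument and the choice of the midpoint between neighbouring points as the threshold are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # All x values are equal: fall back to a horizontal line.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def sse(xs, ys, line):
    """Sum of squared errors of the line (a, b) on the given points."""
    a, b = line
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def best_split(xs, ys, min_samples=2):
    """Try every split position and return (threshold, mse) for the one
    with the lowest combined MSE, or None if no split is possible."""
    pairs = sorted(zip(xs, ys))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    best = None
    for i in range(min_samples, n - min_samples + 1):
        if xs[i - 1] == xs[i]:
            continue  # identical x values cannot be separated
        left = sse(xs[:i], ys[:i], fit_line(xs[:i], ys[:i]))
        right = sse(xs[i:], ys[i:], fit_line(xs[i:], ys[i:]))
        mse = (left + right) / n
        if best is None or mse < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, mse)
    return best


# Noise-free V-shaped data: one split near x = 0 gives two perfect lines.
xs = list(range(-5, 6))
ys = [-abs(x) for x in xs]
threshold, mse = best_split(xs, ys)
print(threshold, mse)
```

Refitting both lines for every candidate position is quadratic in the number of samples, which is acceptable for small datasets but would need running sums to scale further.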

The number of regression lines is key: too many lines overfit the data and too few underfit it. Two hyperparameters control this number: the maximum depth of the decision tree and the minimum number of samples in a single split. Both should be tested and optimized for each dataset.
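To show how the two hyperparameters limit the number of regression lines, here is a hedged sketch of a recursive tree builder. The names `build_tree`, `predict` and `mean_squared_error`, the dictionary-based node layout, and the reading of "minimum number of samples" as the minimum on each side of a split are assumptions for illustration; the repository may define them differently.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares line y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def sse(xs, ys, line):
    a, b = line
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def build_tree(xs, ys, max_depth, min_samples, depth=0):
    """Recursively split sorted (xs, ys); every leaf stores one fitted line."""
    node = {"line": fit_line(xs, ys)}
    if depth >= max_depth or len(xs) < 2 * min_samples:
        return node  # hyperparameters forbid another split
    best = None
    for i in range(min_samples, len(xs) - min_samples + 1):
        if xs[i - 1] == xs[i]:
            continue
        err = (sse(xs[:i], ys[:i], fit_line(xs[:i], ys[:i]))
               + sse(xs[i:], ys[i:], fit_line(xs[i:], ys[i:])))
        if best is None or err < best[0]:
            best = (err, i)
    if best is None:
        return node
    i = best[1]
    node["threshold"] = (xs[i - 1] + xs[i]) / 2
    node["left"] = build_tree(xs[:i], ys[:i], max_depth, min_samples, depth + 1)
    node["right"] = build_tree(xs[i:], ys[i:], max_depth, min_samples, depth + 1)
    return node


def predict(node, x):
    """Walk down to the leaf that covers x and evaluate its line."""
    while "threshold" in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    a, b = node["line"]
    return a * x + b


def mean_squared_error(node, xs, ys):
    return sum((y - predict(node, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Noisy sine wave, alternating points used for training and testing.
rng = random.Random(0)
xs = [i / 199 for i in range(200)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.1) for x in xs]
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

for max_depth in range(7):
    tree = build_tree(train_x, train_y, max_depth, min_samples=3)
    print(max_depth,
          round(mean_squared_error(tree, train_x, train_y), 4),
          round(mean_squared_error(tree, test_x, test_y), 4))
```

A tree of depth d has at most 2^d leaves, so the training error can only fall as the depth grows, while the test error typically stops improving once the extra lines start fitting noise. Comparing the two columns is one simple way to pick the depth for a given dataset.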

Creating Datasets

Instead of using datasets downloaded from the internet, I decided to create my own datasets for this project. I generated 4 datasets to test my algorithm: Noisy Sinusoidal Signal, Noisy Second Degree Polynomial, Noisy Linear Line and Noisy Upside Down Triangle Signal. The program generates these datasets when it is run and saves them so that the results can be reproduced. To generate new datasets, simply delete the first dataset file, dataset0.csv. You can also use your own datasets by placing them in the same directory as the Python project.
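The generate-once-then-reuse behaviour could look roughly like the sketch below. Only the file name dataset0.csv comes from the description above; the other file names, the two-column CSV layout, the exact signal formulas, the noise level and the function names `make_datasets` and `load_or_create` are assumptions made for this example.

```python
import csv
import math
import os
import random


def make_datasets(n=200, noise=0.1, seed=0):
    """Return four noisy (xs, ys) datasets on the interval [0, 1]."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    shapes = [
        lambda x: math.sin(2 * math.pi * x),  # sinusoidal signal
        lambda x: (2 * x - 1) ** 2,           # second degree polynomial
        lambda x: 2 * x,                      # straight line
        lambda x: abs(2 * x - 1),             # upside down triangle (V shape)
    ]
    return [(xs, [f(x) + rng.gauss(0, noise) for x in xs]) for f in shapes]


def load_or_create(directory, n=200):
    """Create the CSV files only if dataset0.csv is missing, then load them."""
    first = os.path.join(directory, "dataset0.csv")
    if not os.path.exists(first):
        for k, (xs, ys) in enumerate(make_datasets(n)):
            path = os.path.join(directory, f"dataset{k}.csv")
            with open(path, "w", newline="") as f:
                csv.writer(f).writerows(zip(xs, ys))
    datasets = []
    for k in range(4):
        path = os.path.join(directory, f"dataset{k}.csv")
        with open(path, newline="") as f:
            datasets.append([(float(x), float(y)) for x, y in csv.reader(f)])
    return datasets


if __name__ == "__main__":
    data = load_or_create(".")
    print([len(d) for d in data])
```

Checking for the first file only keeps the logic simple, and it is why deleting dataset0.csv is enough to trigger regeneration of all four files.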

Plotting Results

The plots show the results for the sinusoidal signal and the upside down triangle with various hyperparameters. Colored points represent the splits of the training dataset, black lines represent the linear regression line fit to the corresponding split, and the larger gray points represent the test dataset.

For these datasets, the best value observed for the maximum depth is 4.

