A data preprocessing and feature engineering script for a machine learning pipeline is prepared.

Last update: Dec 18, 2021

FEATURE ENGINEERING

Business Problem: A data preprocessing and feature engineering script for a machine learning pipeline needs to be prepared. It is expected that the dataset will be ready for modelling when passed through this script.
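As a rough illustration of what "ready for modelling" could mean here, the sketch below takes a raw table and returns one with no missing values and only numeric columns. It is not the author's script: the function name `prepare_for_modelling`, the `HasCabin` feature, the imputation choices (median age, most frequent port) and the sample rows are all assumptions made for this example. The column names are those listed under ATTRIBUTES.

```python
import pandas as pd


def prepare_for_modelling(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing step; not the author's actual script."""
    out = df.copy()

    # Assumption: fill missing ages with the median age.
    out["Age"] = out["Age"].fillna(out["Age"].median())

    # Assumption: keep only whether a cabin was recorded (1) or not (0).
    out["HasCabin"] = out["Cabin"].notna().astype(int)

    # Assumption: fill missing ports with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Encode the binary Sex column as 0/1.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})

    # One-hot encode the port of embarkation.
    out = pd.get_dummies(out, columns=["Embarked"], dtype=int)

    # Drop identifier-like and free-text columns that were not transformed here.
    return out.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])


# Tiny made-up sample that only mimics the dataset's column names.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

ready = prepare_for_modelling(raw)
print(ready.head())
```

Keeping every step inside a single function means the same transformation can be applied to any new batch of passengers before prediction.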

Story of the Dataset:
The dataset describes the passengers who were aboard the Titanic when it sank. It consists of 891 observations and 12 variables. The target variable is "Survived":

0: the person did not survive.

1: the person survived.
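Because the target is coded as 0/1, its mean is the share of survivors, which gives a quick check for class imbalance before modelling. A minimal sketch, using made-up labels rather than the real data:

```python
import pandas as pd

# Made-up labels for illustration only; not taken from the real dataset.
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# With 0/1 coding, the mean is the fraction of passengers who survived.
survival_rate = survived.mean()

# Count of each class, useful for spotting class imbalance.
class_counts = survived.value_counts().to_dict()

print(survival_rate, class_counts)
```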

ATTRIBUTES:

PassengerId: ID of the passenger

Survived: Survival status (0: not survived, 1: survived)

Pclass: Ticket class (1: 1st class (upper), 2: 2nd class (middle), 3: 3rd class (lower))

Name: Name of the passenger

Sex: Gender of the passenger (male, female)

Age: Age in years

SibSp: Number of siblings/spouses aboard the Titanic
Sibling = Brother, sister, stepbrother, stepsister
Spouse = Husband, wife (mistresses and fiancés were ignored)

Parch: Number of parents/children aboard the Titanic
Parent = Mother, father
Child = Daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore Parch = 0 for them.

Ticket: Ticket number

Fare: Passenger fare

Cabin: Cabin number

Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
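The sibling/spouse and parent/child counts described above are often combined into family-level features. The sketch below shows one common way to do that, assuming the columns are named `SibSp` and `Parch`. The feature names `FamilySize` and `IsAlone` and the sample rows are hypothetical and are not confirmed to be part of the author's script.

```python
import pandas as pd

# Made-up rows for illustration only.
passengers = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Hypothetical feature: relatives aboard plus the passenger themselves.
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1

# Hypothetical flag: 1 when no siblings, spouses, parents or children are aboard.
passengers["IsAlone"] = (passengers["FamilySize"] == 1).astype(int)

print(passengers)
```

Note that a child travelling only with a nanny has Parch = 0 and could therefore be flagged as alone by this rule.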

REFERENCE: Data Science and ML Boot Camp, 2021, Veri Bilimi Okulu (https://www.veribilimiokulu.com/)


Owner

Pinar Oner
