A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
The official Repository wherein newbies into Open Source can Contribute during the Hacktoberfest 2021

Hacktoberfest 2021 Get Started With your first Contrinution/Pull Request : Fork/Copy the repo by clicking the right most button on top of the page. Go

HacOkars 25 Aug 20, 2022
Python 101 Forever

🚀 Python 101 Forever 🚀 Official Python 101 Forever GitHub repository. START HERE - CHECK README SUBSCRIBE FOR UPDATES HERE Sponsors Contac

Hack Bulgaria 58 Nov 30, 2022
Hack CMU Go Local Project

GoLocal A submission for the annual HackCMU Hackathon. We built a website which connects shopper with local businesses. The goal is to drive consumers

2 Oct 02, 2021
Random Programming Language Project

Crastle Random Programming Language Project Freedom of expression Are you a fan of curly brace languages? Then use curly braces! Not a fan of curly br

DevNugget 2 Dec 23, 2021
Necst-lib - Pure Python tools for NECST

necst-lib Pure Python tools for NECST. Features This library provides: something

NANTEN2 Group 5 Dec 15, 2022
A nonebot2 plugin, send news information in a picture form.

A nonebot2 plugin, send news information in a picture form.

幼稚园园长 7 Nov 18, 2022
A Company Management System For Python

campany-management Getting started To make it easy for you to get started with GitLab, here's a list of recommended next steps. Already a pro? Just ed

hatice akpınar 3 Aug 29, 2022
A Kodi add-on for watching content hosted on PeerTube.

A Kodi add-on for watching content hosted on PeerTube. This add-on is under development so only basic features work, and you're welcome to improve it.

1 Dec 18, 2021
Eatlocal - This package helps users solve PyBites code challenges on their local machine

eatlocal This package helps the user solve Pybites code challenges locally. Inst

Russell 0 Jul 25, 2022
sfgp is a package that aggregates individual scripts and notebooks, primarily written for the basic analysis tasks of genetics and pharmacogenomics data.

sfgp is a package that aggregates individual scripts and notebooks, primarily written for the basic analysis tasks of genetics and pharmacogenomics data.

Vishal Sarsani 1 Mar 31, 2022
A webapp for taking fast notes, designed for business, school, and collaboration with groups.

JOTS Journal of the Session A webapp for taking fast notes, designed for business, school, and collaboration with groups.

Zebadiah S. Taylor 2 Jun 10, 2022
This code makes the logs provided by Fiddler proxy of the Google Analytics events coming from iOS more readable.

GA-beautifier-iOS This code makes the logs provided by Fiddler proxy of the Google Analytics events coming from iOS more readable. To run it, create a

Rafael Machado 3 Feb 02, 2022
validation for pre-commit.ci configuration

pre-commit-ci-config validation for pre-commit.ci configuration installation pip install pre-commit-ci-config api pre_commit_ci_config.SCHEMA a cfgv s

pre-commit.ci 17 Jul 11, 2022
A simple and efficient computing package for Genshin Impact gacha analysis

GGanalysisLite计算包 这个版本的计算包追求计算速度,而GGanalysis包有着更多计算功能。 GGanalysisLite包通过卷积计算分布列,通过FFT和快速幂加速卷积计算。 测试玩家得到的排名值rank的数学意义是:与抽了同样数量五星的其他玩家相比,测试玩家花费的抽数大于等于比例

一棵平衡树 34 Nov 26, 2022
Simple, configuration-driven backup software for servers and workstations

title permalink borgmatic index.html It's your data. Keep it that way. borgmatic is simple, configuration-driven backup software for servers and works

borgmatic collective 1.3k Dec 30, 2022
Information about a signed UEFI Shell that can be used when Secure Boot is enabled.

SignedUEFIShell During our research of the BootHole vulnerability last year, we tried to find as many signed bootloaders as we could. We searched all

Mickey 61 Jan 03, 2023
Algorand Python API examples

Algorand-Py Algorand Python API examples This repo will hold example scripts to monitor activities on Algorand main net. You can: Monitor your assets

Karthik Dutt 2 Jan 23, 2022
Automatización del proceso Inmofianza

Selenium Inmofianza Proyecto de pruebas automatizadas con selenium webdriver para el aplicativo Omnicanalidad Pre-requisitos 📋 Componentes que deben

Natalia Narváez 1 Jan 07, 2022
Pytorch implementation of "Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates"

Peer Loss functions This repository is the (Multi-Class & Deep Learning) Pytorch implementation of "Peer Loss Functions: Learning from Noisy Labels wi

Kushal Shingote 1 Feb 08, 2022
A simple python script where the user inputs the current ingredients they have in their kitchen into ingredients.txt

A simple python script where the user inputs the current ingredients they have in their kitchen into ingredients.txt and then runs the main.py script, and it will output what recipes can be created b

Jordan Leich 3 Nov 02, 2022