Semi-Automated Data Processing

Overview

Preparing data for model training is one of the most important steps in any project, and traditionally one of the most time consuming. Exploratory data analysis plays a central role in the data science workflow and, in practice, takes up most of it. According to Wikipedia, in statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task. EDA was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. There is a fitting quote of uncertain origin: "In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data."

This project handles the task with minimal user interaction: it analyzes your data and identifies fixes, screens out fields that are problematic or unlikely to be useful, derives new attributes when appropriate, and improves performance through intelligent screening techniques. You can use the project in a semi-interactive fashion, previewing the changes before they are made and accepting or rejecting them as you wish.

This project covers the three steps of any project workflow that come before model training:
1) Exploratory data analysis
2) Feature engineering
3) Feature selection


All of these steps are carried out by the user by calling the following functions:

1) identify_feature(data):
This function identifies the categorical, continuous numerical, and discrete numerical features in the dataset. It also identifies datetime features and extracts the relevant information from them.

Input:
data=Dataset

Output:
df=Dataset
data_cont_num_feature= List of feature names containing continuous numerical values
data_dis_num_feature= List of feature names containing discrete numerical values
data_cat_feature= List of feature names containing categorical values
dt_feature= List of feature names containing datetime values
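
A minimal usage sketch, assuming the project's functions have been imported (the README does not state the module path) and that the data is loaded with pandas; the CSV file name is hypothetical:

```python
import pandas as pd

# from semi_automated_processing import *  # hypothetical import path

data = pd.read_csv("train.csv")  # hypothetical file name

# Returns the dataset plus the four feature-type lists documented above.
df, data_cont_num_feature, data_dis_num_feature, data_cat_feature, dt_feature = \
    identify_feature(data)
```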

2) plot_nan_feature(data, continuous_features, discrete_features, categorical_features, dependent_var):
The function identifies the missing values in the dataset and visualizes their impact on the dependent feature.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
categorical_features= List of feature names containing categorical values
dependent_var= Dependent feature name in string format

Output:
df= Dataset
nan_features= List of feature names containing NaN values
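
A sketch continuing from identify_feature; the target column name is hypothetical:

```python
# Plot the NaN counts and their relationship with the dependent feature.
df, nan_features = plot_nan_feature(
    df, data_cont_num_feature, data_dis_num_feature, data_cat_feature,
    dependent_var="target",  # hypothetical target column name
)
```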

3) visualize_imputation_impact(data, continuous_features, discrete_features, categorical_features, nan_features, dependent_var):
The function visualizes the impact of different NaN-value imputations on the distribution of values in each feature.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
categorical_features= List of feature names containing categorical values
nan_features= List of feature names containing NaN values
dependent_var= Dependent feature name in string format

Output:
None
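
For instance, to preview the imputation options before committing to any of them (continuing from the sketches above):

```python
# Purely diagnostic: plots the candidate imputations and returns None.
visualize_imputation_impact(
    df, data_cont_num_feature, data_dis_num_feature, data_cat_feature,
    nan_features, dependent_var="target",  # hypothetical target column
)
```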

4) nan_imputation(data,mean_feature,median_feature,mode_feature,random_feature,new_category):
The function imputes the NaN values in each feature according to the user input.

Input:
data=Dataset
mean_feature= List of feature names on which to carry out mean imputation
median_feature= List of feature names on which to carry out median imputation
mode_feature= List of feature names on which to carry out mode imputation
random_feature= List of feature names on which to carry out random-sample imputation
new_category= List of feature names for which a new category is created for the NaN values

Output:
None
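
A sketch with hypothetical column names; a feature that needs no imputation is simply left out of every list:

```python
# Impute each listed column with the chosen strategy; returns None.
nan_imputation(
    df,
    mean_feature=["age"],        # hypothetical: roughly symmetric numeric column
    median_feature=["income"],   # hypothetical: skewed numeric column
    mode_feature=["city"],       # hypothetical: categorical column
    random_feature=[],
    new_category=["cabin"],      # hypothetical: NaN becomes its own category
)
```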

5) cross_visualization(data, continuous_features, discrete_features, categorical_features, dt_features):
The function visualizes the relationships between the different independent features.

Input:
data= Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
categorical_features= List of feature names containing categorical values
dt_features= List of feature names containing datetime values

Output:
continuous_features2= List of feature names containing continuous numerical values, excluding the dependent feature
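
A sketch continuing from the earlier steps:

```python
# Pairwise plots between independent features; returns the continuous
# feature list minus the dependent feature.
continuous_features2 = cross_visualization(
    df, data_cont_num_feature, data_dis_num_feature, data_cat_feature, dt_feature,
)
```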

6) dependent_independent_visualization(data, continuous_features, discrete_features, categorical_features, dt_features, dependent_feature):
The function visualizes the relationship between the dependent feature and each of the independent features.

Input:
data= Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
categorical_features= List of feature names containing categorical values
dt_features= List of feature names containing datetime values
dependent_feature= Dependent feature name in string format

Output:
None
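
A sketch, feeding in the reduced continuous list returned by cross_visualization:

```python
# Plot each independent feature against the dependent feature; returns None.
dependent_independent_visualization(
    df, continuous_features2, data_dis_num_feature, data_cat_feature,
    dt_feature, dependent_feature="target",  # hypothetical target column
)
```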

7) outlier_removal(data, continuous_features, discrete_features, dependent_var, dependent_var_type, action):
The function visualizes the outliers using boxplots and removes them.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
dependent_var= Dependent feature name in string format
dependent_var_type= String indicating whether the problem is regression (use 'Regression') or something else
action= Pass 'remove' to delete the rows associated with the outliers

Output:
df=Dataset
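
A sketch for a regression problem, continuing from the earlier steps:

```python
# Show the boxplots and drop the outlier rows in one call.
df = outlier_removal(
    df, data_cont_num_feature, data_dis_num_feature,
    dependent_var="target",           # hypothetical target column
    dependent_var_type="Regression",  # per the input description above
    action="remove",
)
```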

8) transformation_visualization(data, continuous_features, discrete_features, dependent_feature):
The function visualizes each feature after performing various transformation techniques.

Input:
data=Dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
dependent_feature= Dependent feature name in string format

Output:
None
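
A sketch, continuing from the earlier steps:

```python
# Diagnostic only: plots each candidate transformation and returns None.
transformation_visualization(
    df, data_cont_num_feature, data_dis_num_feature,
    dependent_feature="target",  # hypothetical target column
)
```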

9) feature_transformation(train_data, continuous_features, discrete_features, transformation, dependent_feature):
The function performs the feature transformation technique chosen by the user.

Input:
train_data=Training dataset
continuous_features= List of feature names containing continuous numerical values
discrete_features= List of feature names containing discrete numerical values
transformation= Type of transformation: none= no transformation, log= log transformation, sqrt= square-root transformation, reciprocal= reciprocal transformation, exp= exponential transformation, boxcox= Box-Cox transformation
dependent_feature= Dependent feature name in string format

Output:
X_data=Training dataset
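
A sketch applying a log transformation to a training split; the split itself is made with scikit-learn purely for illustration, since the README does not prescribe how the split is produced:

```python
from sklearn.model_selection import train_test_split

# Hypothetical 80/20 split of the processed dataset.
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# Apply the chosen transformation ('log' is one of the options listed above).
X_train = feature_transformation(
    X_train, data_cont_num_feature, data_dis_num_feature,
    transformation="log", dependent_feature="target",  # hypothetical target
)
```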

10) categorical_transformation(train_data, categorical_encoding):
This function transforms the categorical features into numerical ones using encoding techniques.

Input:
train_data=Training dataset
categorical_encoding= Dictionary mapping each encoding technique to the features it applies to, in the form {'one_hot_encoding':[],'frequency_encoding':[],'mean_encoding':[],'target_guided_ordinal_encoding':{}}

Output:
X_data=Training dataset
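
A sketch with hypothetical column names, continuing from the earlier steps:

```python
# Map each encoding technique to the columns it should encode.
categorical_encoding = {
    "one_hot_encoding": ["city"],     # hypothetical column names
    "frequency_encoding": ["brand"],
    "mean_encoding": [],
    "target_guided_ordinal_encoding": {},
}
X_train = categorical_transformation(X_train, categorical_encoding)
```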

11a) feature_selection(Xtrain, ytrain, threshold, data_type, filter_type):
This function performs feature selection based on the dependent and independent features in the training dataset.

Input:
Xtrain=Training dataset
ytrain= Dependent data in the training dataset
threshold= Threshold for the correlation
data_type= Whether the data depends linearly or non-linearly on the output label
filter_type= Combination of input and output data types; for example, if the input is numerical and the output is numerical, use 'in_num_out_num'. The statistical tests applied for each combination are:
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True}}

Output:
Xtrain= Training dataset
feature_df= Dataframe containing features with their p-values

11b) feature_selection(Xtrain, ytrain, Xtest, ytest, threshold, data_type, filter_type):
This function performs feature selection based on the dependent and independent features in both the train and test datasets.

Input:
Xtrain=Training dataset
ytrain= Dependent data in the training dataset
Xtest= Test dataset
ytest= Dependent data in the test dataset
threshold= Threshold for the correlation
data_type= Whether the data depends linearly or non-linearly on the output label
filter_type= Combination of input and output data types; for example, if the input is numerical and the output is numerical, use 'in_num_out_num'. The statistical tests applied for each combination are:
{'in_num_out_num':{'linear':['pearson'],'non-linear':['spearman']},
'in_num_out_cat':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_num':{'linear':['ANOVA'],'non-linear':['kendall']},
'in_cat_out_cat':{'chi_square_test':True,'mutual_info':True}}

Output:
Xtrain= Training dataset
Xtest= Test dataset
feature_df= Dataframe containing features with their p-values
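
A sketch of both variants; the threshold value and the 'linear' string for data_type are illustrative guesses, since the README does not list the accepted values:

```python
# Hypothetical split into features and label, continuing from earlier sketches.
ytrain, ytest = X_train["target"], X_test["target"]
Xtrain = X_train.drop(columns=["target"])
Xtest = X_test.drop(columns=["target"])

# 11a) train-only variant.
Xtrain, feature_df = feature_selection(
    Xtrain, ytrain,
    threshold=0.05,                # hypothetical threshold
    data_type="linear",            # assumed value; see the dictionary above
    filter_type="in_num_out_num",  # numerical inputs, numerical output
)

# 11b) train/test variant: applies the same selection to both splits.
Xtrain, Xtest, feature_df = feature_selection(
    Xtrain, ytrain, Xtest, ytest, threshold=0.05,
    data_type="linear", filter_type="in_num_out_num",
)
```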

12) convert_dtype(data, categorical_features):
This function converts categorical features that contain numeric values stored as categorical data into int format.

Input:
data= Dataset
categorical_features= List of feature names containing categorical values

Output:
df=Dataset
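
A minimal sketch with a hypothetical column name:

```python
# Cast categorical columns that actually hold numbers back to int.
df = convert_dtype(df, categorical_features=["zip_code"])  # hypothetical column
```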

Note:
Use the same parameters for both the train and test datasets for better accuracy.


We have implemented a bike-sharing project to demonstrate how the functions can be used for both classification and regression problem statements.

Owner
Arun Singh Babal
Engineer | Data Science Enthusiast | Machine Learning | Deep Learning | Advanced Computer Vision