Tools for exploratory data analysis in Python

Overview

Dora

Exploratory data analysis toolkit for Python.

Contents

Summary

Dora is a Python library designed to automate the painful parts of exploratory data analysis.

The library contains convenience functions for data cleaning, feature selection & extraction, visualization, partitioning data for model validation, and versioning transformations of data.

The library uses and is intended to be a helpful addition to common Python data analysis tools such as pandas, scikit-learn, and matplotlib.

Setup

$ pip3 install Dora
$ python3
>>> from Dora import Dora

Usage

Reading Data & Configuration

# without initial config
>>> dora = Dora()
>>> dora.configure(output = 'A', data = 'path/to/data.csv')

# is the same as
>>> import pandas as pd
>>> dataframe = pd.read_csv('path/to/data.csv')
>>> dora = Dora(output = 'A', data = dataframe)

>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

Cleaning

# read data with missing and poorly scaled values
>>> import pandas as pd
>>> df = pd.DataFrame([
...   [1, 2, 100],
...   [2, None, 200],
...   [1, 6, None]
... ])
>>> dora = Dora(output = 0, data = df)
>>> dora.data
   0   1    2
0  1   2  100
1  2 NaN  200
2  1   6  NaN

# impute the missing values (using the average of each column)
>>> dora.impute_missing_values()
>>> dora.data
   0  1    2
0  1  2  100
1  2  4  200
2  1  6  150

# scale the values of the input variables (center to mean and scale to unit variance)
>>> dora.scale_input_values()
>>> dora.data
   0         1         2
0  1 -1.224745 -1.224745
1  2  0.000000  1.224745
2  1  1.224745  0.000000

Feature Selection & Extraction

# feature selection / removing a feature
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1

>>> dora.remove_feature('useless_feature')
>>> dora.data
   A   B  C      D
0  1   2  0   left
1  4 NaN  1  right
2  7   8  2   left

# extract an ordinal feature through one-hot encoding
>>> dora.extract_ordinal_feature('D')
>>> dora.data
   A   B  C  D=left  D=right
0  1   2  0       1        0
1  4 NaN  1       0        1
2  7   8  2       1        0

# extract a transformation of another feature
>>> dora.extract_feature('C', 'twoC', lambda x: x * 2)
>>> dora.data
   A   B  C  D=left  D=right  twoC
0  1   2  0       1        0     0
1  4 NaN  1       0        1     2
2  7   8  2       1        0     4

Visualization

# plot a single feature against the output variable
dora.plot_feature('column-name')

# render plots of each feature against the output variable
dora.explore()

Model Validation

# create random partition of training / validation data (~ 80/20 split)
dora.set_training_and_validation()

# train a model on the data
X = dora.training_data[dora.input_columns()]
y = dora.training_data[dora.output]

some_model.fit(X, y)

# validate the model
X = dora.validation_data[dora.input_columns()]
y = dora.validation_data[dora.output]

some_model.score(X, y)

Data Versioning

# save a version of your data
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.snapshot('initial_data')

# keep track of changes to data
>>> dora.remove_feature('useless_feature')
>>> dora.extract_ordinal_feature('D')
>>> dora.impute_missing_values()
>>> dora.scale_input_values()
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107

>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

# use a previous version of the data
>>> dora.snapshot('transform1')
>>> dora.use_snapshot('initial_data')
>>> dora.data
   A   B  C      D  useless_feature
0  1   2  0   left                1
1  4 NaN  1  right                1
2  7   8  2   left                1
>>> dora.logs
[]

# switch back to your transformation
>>> dora.use_snapshot('transform1')
>>> dora.data
   A         B         C    D=left   D=right
0  1 -1.224745 -1.224745  0.707107 -0.707107
1  4  0.000000  0.000000 -1.414214  1.414214
2  7  1.224745  1.224745  0.707107 -0.707107
>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']

Testing

To run the test suite, simply run python3 spec.py from the Dora directory.

Contribute

Pull requests welcome! Feature requests / bugs will be addressed through issues on this repository. While not every feature request will necessarily be handled by me, maintaining a record for interested contributors is useful.

Additionally, feel free to submit pull requests which add features or address bugs yourself.

License

The MIT License (MIT)

Copyright (c) 2016 Nathan Epstein

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Owner
Nathan Epstein
Nathan Epstein
Collection of data visualizing projects through Tableau, Data Wrapper, and Power BI

Data-Visualization-Projects Collection of data visualizing projects through Tableau, Data Wrapper, and Power BI Indigenous-Brands-Social-Movements Pyt

Jinwoo(Roy) Yoon 1 Feb 05, 2022
Using SQLite within Python to create database and analyze Starcraft 2 units data (Pandas also used)

SQLite python Starcraft 2 English This project shows the usage of SQLite with python. To create, modify and communicate with the SQLite database from

1 Dec 30, 2021
Seismic Waveform Inversion Toolbox-1.0

Seismic Waveform Inversion Toolbox (SWIT-1.0)

Haipeng Li 98 Dec 29, 2022
A dashboard built using Plotly-Dash for interactive visualization of Dex-connected individuals across the country.

Dashboard For The DexConnect Platform of Dexterity Global Working prototype submission for internship at Dexterity Global Group. Dashboard for real ti

Yashasvi Misra 2 Jun 15, 2021
Easily convert matplotlib plots from Python into interactive Leaflet web maps.

mplleaflet mplleaflet is a Python library that converts a matplotlib plot into a webpage containing a pannable, zoomable Leaflet map. It can also embe

Jacob Wasserman 502 Dec 28, 2022
A streamlit component for bi-directional communication with bokeh plots.

Streamlit Bokeh Events A streamlit component for bi-directional communication with bokeh plots. Its just a workaround till streamlit team releases sup

Ashish Shukla 123 Dec 25, 2022
Package managers visualization

Software Galaxies This repository combines visualizations of major software package managers. All visualizations are available here: http://anvaka.git

Andrei Kashcha 1.4k Dec 22, 2022
Typical: Fast, simple, & correct data-validation using Python 3 typing.

typical: Python's Typing Toolkit Introduction Typical is a library devoted to runtime analysis, inference, validation, and enforcement of Python types

Sean 171 Jan 02, 2023
Small project demonstrating the use of Grafana and InfluxDB for monitoring the speed of an internet connection

Speedtest monitor for Grafana A small project that allows internet speed monitoring using Grafana, InfluxDB 2 and Speedtest. Demo Requirements Docker

Joshua Ghali 3 Aug 06, 2021
Because trello only have payed options to generate a RunUp chart, this solves that!

Trello Runup Chart Generator The basic concept of the project is that Corello is pay-to-use and want to use Trello To-Do/Doing/Done automation with gi

RĂ´mulo Schiavon 1 Dec 21, 2021
Render Jupyter notebook in the terminal

jut - JUpyter notebook Terminal viewer. The command line tool view the IPython/Jupyter notebook in the terminal. Install pip install jut Usage $jut --

Kracekumar 169 Dec 27, 2022
Streamlit-template - A streamlit app template based on streamlit-option-menu

streamlit-template A streamlit app template for geospatial applications based on

Qiusheng Wu 41 Dec 10, 2022
Main repository for Vispy

VisPy: interactive scientific visualization in Python Main website: http://vispy.org VisPy is a high-performance interactive 2D/3D data visualization

vispy 3k Jan 03, 2023
Farhad Davaripour, Ph.D. 1 Jan 05, 2022
Easily configurable, chart dashboards from any arbitrary API endpoint. JSON config only

Flask JSONDash Easily configurable, chart dashboards from any arbitrary API endpoint. JSON config only. Ready to go. This project is a flask blueprint

Chris Tabor 3.3k Dec 31, 2022
script to generate HeN ipfs app exports of GLSL shaders

HeNerator A simple script to generate HeN ipfs app exports from any frag shader created with: GlslViewer GlslEditor The Book of Shaders glslCanvas VS

Patricio Gonzalez Vivo 22 Dec 21, 2022
GUI for visualization and interactive editing of SMPL-family body models ie. SMPL, SMPL-X, MANO, FLAME.

Body Model Visualizer Introduction This is a simple Open3D-based GUI for SMPL-family body models. This GUI lets you play with the shape, expression, a

Muhammed Kocabas 207 Jan 01, 2023
Create Badges with stats of Scratch User, Project and Studio. Use those badges in Github readmes, etc.

Scratch-Stats-Badge Create customized Badges with stats of Scratch User, Studio or Project. Use those badges in Github readmes, etc. Examples Document

Siddhesh Chavan 5 Aug 28, 2022
Application for viewing pokemon regional variants.

Pokemon Regional Variants Application Application for viewing pokemon regional variants. Run The Source Code Download Python https://www.python.org/do

Michael J Bailey 4 Oct 08, 2021
Set of matplotlib operations that are not trivial

Matplotlib Snippets This repository contains a set of matplotlib operations that are not trivial. Histograms Histogram with bins adapted to log scale

Raphael Meudec 1 Nov 15, 2021