Random dataframe and database table generator

Overview

Random database/dataframe generator

Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA

Introduction

Often, beginners in SQL or data science struggle with the matter of easy access to a large sample database file (.DB or .sqlite) for practicing SQL commands. Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?

After all, databases break every now and then and it is safest to practice with a randomly generated one :-)

https://imgs.xkcd.com/comics/exploits_of_a_mom.png

While it is easy to generate random numbers or simple words for Pandas or dataframe operation learning, it is often non-trivial to generate full data tables with meaningful yet random entries of most commonly encountered fields in the world of database, such as

  • name,
  • age,
  • birthday,
  • credit card number,
  • SSN,
  • email id,
  • physical address,
  • company name,
  • job title,

This Python package generates a random database TABLE (or a Pandas dataframe, or an Excel file) based on user's choice of data types (database fields). User can specify the number of samples needed. One can also designate a "PRIMARY KEY" for the database table. Finally, the TABLE is inserted into a new or existing database file of user's choice.

https://raw.githubusercontent.com/tirthajyoti/pydbgen/master/images/Top_image_1.png

Dependency and Acknowledgement

At its core, pydbgen uses Faker as the default random data generating engine for most of the data types. Original function is written for few data types such as realistic email and license plate. Also the default phone number generated by Faker is free-format and does not correspond to US 10 digit format. Therefore, a simple phone number data type is introduced in pydbgen. The original contribution of pydbgen is to take the single data-generating function from Faker and use it cleverly to generate Pandas data series or dataframe or SQLite database tables as per the specification of the user. Here is the link if you want to look up more about Faker package,

Faker Documentation Home

Installation

(On Linux and Windows) You can use pip to install pydbgen:

pip install pydbgen

(On Mac OS), first install pip,

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

Then proceed as above.

Usage

Current version (1.0.0) of pydbgen comes with the following primary methods,

  • gen_data_series()
  • gen_dataframe()
  • gen_table()
  • gen_excel()

The gen_table() method allows you to build a database with as many tables as you want, filled with random data and fields of your choice. But first, you have to create an object of pydb class:

myDB = pydbgen.pydb()

gen_data_series()

Returns a Pandas series object with the desired number of entries and data type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Phone number can be of two types:

  • phone_number_simple generates 10 digit US number in xxx-xxx-xxxx format
  • phone_number_full may generate an international number with different format

Code example:

se=myDB.gen_data_series(data_type='date')
print(se)

0    1995-08-09
1    2001-08-01
2    1980-06-26
3    2018-02-18
4    1972-10-12
5    1983-11-12
6    1975-09-04
7    1970-11-01
8    1978-03-23
9    1976-06-03
dtype: object

gen_dataframe()

Generates a Pandas dataframe filled with random entries. User can specify the number of rows and data type of the fields/columns.

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.

Code example:

testdf=myDB.gen_dataframe(
25,fields=['name','city','phone',
'license_plate','email'],
real_email=True,phone_simple=True
)

gen_table()

Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. User can specify various data types to be included as database table fields.

All data types (fields) in the SQLite table will be of VARCHAR type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • db_file: Name of the database where the TABLE will be created or updated. Default database name will be chosen if not specified by user.
  • table_name: Name of the table, to be chosen by user. Default table name will be chosen if not specified by user.
  • primarykey: User can choose a PRIMARY KEY from among the various fields. If nothing specified, the first data field will be made PRIMARY KEY. If user chooses a field, which is not in the specified list, an error will be thrown and no table will be generated.

Code example:

myDB.gen_table(
20,fields=['name','city','job_title','phone','company','email'],
db_file='TestDB.db',table_name='People',
primarykey='name',real_city=False
)

gen_excel()

Attempts to create an Excel file using Pandas excel_writer function. User can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available:

  • Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
  • Month, weekday, year, time, date
  • Personal email, official email, SSN
  • Company, Job title, phone number, license plate

Customization choices are following:

  • real_email: If True and if a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, Tirtha Sarkar name with this choice enabled, will generate emails like [email protected] or [email protected].
  • real_city: If True, a real US city's name will be picked up from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated.
  • phone_simple: If True, a 10 digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with different format may be returned.
  • filename: Name of the Excel file to be created or updated. Default file name will be chosen if not specified by user.

Code example:

myDB.gen_excel(15,fields=['name','year','email','license_plate'],
        filename='TestExcel.xlsx',real_email=True)

Other auxiliary methods available

Few other auxiliary functions available in this package.

Owner
Tirthajyoti Sarkar
Data Sc/Engineering manager , Industry 4.0, edge-computing, semiconductor technologist, Author, Python pkgs - pydbgen, MLR, and doepy,
Tirthajyoti Sarkar
A CLI tool to reduce the friction between data scientists by reducing git conflicts removing notebook metadata and gracefully resolving git conflicts.

databooks is a package for reducing the friction data scientists while using Jupyter notebooks, by reducing the number of git conflicts between different notebooks and assisting in the resolution of

dataroots 86 Dec 25, 2022
We're Team Arson and we're using the power of predictive modeling to combat wildfires.

We're Team Arson and we're using the power of predictive modeling to combat wildfires. Arson Map Inspiration There’s been a lot of wildfires in Califo

Jerry Lee 3 Oct 17, 2021
Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

FangWei 1 Jan 16, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 07, 2023
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Bigdata Simulation Library Of Dream By Sandman Books

BIGDATA SIMULATION LIBRARY OF DREAM BY SANDMAN BOOKS ================= Solution Architecture Description In the realm of Dreaming, its ruler SANDMAN,

Maycon Cypriano 3 Jun 30, 2022
Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Analysis of a dataset of 10000 passwords to find common trends and mistakes people generally make while setting up a password.

Aryan Raj 7 Sep 04, 2022
A probabilistic programming language in TensorFlow. Deep generative models, variational inference.

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilis

Blei Lab 4.7k Jan 09, 2023
An easy-to-use feature store

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

ByteHub AI 48 Dec 09, 2022
Open-Domain Question-Answering for COVID-19 and Other Emergent Domains

Open-Domain Question-Answering for COVID-19 and Other Emergent Domains This repository contains the source code for an end-to-end open-domain question

7 Sep 27, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021
The repo for mlbtradetrees.com. Analyze any trade in baseball history!

The repo for mlbtradetrees.com. Analyze any trade in baseball history!

7 Nov 20, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023
Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-

International Business Machines 293 Dec 29, 2022
EOD Historical Data Python Library (Unofficial)

EOD Historical Data Python Library (Unofficial) https://eodhistoricaldata.com Installation python3 -m pip install eodhistoricaldata Note Demo API key

Michael Whittle 20 Dec 22, 2022
This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot.

superSFS This is a tool for speculation of ancestral allel, calculation of sfs and drawing its bar plot. It is easy-to-use and runing fast. What you s

3 Dec 16, 2022
songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Songplays User activity datamart The following document describes the model used to build the songplays datamart table and the respective ETL process.

Leandro Kellermann de Oliveira 1 Jul 13, 2021