Notebook and code to synthesize complex, high-dimensional datasets using Gretel APIs.

Overview

Gretel Trainer

This code is designed to help users successfully train synthetic models on complex datasets with high row and column counts. It works by intelligently dividing a dataset into a set of smaller datasets of correlated columns, which can be trained in parallel and then joined back together.

Get Started

Running the notebook

  1. Launch the Notebook in Google Colab or your preferred environment.
  2. Add your dataset and Gretel API key to the notebook.
  3. Generate synthetic data!

NOTE: If you are starting a dataset run from scratch, either delete the existing cache file or choose a new cache file name.
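
If you'd rather skip the notebook, a minimal sketch of the same workflow using the gretel_trainer Python API looks like this (assuming gretel-trainer is installed and your Gretel API key is configured; the project name and file names are illustrative):

    from gretel_trainer import trainer

    # Train a synthetic model; Trainer splits the dataset into smaller
    # partitions of correlated columns and trains them in parallel.
    model = trainer.Trainer(project_name="trainer-demo")
    model.train("my_dataset.csv")

    # Sample a synthetic dataset from the trained model.
    synthetic_df = model.generate(num_records=5000)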

TODOs / Roadmap

  • Enable additional sampling from trained models.
  • Detect and label encode random UIDs (preprocessing).
Comments
  • Benchmark route Amplify models through Trainer

    Top level change

    Now that Trainer has a GretelAmplify model, Benchmark uses Trainer for Amplify runs instead of the SDK.

    Refactor

    I refactored Benchmark's Gretel models and executors to centralize, and thus make it simpler to understand:

    • which model types use Trainer (opt-in) vs. use the SDK
    • the "compatibility requirements" for different models (currently: LSTM <= 150 columns, GPTX == 1 column)

    These had been spread across a few different places (compare.py determined Trainer/SDK, gretel/sdk.py had GPTX compatibility, gretel/trainer.py had LSTM compatibility), but now it can all be found in gretel/models.py.
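
    As a rough sketch of the kind of check that now lives in gretel/models.py (the function name and shape here are my illustration, not the actual code):

    def is_compatible(model_slug: str, column_count: int) -> bool:
        # Centralized "compatibility requirements" per model type.
        if model_slug == "lstm":
            return column_count <= 150  # LSTM supports at most 150 columns
        if model_slug == "gptx":
            return column_count == 1    # GPTX requires exactly 1 column
        return True                     # other models: no restriction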

    At first glance it would seem compatibility requirements could be defined on specific model subclasses to make things more polymorphic. However, Benchmark's Gretel model classes are really just friendly wrappers around specific model configurations (from the blueprints repo) and do not represent all possible instances of that model type running through Benchmark. Instead, we instruct users to subclass the generic GretelModel base class when they want to provide their own specific Gretel configuration. There are two reasons for this:

    1. It's a simpler instruction (always subclass this one thing)
    2. It enables us to include model types that are not yet "first class supported," such as DGAN (which we can't support in the same way we do models like Amplify/LSTM/etc. because DGAN's config includes required fields that are specifically coupled to the data source—there is no "one size fits all" blueprint).

    Small fixes

    • fix the model_slug value for Trainer's GretelACTGAN model
      • :warning: should this be changed to a list ["actgan", "ctgan"] for a little while for a smoother transition/deprecation experience??
    • zero-index custom model runs' run-identifier to match gretel model runs (which were themselves fixed to match project names here)
    opened by mikeknep 2
  • Lift gretel model compatibility to separate module

    What's here

    Make it easier to find the "compatibility rules" for models by lifting the logic to its own module.

    Why not add this logic to the specific model classes? Wouldn't that be more polymorphic?

    The model classes (GretelLSTM, GretelCTGAN, etc.) are wrappers around specific configurations from the blueprints repo. They do not represent every possible configuration of that model type. If a user wants to run a customized LSTM config, for example, they subclass GretelModel, not GretelLSTM:

    class MyLstm(GretelModel):
        config = "/path/to/my_lstm.yml"
    

    Note: they could subclass GretelLSTM, but 1) it's easier to tell people to just subclass GretelModel regardless of model type, and 2) this ultimately treats the model configuration as the source of truth.

    If someone mistakenly created a custom Gretel model like this...

    class MyGptX(GretelGPTX):
        config = "/path/to/my_amplify.yml"
    

    ...Benchmark will treat this as an Amplify model, because basically all it does with the class instance is grab the config attribute (and the name; the results output will show the name as MyGptX).

    opened by mikeknep 1
  • Lr/artifact manifest

    Added logic for config selection and updated the dictionary key used to access the manifest, per the latest internal changes.

    Note that high-dimensionality-high-record is non-existent at the moment, as is the manifest endpoint :)

    Items yet to be addressed:

    • turn off partitions for non-LSTM models
    opened by lipikaramaswamy 1
  • Add param to pass custom base configuration

    • Prefer config if present, otherwise use the model_type's default config (see the sketch below).
    • This does open the door a little wider to setting an invalid config that won't be known to be bad until attempting to train. That door was already slightly ajar in that one could use model_params to set keys to invalid values.
    • Not included here, but a thought: we could validate model_type earlier (even as the very first step of __init__) to fail fast, specifically before even creating a project.
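
    For illustration, usage might look like this (a sketch; I'm assuming the new parameter is named config and is passed through the model wrapper):

    from gretel_trainer import trainer
    from gretel_trainer.models import GretelLSTM

    # Prefer the custom base config; fall back to the model type's
    # default blueprint config when none is given.
    model = trainer.Trainer(model_type=GretelLSTM(config="custom-lstm.yml"))
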
    opened by mikeknep 1
  • Remove no-op elif case from runner

    Particularly given that we now have a third model (Amplify) supported in Trainer, we can remove this no-op elif clause so that the runner only has special logic for / awareness of LSTM (expand up in the diff for context).

    opened by mikeknep 0
  • Switch CTGAN usages to ACTGAN.

    ACTGAN is the successor of CTGAN.

    Note (1): this change is backward compatible, as all of the parameters that CTGAN supported are supported by ACTGAN as well.

    Note (2): any previously trained CTGAN models will be still usable, i.e. it will be possible to generate new records using old CTGAN models.

    opened by pimlock 0
  • Fix off-by-one difference between project name and run ID

    Quick fix so that benchmark's internal run identifier lines up with the project name in Gretel Cloud. We'll eventually provide a more user-friendly and stable interface for accessing detailed run information, but until we settle on exactly how that should look, this should make things a little friendlier for those willing to dive into the internals: the models from project benchmark-{timestamp}-3 will correspond to comparison.results_dict["gretel-3"] (instead of "gretel-4").

    Note: I considered just using the full project name as the identifier instead of gretel-{index}, but we don't have an equivalent to project names for user custom model runs, so I figure the current [gretel|custom]-{index} approach is still best for now.

    opened by mikeknep 0
  • Configure session before starting Benchmark comparison

    Current behavior

    When running in an environment where no Gretel credentials can be found (e.g. Colab), the background threads that instantiate Trainer instances when Benchmark kicks off a comparison will prompt for an API key. This is problematic for multiple reasons, all (I believe) due to the prompting happening in multiple background threads: it prompts multiple times, doesn't accept input and/or cache properly, and ultimately crashes.

    This fix

    Benchmark itself now checks for a configured session before kicking off any real work. It prompts (api_key="prompt") if no credentials are found, validates (validate=True) the supplied API key, and caches (cache="yes") it for all the runs it manages. The configure_session calls that happen when instantiating Trainer effectively "pass through." I've tested this by installing trainer from this branch in Colab and it is now working as expected.
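
    The check amounts to something like the following sketch, using gretel_client's configure_session with the options named above (the exact call site in Benchmark is my assumption):

    from gretel_client import configure_session

    # Prompt for an API key only when no credentials are found, validate
    # the supplied key, and cache it for the background Trainer threads.
    configure_session(api_key="prompt", validate=True, cache="yes")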

    opened by mikeknep 0
  • Include dataset name in trainer uploads.

    Add original file name to data sources uploaded as part of trainer projects. This helps disambiguate the data sources from multiple trainer runs, which previously were always named trainer_0.csv, trainer_1.csv, etc.

    Also fixes StrategyRunner so it no longer silently swallows all ApiExceptions when submitting a job; errors not associated with the max job limit are still raised and surfaced to the user.
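
    The narrowed error handling is roughly this (a hypothetical sketch; the wrapper name and the exact message matching inside StrategyRunner are my assumptions):

    from gretel_client.rest.exceptions import ApiException

    def submit_job_safely(job):
        # Swallow only the max-job-limit error and re-raise anything
        # else so it surfaces to the user.
        try:
            job.submit_cloud()
        except ApiException as e:
            if "max jobs" not in str(e).lower():
                raise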

    opened by kboyd 0
  • Auto-determine best model from training data

    Rather than create a GretelAuto model class that would need to override or work around several _BaseConfig details (validation, max/limit values, etc.), my goal here is to establish the convention that model type is optional; if you don't specify one when instantiating the Trainer, you're OK with us choosing for you. This is a change from the current behavior (optional, but defaulting to LSTM). In this case, we defer setting the trainer instance's self.model_type until we can determine the best model to use: namely, at train time, once a dataset has been provided.

    I'm a little unclear on the load (from cache) workflow, which in this branch's implementation would set the StrategyRunner's model_config to None. I think this is OK because the only methods referencing that value are part of training (train_all_partitions => train_next_partition => train_partition), and that workflow is only kicked off by the Trainer's train method, which will load in data and use it to determine and set a concrete model.

    I've also added an optional delimiter parameter to train to help support files with non-comma delimiters.
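
    In practice, the new behavior sketched above looks like this (illustrative file name):

    from gretel_trainer import trainer

    # No model_type given: Trainer defers the choice and determines the
    # best model at train time, based on the provided dataset.
    model = trainer.Trainer()
    model.train("my_dataset.tsv", delimiter="\t")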

    opened by mikeknep 0
  • Get average sqs score from across partitions

    A few ways we could slice and dice this; I figure there may be additional SQS info we want from runs in the future, so I decided to expose the entire List[dict] from the runner and let the trainer pluck out what it needs to compute this first user-friendly aggregate. I'm open to pushing more of this down to the runner and/or transforming the SQS dictionaries into first-class types (likely dataclasses) if anyone has a strong opinion or thinks it'd be useful.
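
    The plucking amounts to something like this (a sketch of the aggregate described above; the report key name is an assumption for illustration):

    def get_sqs_score(sqs_reports: list) -> int:
        # Average the synthetic data quality score across all partitions.
        scores = [r["synthetic_data_quality_score"] for r in sqs_reports]
        return sum(scores) // len(scores)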

    opened by mikeknep 0
  • Use artifact manifest for determine_best_model.

    Not fully tested. Waiting for new backend API to be available.

    Should revisit retry logic if we can reliably distinguish between a pending manifest (still being generated) and some other error. Or if retrying is included in the gretel_client interface.

    opened by kboyd 1
Releases
  • v0.5.0(Nov 18, 2022)

    What's Changed

    • GretelCTGAN has been completely removed, fully replaced by its successor, GretelACTGAN
    • GretelACTGAN uses the new tabular-actgan config by default
    • Benchmark now routes Amplify models through Trainer rather than the SDK
    • Bug fix: helper to properly configure Gretel session before starting Benchmark comparison when unset
    • Bug fix: zero-index Benchmark run ID (internal) to fix off-by-one difference with project name

    Full Changelog: https://github.com/gretelai/trainer/compare/v0.4.1...v0.5.0

  • v0.4.1(Nov 2, 2022)

    What's Changed

    • Add pip install command and Colab disclaimer to Benchmark notebook by @mikeknep in https://github.com/gretelai/trainer/pull/22
    • Include dataset name in trainer uploads. by @kboyd in https://github.com/gretelai/trainer/pull/21
    • Docs improvements by @MasonEgger (https://github.com/gretelai/trainer/pull/23 https://github.com/gretelai/trainer/pull/24 https://github.com/gretelai/trainer/pull/28 https://github.com/gretelai/trainer/pull/26)
    • Add support for Gretel Amplify by @pimlock in https://github.com/gretelai/trainer/pull/29

    New Contributors

    • @kboyd made their first contribution in https://github.com/gretelai/trainer/pull/21
    • @MasonEgger made their first contribution in https://github.com/gretelai/trainer/pull/23
    • @pimlock made their first contribution in https://github.com/gretelai/trainer/pull/29

    Full Changelog: https://github.com/gretelai/trainer/compare/v0.4.0...v0.4.1

  • v0.4.0(Oct 6, 2022)

    What's Changed

    • Initial release of new Benchmark module :rocket: by @mikeknep in https://github.com/gretelai/trainer/pull/19
    • Create simple-conditional-generation.ipynb :notebook: by @zredlined in https://github.com/gretelai/trainer/pull/18

    Full Changelog: https://github.com/gretelai/trainer/compare/v0.3.0...v0.4.0

  • v0.3.0(Aug 30, 2022)

  • v0.2.3(Aug 24, 2022)

    What's Changed

    • The trainer now chooses the best model configuration based on input training data when model_type is not specified in advance at Trainer instantiation (previously defaulted to GretelLSTM)
    • train accepts an optional delimiter argument (defaults to comma when unspecified)
    • Input training data is divided more equally across row partitions
    • LSTM models generate a consistent number of records (5000) during training (previously this matched the size of the input training data)
    • Fixed trainer generate to synthesize the correct number of records when multiple row partitions are used
    • Fixed trainer get_sqs_score method

    Full Changelog: https://github.com/gretelai/trainer/compare/v0.2.2...v0.2.3

  • v0.2.2(Aug 11, 2022)

    What's Changed

    • Update default model config by @zredlined in https://github.com/gretelai/trainer/pull/10
    • Remove project delete instruction by @drew in https://github.com/gretelai/trainer/pull/11
    • CTGAN and conditional data generation by @zredlined in https://github.com/gretelai/trainer/pull/12
    • Get average sqs score from across partitions by @mikeknep in https://github.com/gretelai/trainer/pull/14

    Full Changelog: https://github.com/gretelai/trainer/compare/v0.2.1...v0.2.2

  • v0.2.1(Jun 16, 2022)

  • v0.2.0(Jun 10, 2022)

  • v0.1.0(Jun 10, 2022)

Owner
Gretel.ai
Gretel.ai Open Source Projects and Tools