Generating new names based on trends in data using GPT2 (Transformer network)

Last update: Jan 10, 2022


MLOpsNameGenerator

Overall Goal

The goal of the project is to develop a model capable of generating Pokémon names from their descriptions, while following good practices for project organization, version control, reproducibility, etc.

Framework

The framework we use is Hugging Face Transformers, specifically its Natural Language Processing (NLP) components. The model is GPT-2, which we fine-tune to specialize it for our specific problem.
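To fine-tune a causal language model such as GPT-2 on this task, each (description, name) pair has to be serialized into a single piece of text. The following is a minimal sketch of one possible layout, not the project's actual preprocessing: the `Example` class, the `Description:` / `Name:` prefixes, and both function names are assumptions made for illustration. Only the `<|endoftext|>` token comes from GPT-2 itself.

```python
from dataclasses import dataclass

# Hypothetical field prefixes; the project may use a different layout.
DESC_PREFIX = "Description: "
NAME_PREFIX = "Name: "
EOS_TOKEN = "<|endoftext|>"  # GPT-2's end-of-text token


@dataclass
class Example:
    """One training pair (hypothetical container, not from the project)."""
    name: str
    description: str


def _normalize(text: str) -> str:
    """Collapse newlines, form feeds and repeated spaces into single spaces."""
    return " ".join(text.split())


def to_training_text(example: Example) -> str:
    """Serialize a pair as 'description, then name' so the model learns to emit the name."""
    return (
        f"{DESC_PREFIX}{_normalize(example.description)}\n"
        f"{NAME_PREFIX}{example.name}{EOS_TOKEN}"
    )


def to_prompt(description: str) -> str:
    """Build the inference prompt: same layout, cut right before the name."""
    return f"{DESC_PREFIX}{_normalize(description)}\n{NAME_PREFIX}"
```

Using the same layout for training text and for the inference prompt means that, at generation time, the model only has to continue the text after `Name: ` until it emits the end-of-text token.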

Data

Initially, we intend to use the description of each Pokémon from the PokéAPI, a RESTful API backed by a database of Pokémon details.

Relevant queries to the API:

Obtain the list of all Pokémon:

https://pokeapi.co/api/v2/pokedex/national

Get the description of each Pokémon:

https://pokeapi.co/api/v2/pokemon-species/{PKMN_SPECIE_NUMBER}
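The two queries above can be combined into a small download helper. This is a sketch under stated assumptions, not the project's `make_dataset.py`: the function names are hypothetical, and the response fields used here (`pokemon_entries`, `pokemon_species`, `flavor_text_entries`, `language`) reflect our understanding of the PokéAPI v2 responses and should be checked against the API documentation.

```python
from __future__ import annotations

import json
from urllib.request import Request, urlopen

BASE_URL = "https://pokeapi.co/api/v2"
POKEDEX_URL = f"{BASE_URL}/pokedex/national"


def species_url(species_number: int) -> str:
    """URL for a single species, matching the second query above."""
    return f"{BASE_URL}/pokemon-species/{species_number}"


def fetch_json(url: str) -> dict:
    """Download and decode one JSON document (requires network access)."""
    # An explicit User-Agent is set as a precaution; the value is arbitrary.
    request = Request(url, headers={"User-Agent": "name-generator-sketch"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def species_names(pokedex: dict) -> list[str]:
    """Extract species names from a national Pokédex response (assumed field names)."""
    return [entry["pokemon_species"]["name"] for entry in pokedex["pokemon_entries"]]


def english_descriptions(species: dict) -> list[str]:
    """Collect the distinct English flavor texts of one species, whitespace-normalized."""
    descriptions: list[str] = []
    for entry in species["flavor_text_entries"]:
        if entry["language"]["name"] != "en":
            continue
        # Flavor texts may contain line breaks and form feeds; collapse them.
        text = " ".join(entry["flavor_text"].split())
        if text not in descriptions:
            descriptions.append(text)
    return descriptions
```

A typical use would be `english_descriptions(fetch_json(species_url(25)))`. The parsing functions are kept separate from the download so they can be tested on saved responses without network access.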

Commands

make requirements: Installs all requirements from requirements.txt.
make devrequirements: Installs additional dependencies for development.
make datafolders: Creates the data folders of the project (data/raw, data/processed, data/external and data/interim).
make data: Downloads and processes the data.
make clean: Deletes compiled Python files.
make train: Trains the model.
make deploy: Uploads the updates after cleaning and fixing code style.
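As an illustration of what `make datafolders` sets up, the folder layout can be created with a few lines of Python. This is only a sketch: the function name `create_data_folders` is hypothetical, and the actual Makefile target may do this differently (for example with `mkdir -p`).

```python
from __future__ import annotations

from pathlib import Path

# Subfolders listed in the description of `make datafolders`.
DATA_SUBFOLDERS = ("raw", "processed", "external", "interim")


def create_data_folders(project_root: Path) -> list[Path]:
    """Create data/<subfolder> under project_root; safe to call repeatedly."""
    created = []
    for name in DATA_SUBFOLDERS:
        folder = project_root / "data" / name
        # parents=True creates `data` itself; exist_ok=True makes reruns a no-op.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_data_folders(Path("."))` from the repository root would create the four folders if they are missing.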

RoadMap

Week 1

The goal of this week is to set up the project. This includes setting up the Makefile, building the first model and a script to train it, fetching the data required for training, setting up Hydra to experiment with hyperparameters, and setting up Docker for containerization.

| Alba | Alejandro | Gustav |
|---|---|---|
| Data obtaining and processing | Test usage of GPT-2 | Develop model using GPT-2 |
| Hydra and config. files | Review and change structure of the train script | - |
| Add wandb to log training progress | Do predict script | - |

Week 2

Week 3

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Citations and references

PokéAPI

Movie name generation with GPT-2

Huggingface transformers

Huggingface notebooks

NameKrea An AI That Generates Domain Names

DTU Course 02476 - Machine Learning Operations

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner

Gustav Lang Moesmand
