Code related to "Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity" paper

Overview

DataTuner

You have just found the DataTuner. This repository provides tools for fine-tuning language models for a task.

Installation

Environment Creation

Assuming you have an existing conda setup, you can set up the environment with the following script. To activate the conda environment from within the bash script, you need to pass the location of the conda.sh file:

bash setup.sh  ~/miniconda3/etc/profile.d/conda.sh

You can update your existing environment:

conda env update -f=environment.yml

To start development, activate your environment:

conda activate finetune

Alternatively, you can always use the python binary with the absolute path, e.g.: ~/miniconda3/envs/finetune/bin/python.

Data

For any task you want to fine-tune on, you need the data to be a json file containing a list of json objects, one per data point. For example:

[
  {
    "question": "question text 1",
    "query": "query 1"
  },
  {
    "question": "question text 2",
    "query": "query 2 with [SpecialToken example]"
  }
]

The library assumes that you have placed your data in a single directory with three files: train.json, validation.json, and test.json.
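As a minimal sketch of producing this layout, the following Python snippet writes the three files from an in-memory list of records. The helper write_splits, the directory name my_dataset, and the 80/10/10 split are illustrative assumptions and are not part of DataTuner.

```python
import json
import tempfile
from pathlib import Path


def write_splits(records, out_dir, train_frac=0.8, val_frac=0.1):
    """Write train.json, validation.json and test.json into out_dir.

    Hypothetical helper: the split fractions are arbitrary choices.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    n_train = int(len(records) * train_frac)
    n_val = int(len(records) * val_frac)
    splits = {
        "train": records[:n_train],
        "validation": records[n_train:n_train + n_val],
        "test": records[n_train + n_val:],
    }
    for name, rows in splits.items():
        # Each file holds a single JSON list with one object per data point.
        with open(out_dir / f"{name}.json", "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2, ensure_ascii=False)
    return {name: len(rows) for name, rows in splits.items()}


records = [
    {"question": f"question text {i}", "query": f"query {i}"}
    for i in range(10)
]
dataset_dir = Path(tempfile.mkdtemp()) / "my_dataset"
counts = write_splits(records, dataset_dir)
print(counts)
```

The resulting directory can then be passed to the training script via --dataset_path.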

Configuration

Now that we have the data in shape, we need to create a new task configuration file that specifies how we want the data to be formatted and what fields should be considered. You can create new config files in the folder src/datatuner/lm/task_configs.

A typical config file would look as follows:

{
    "name": "dataset_name",
    "data_shape": [
        {
            "id": "<question>",
            "type": "special",
            "learn": false
        },
        {
            "id": "question",
            "type": "text",
            "learn": false
        },
        {
            "id": "<query>",
            "type": "special",
            "learn": false
        },
        {
            "id": "query",
            "type": "text",
            "learn": true,
            "metrics": [
                "match"
            ]
        }
    ],
    "extra_special_tokens": ["[SpecialToken"],
    "extra_fields": []
}

For each item in the data shape:

  • type (required): special if special token, text if normal text.
  • id (required): the special token ID if type is special; the key for the text in the json data if type is text.
  • learn (required): whether to allow the model to learn this part of the text. If false, the model masks that part during fine-tuning.
  • metrics (optional): the list of metrics that the model should compute upon evaluation. Each metric should have a corresponding function with the same name in metrics.py.
  • converter (optional): the name of the converter function in converters.py to apply on that text field after reading the text from the file.
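To make the effect of learn concrete, here is a rough sketch of how such masking is commonly done when fine-tuning language models: label positions that belong to segments with learn set to false are replaced by an ignore index, so they contribute nothing to the loss. The function build_labels, the toy token IDs, and the value -100 (the default ignore_index of PyTorch's CrossEntropyLoss) are assumptions for illustration; DataTuner's actual implementation may differ.

```python
IGNORE_INDEX = -100  # assumption: default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Concatenate segments into input IDs and loss labels.

    segments: list of (token_ids, learn) pairs, in data_shape order.
    Tokens from segments with learn=False keep their place in the input
    but are masked out of the labels.
    """
    input_ids, labels = [], []
    for token_ids, learn in segments:
        input_ids.extend(token_ids)
        if learn:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token IDs standing in for the four items of the example config.
segments = [
    ([1], False),           # <question> special token
    ([10, 11, 12], False),  # question text
    ([2], False),           # <query> special token
    ([20, 21], True),       # query text, the only learned part
]
input_ids, labels = build_labels(segments)
print(input_ids)
print(labels)
```

With the example config, only the tokens of the query field would carry a training signal; the question and the special tokens serve purely as context.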

The value of extra_special_tokens is a list of special tokens to be added to the vocabulary. Alternatively (especially if the list is too long or is generated automatically), you can create a text file with one special token per line and pass that as an argument during training via the --special_tokens_file argument.
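For the file-based option, a sketch along these lines could generate the file. The helper write_special_tokens, the token names other than [SpecialToken, and the file name special_tokens.txt are made up for illustration; only the one-token-per-line layout and the --special_tokens_file argument come from the description above.

```python
import tempfile
from pathlib import Path


def write_special_tokens(tokens, path):
    """Write one special token per line, dropping blanks and duplicates."""
    # dict.fromkeys keeps the first occurrence of each token, in order.
    unique = list(dict.fromkeys(t.strip() for t in tokens if t.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Hypothetical token list, e.g. collected automatically from the dataset.
tokens = ["[SpecialToken", "[AnotherToken", "", "[SpecialToken"]
tokens_file = Path(tempfile.mkdtemp()) / "special_tokens.txt"
written = write_special_tokens(tokens, tokens_file)
print(written)
```

The path of the resulting file would then be supplied to train.py through --special_tokens_file.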

The value of extra_fields is a list of additional fields to include from the input json files to output during evaluation, aside from the main fields used as inputs/outputs.
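Before training, it can help to sanity-check a new config against the rules listed above. The following is a minimal sketch; check_task_config is a hypothetical helper written for this README and does not exist in DataTuner, and it only checks the required keys and the two documented type values.

```python
import json

REQUIRED_KEYS = ("id", "type", "learn")
ALLOWED_TYPES = ("special", "text")


def check_task_config(config):
    """Return a list of problems found in a task config dict (empty if none)."""
    problems = []
    for i, item in enumerate(config.get("data_shape", [])):
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append(f"data_shape[{i}]: missing required key '{key}'")
        if "type" in item and item["type"] not in ALLOWED_TYPES:
            problems.append(f"data_shape[{i}]: unknown type '{item['type']}'")
        if "learn" in item and not isinstance(item["learn"], bool):
            problems.append(f"data_shape[{i}]: 'learn' must be true or false")
    return problems


config = json.loads("""
{
    "name": "dataset_name",
    "data_shape": [
        {"id": "<question>", "type": "special", "learn": false},
        {"id": "question", "type": "text", "learn": false},
        {"id": "<query>", "type": "special", "learn": false},
        {"id": "query", "type": "text", "learn": true, "metrics": ["match"]}
    ],
    "extra_special_tokens": ["[SpecialToken"],
    "extra_fields": []
}
""")
print(check_task_config(config))

# A deliberately broken item: 'learn' is missing and the type is not recognized.
broken = {"data_shape": [{"id": "query", "type": "txt"}]}
print(check_task_config(broken))
```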

Training

The training script train.py can be used in single GPU or multi GPU settings.

cd src/datatuner/lm

# single gpu
python train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

# multi gpu
python -m torch.distributed.launch --nproc_per_node=4 train.py --model_checkpoint ~/data/openai-gpt/  --dataset_path ../../../data/my_dataset/  --task_config ./task_configs/my_task_config.json --n_epoch 3 --lr 1e-5

Evaluating the Model

You can run the following to evaluate the model on any test set. The data format is the same as for the training data. Note that you currently have to specify the model_type parameter matching the model you're loading:

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt --top_k 1

# or if you just want to evaluate the latest model you trained 
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --filename ../../../data/my_dataset/test.json --max_length 200 --model_type gpt  --top_k 1

# or if you want to use the latest intermediate checkpoint while the model is training, first copy it to pytorch_model.bin in the run folder:
RUN=$(ls -t ./runs | head -1) && CHECKPOINT=$(ls -t ./runs/$RUN/checkpoint* | head -1) && cp $CHECKPOINT runs/$RUN/pytorch_model.bin

During evaluation, the outputs that do not exactly match the expected outputs are printed, followed by the metrics (a dictionary with keys <metric_name>_<field_name>). At the end of evaluation, all the generated outputs are saved in the file eval_results/<run_folder_name>/<task_name>_<test_file_name>_<model_type>_generated.json.
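As a small illustration of these naming patterns, the sketch below assembles a metric key and the output path from their parts. Both helper functions are hypothetical, and whether <test_file_name> includes the .json extension is not stated here; the example assumes it is the file name without extension.

```python
from pathlib import Path


def metric_key(metric_name, field_name):
    """Key under which a metric for a given field appears in the results."""
    return f"{metric_name}_{field_name}"


def generated_outputs_path(run_folder_name, task_name, test_file_name, model_type):
    """Path of the generated-outputs file, following the pattern described above."""
    filename = f"{task_name}_{test_file_name}_{model_type}_generated.json"
    return Path("eval_results") / run_folder_name / filename


key = metric_key("match", "query")
print(key)

# Assumption: test_file_name is passed without the .json extension.
path = generated_outputs_path("2020-01-01_01-01-01", "dataset_name", "test", "gpt")
print(path.as_posix())
```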

Interacting with the Model

You can also interact with the models. The client will ask you to input the fields required, and it will generate the fields it learnt.

cd src/datatuner/lm

python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/2020-01-01_01-01-01  --max_length 200 --model_type gpt  --top_k 1 --input

# or if you just want to interact with the latest model you trained
RUN=$(ls -t ./runs | head -1) && python ./evaluate.py --task_config ./task_configs/my_task_config.json --model_checkpoint runs/$RUN  --max_length 200 --model_type gpt  --top_k 1 --input