Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

Overview

gpt-2-simple

A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifically the "small" 124M and "medium" 355M hyperparameter versions). Additionally, this package makes text generation easier: it can generate to a file for easy curation, and it accepts prefixes that force the text to start with a given phrase.

This package incorporates and makes minimal low-level changes to:

  • Model management from OpenAI's official GPT-2 repo (MIT License)
  • Model finetuning from Neil Shepperd's fork of GPT-2 (MIT License)
  • Text generation output management from textgenrnn (MIT License / also created by me)

For finetuning, it is strongly recommended to use a GPU, although you can generate using a CPU (albeit much more slowly). If you are training in the cloud, using a Colaboratory notebook or a Google Compute Engine VM w/ the TensorFlow Deep Learning image is strongly recommended, as the GPT-2 model is hosted on GCP.

You can use gpt-2-simple to retrain a model using a GPU for free in this Colaboratory notebook, which also demos additional features of the package.

Install

gpt-2-simple can be installed via PyPI:

pip3 install gpt-2-simple

You will also need to install the corresponding TensorFlow for your system (e.g. tensorflow or tensorflow-gpu). TensorFlow 2.0 is currently not supported and the package will throw an assertion if loaded, so TensorFlow 1.14/1.15 is recommended.

Usage

An example for downloading the model to the local system, finetuning it on a dataset, and generating some text.

Warning: the pretrained 124M model, and thus any finetuned model, is 500 MB! (the pretrained 355M model is 1.5 GB)

import gpt_2_simple as gpt2
import os
import requests

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
	print(f"Downloading {model_name} model...")
	gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/124M/


file_name = "shakespeare.txt"
if not os.path.isfile(file_name):
	url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
	data = requests.get(url)
	
	with open(file_name, 'w') as f:
		f.write(data.text)
    

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              steps=1000)   # steps is max number of training steps

gpt2.generate(sess)

The generated model checkpoints are by default in /checkpoint/run1. If you want to load a model from that folder and generate text from it:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

gpt2.generate(sess)

As with textgenrnn, you can generate and save text for later use (e.g. an API or a bot) by using the return_as_list parameter.

single_text = gpt2.generate(sess, return_as_list=True)[0]
print(single_text)

You can pass a run_name parameter to finetune and load_gpt2 if you want to store/load multiple models in a checkpoint folder.

There is also a command-line interface for both finetuning and generation with strong defaults for just running on a Cloud VM w/ GPU. For finetuning (which will also download the model if not present):

gpt_2_simple finetune shakespeare.txt

And for generation, which generates texts to files in a gen folder:

gpt_2_simple generate

Most of the same parameters available in the functions are available as CLI arguments, e.g.:

gpt_2_simple generate --temperature 1.0 --nsamples 20 --batch_size 20 --length 50 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 5

See below for what some of the CLI arguments do.

NB: Restart the Python session first if you want to finetune on another dataset or load another model.

Differences Between gpt-2-simple And Other Text Generation Utilities

The method GPT-2 uses to generate text is slightly different from that of other packages like textgenrnn (specifically, it generates the full text sequence purely on the GPU and decodes it later), which cannot easily be fixed without hacking the underlying model code. As a result:

  • In general, GPT-2 is better at maintaining context over its entire generation length, making it good for generating conversational text. The text is also generally grammatically correct, with proper capitalization and few typos.
  • The original GPT-2 model was trained on a very large variety of sources, allowing the model to incorporate idioms not seen in the input text.
  • GPT-2 can only generate a maximum of 1024 tokens per request (about 3-4 paragraphs of English text).
  • GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the truncate parameter to a generate function to only collect text until a specified end token. You may want to reduce length appropriately.)
  • Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
  • When finetuning GPT-2, it has no sense of the beginning or end of a document within a larger text. You'll need to use a bespoke character sequence to indicate the beginning and end of a document. Then while generating, you can specify a prefix targeting the beginning token sequence, and a truncate targeting the end token sequence. You can also set include_prefix=False to discard the prefix token while generating (e.g. if it's something unwanted like <|startoftext|>).
  • If you pass a single-column .csv file to finetune(), it will automatically parse the CSV into a format ideal for training with GPT-2 (including prepending <|startoftext|> and suffixing <|endoftext|> to every text document, so the truncate tricks above are helpful when generating output). This is necessary to handle both quotes and newlines in each text document correctly.
  • GPT-2 allows you to generate texts in parallel by setting a batch_size that is divisible into nsamples, resulting in much faster generation. Works very well with a GPU (can set batch_size up to 20 on Colaboratory's K80)!
  • Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 124M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with batch_size=1, and about 88% of the V100 GPU.
  • If you have a partially-trained GPT-2 model and want to continue finetuning it, you can set overwrite=True to finetune, which will continue training and remove the previous iteration of the model without creating a duplicate copy. This can be especially useful for transfer learning (e.g. heavily finetune GPT-2 on one dataset, then finetune on another dataset to get a "merging" of both datasets).
  • If your input text dataset is massive (>100 MB), you may want to preencode and compress the dataset using gpt2.encode_dataset(file_path). The output is a compressed .npz file which will load much faster into the GPU for finetuning.
  • The 774M "large" model may not support finetuning because it will cause modern GPUs to go out-of-memory (you may get lucky if you use a P100 GPU on Colaboratory). However, you can still generate from the default pretrained model using gpt2.load_gpt2(sess, model_name='774M') and gpt2.generate(sess, model_name='774M').
  • The 1558M "extra large" (true) model may not work out-of-the-box with the GPU included with the Colaboratory Notebook. More testing is needed to identify optimal configurations for it.
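The document-delimiter convention from the list above can be sketched in plain Python. This is only an illustration and not the package's own CSV parsing or generation code: the helper names wrap_documents and truncate_sample are hypothetical, and only the <|startoftext|> and <|endoftext|> tokens and the prefix / truncate / include_prefix ideas come from the text above.

```python
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_documents(documents):
    """Join documents into one training text, delimiting each with start/end tokens."""
    return "\n".join(f"{START_TOKEN}{doc}{END_TOKEN}" for doc in documents)


def truncate_sample(sample, prefix=START_TOKEN, truncate=END_TOKEN, include_prefix=False):
    """Imitate the effect of prefix / truncate / include_prefix on one generated sample."""
    if not include_prefix and sample.startswith(prefix):
        sample = sample[len(prefix):]          # drop the unwanted start token
    end = sample.find(truncate)
    if end != -1:
        sample = sample[:end]                  # keep only the text before the end token
    return sample


training_text = wrap_documents(["First short document.", "Second one,\nwith a newline."])

# A made-up raw sample: generation does not stop at the end token, so text spills over.
raw_sample = "<|startoftext|>A generated title<|endoftext|>\n<|startoftext|>Spillover text"
print(truncate_sample(raw_sample))
```

Because generation does not stop at the end token, anything after it in a sample is spillover; cutting at the first end token is what makes a reduced length safe to use.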

Interactive Apps Using gpt-2-simple

  • gpt2-small — App using the default GPT-2 124M pretrained model
  • gpt2-reddit — App to generate Reddit titles based on a specified subreddit and/or keyword(s)
  • gpt2-mtg — App to generate Magic: The Gathering cards

Text Generation Examples Using gpt-2-simple

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Disclaimer

This repo has no affiliation or relationship with OpenAI.

Issues
  • Supporting Tensorflow 2

    Hello! The folks at @yaledhlab are huge fans of your work @minimaxir. I wanted to send on the first draft of a pull request to make the library function in Tensorflow 1 or 2.

    The only feature not ported to tf2 here is the memory saving gradients logic. The upstream library from which that logic is adopted still hasn't been ported to tf2, and some of the underlying graph traversal methods have been removed from tensorflow itself in 2.x, so there would be a good bit of work getting this running.

    That said, the memory saving gradients are simply a performance enhancement. For interested parties, there's a thread going in @cybertronai/gradient-checkpointing and another thread in the @tensorflow repo that discusses some decorator-based approaches to gradient checkpointing in tf2...

    In any event, thanks for this great work!

    opened by duhaime 19
  • Finetune Json Issue

    When running any of the text files I created the program complains about the following issue regardless of the text file.

    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess,
                  'java_train_java.txt',
                  model_name=model_name,
                  steps=1000)   # steps is max number of training steps

    JSONDecodeError                           Traceback (most recent call last)
    in
          6               'java_train_java.txt',
          7               model_name=model_name,
    ----> 8               steps=1000)   # steps is max number of training steps

    ~/.local/lib/python3.5/site-packages/gpt_2_simple/gpt_2.py in finetune(sess, dataset, steps, model_name, model_dir, combine, batch_size, learning_rate, accumulate_gradients, restore_from, run_name, checkpoint_dir, sample_every, sample_length, sample_num, save_every, print_every, max_checkpoints, use_memory_saving_gradients, only_train_transformer_layers, overwrite)
        153             raise(fnf_error)
        154
    --> 155     enc = encoder.get_encoder(checkpoint_path)
        156     hparams = model.default_hparams()
        157     with open(os.path.join(checkpoint_path, 'hparams.json')) as f:

    ~/.local/lib/python3.5/site-packages/gpt_2_simple/src/encoder.py in get_encoder(checkpoint_path)
        108 def get_encoder(checkpoint_path):
        109     with open(os.path.join(checkpoint_path, 'encoder.json'), 'r') as f:
    --> 110         encoder = json.load(f)
        111     with open(os.path.join(checkpoint_path, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        112         bpe_data = f.read()

    /usr/lib/python3.5/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
        266         cls=cls, object_hook=object_hook,
        267         parse_float=parse_float, parse_int=parse_int,
    --> 268         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
        269
        270

    /usr/lib/python3.5/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
        317             parse_int is None and parse_float is None and
        318             parse_constant is None and object_pairs_hook is None and not kw):
    --> 319         return _default_decoder.decode(s)
        320     if cls is None:
        321         cls = JSONDecoder

    /usr/lib/python3.5/json/decoder.py in decode(self, s, _w)
        337
        338         """
    --> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
        340         end = _w(s, end).end()
        341         if end != len(s):

    /usr/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
        355             obj, end = self.scan_once(s, idx)
        356         except StopIteration as err:
    --> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
        358         return obj, end

    JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    opened by Rafcin 12
  • Improved Sampling (Nucleus Sampling)

    A recent paper finds that beam search actually does even worse than greedy search for GPT-2 117M, and proposes a new sampling method to improve text output.

    Neural Text Degradation

    From what I understand, it's like top-k sampling, except that instead of a fixed k tokens, it takes the most likely tokens until their combined probability is > some parameter p.
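The description above can be sketched in plain Python over a toy next-token distribution. This is a minimal illustration under stated assumptions, not the paper's or the package's implementation: the function names and the example probabilities are made up, and the cutoff is taken here as "combined probability reaches p".

```python
import random


def nucleus_filter(probs, p=0.9):
    """Keep the most likely tokens until their combined probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept}


def sample_nucleus(probs, p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    weights = [filtered[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution over the next token.
next_token_probs = {"the": 0.5, "a": 0.3, "his": 0.15, "zebra": 0.05}
print(sorted(nucleus_filter(next_token_probs, p=0.9)))
print(sample_nucleus(next_token_probs, p=0.9))
```

Unlike a fixed k, the number of kept tokens adapts to the distribution: a peaked distribution keeps very few candidates, a flat one keeps many.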

    Here's an example from the paper:

    🗣Context: On days when he woke early, the president liked to have coffee in the oval office. There was something about watching the sky grow lighter and lighter as you sat your pajama'd behind in the most powerful chair in the free world, sipping marine strength coffee, that you just couldn't achieve anywhere else.


    💰Greedy: He'd been in the oval office for over a year now, and he'd never been so happy to be there. He'd been in the oval office for almost a year, and he'd never been so happy to be there.


    🚦BeamSearch (k=4): The President was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word.


    🥄Sampling (t=1.0): You couldn't be sure if that's what you were really doing, and If you decided to take the day off. The president wanted you to take the day off, but he wanted to maintain a curfew and use his influence wisely.


    🇰Top-k Sampling (k=8): "I'm sorry, sir." "No, it's okay, son," the president said. "I understand." "You're going to have to make a special trip down there to get that kid. He has no idea where he's going."


    ⚛️Nucleus Sampling (p=0.9): But that wasn't what drew the president's attention. He'd been seated for maybe a minute when he noticed the other man. What was the guy doing here?


    🗣Gold: He was therefore disagreeably surprised to find a man in an understated grey suit sitting in that selfsame chair sipping tea. The president turned around and went looking for his chief of staff.

    opened by bob80333 11
  • text generation quality for Chinese

    I use colab try to train a chinese novel , but result is not actually readable as below: ======== SAMPLE 1 ======== 是将吹雾挖出的将成功探测而出。令得她不过来了。 在这句她却便是张了什么处。可地云岚宗的基地本就有猜测的唶回纳,若是可见开一些他们纗地图下的威风。有什么那些自抗成形功探也是被实亀约不运的失踪,这个家伙?” “按一东西。 “以后?” 第一千两百四纳乎容其收获 双翼下午床 偂静以及云岚宗家族时,现在穿过尮层地死死一死的一位完全自人。若是被这位似乎么好。不过这些层地曘众而速的缘故。先前云岚宗家族与家伙破碎,也知道。” “按这些年边按一东西。” 双危得枯落双成一些纸藏。现在云岚宗家族。则是有着更是珋地的纳戒。一名一名视线成功探而来。似此如同一股落地被月地位置身给在落地墓墓吼。将三色山峰。都是在她身处的族人而出。他们。能够如何丧门两个家族事。我没有丝毫。比较给云岚宗家族身族成功压渐了过来二人。” 落地最后对此刻低低的落地。这些人吼力地双更驰在云岚宗这般种有些做完的同局一段时间。就在山脉交手吸了一圈。他们仅仅是将会从丝毫地毒间。那家伙。拥有会难以过足有山脉路。想必地实力。可怕的毒间不会速助引。” 心中现在也算是连脸色。纳戒的一道道人影击杀着自指大会回底独血之人。一个了。能够击

    My training parameters are:

    gpt2.finetune(sess, dataset="train.txt", model_name='345M', steps=1000, restore_from='fresh', print_every=20, sample_every=200, save_every=500)

    Since GPT-2 should be very powerful for text generation, I just want to check whether this quality is normal or whether there is something I have not figured out yet.

    Thanks

    opened by chiangandy 8
  • Multi-gpu support

    • Added automated gpu name gathering and model partitioning across multiple GPUs
    • Added boolean multi-gpu signature to finetune and load_gpt2 functions.
    • Added CLI option for multi_gpu

    Note: if layer == 10: in the model was removed so all layers are checkpointed. This can be reverted.

    opened by huntrontrakkr 7
  • GPT2 Chatbot

    Hi, I'm new to this stuff, and I'm trying to make a chatbot out of gpt2 using finetune... My question is, can I make gpt remember stuff like this:

    Q: Hey, I'm Joe.
    A: Hey, Joe.
    Q: What's my name?
    A: Joe.

    Right now, it can barely remember anything I write to it. Do I just need a better more dialogue-focused dataset to finetune it on or is there something else I can do to make it remember? I'm using the colab notebook.

    Thanks, I'm rather new to ML, so...

    opened by ZeroMaxinumXZ 7
  • Not able to load the dataset

    I have been trying to train the 117M model on a 1.03 GB dataset, on a machine with 64 GB of RAM. But while it loads the dataset, it gets stuck, and after about 30 minutes it just terminates. Here is the log.

    Fetching checkpoint: 1.00kit [00:00, 679kit/s]                                                      
    Fetching encoder.json: 1.04Mit [00:00, 16.5Mit/s]                                                   
    Fetching hparams.json: 1.00kit [00:00, 573kit/s]                                                    
    Fetching model.ckpt.data-00000-of-00001:  11%|#8               | 53.6M/498M [00:00<00:07, 62.2Mit/s]
    Fetching model.ckpt.data-00000-of-00001:  28%|#####3             | 141M/498M [00:01<00:03, 105Mit/s]
    Fetching model.ckpt.data-00000-of-00001:  46%|########7          | 230M/498M [00:02<00:02, 108Mit/s]
    Fetching model.ckpt.data-00000-of-00001:  63%|###########4      | 316M/498M [00:03<00:02, 66.6Mit/s]
    Fetching model.ckpt.data-00000-of-00001:  77%|#############8    | 384M/498M [00:04<00:01, 58.8Mit/s]
    Fetching model.ckpt.data-00000-of-00001:  92%|################6 | 460M/498M [00:06<00:00, 44.8Mit/s]
    Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 72.4Mit/s]                                  
    Fetching model.ckpt.index: 6.00kit [00:00, 3.39Mit/s]                                               
    Fetching model.ckpt.meta: 472kit [00:00, 9.86Mit/s]                                                 
    Fetching vocab.bpe: 457kit [00:00, 9.54Mit/s]
    2019-05-19 16:12:23.408514: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    
      0%|          | 0/1 [00:00<?, ?it/s]
    

    I also saw another issue which suggests cutting the text file. What would be the ideal size for training? If that is not the answer, what model size would work with a 1 GB text file?

    Help will be appreciated 👍

    opened by nvnvashisth 7
  • 0 tokens when attempting to finetune using .txt

    Using the Colaboratory notebook (have not tried locally), I tried loading a normal text file and it found 0 tokens.

    I've found that splitting on whitespace and turning my source file into a csv was the only way to get past this.

    All of the examples reference "shakespeare.txt" but that file isn't included in the repo so I have not been able to confirm what the tool is expecting from a plaintext file.

    opened by lukegalea 7
  • Fails to load dataset on Windows due to text encoding

    On Windows 10, when attempting to run this code on my own dataset, I run into this error:

    return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2243891: character maps to <undefined>

    Adding the parameter encoding='utf8' to line 33 of load_dataset.py: with open(path, 'r') as fp: appears to fix this issue. I'm not 100% sure, because now instead of erroring out immediately, it's using as much RAM as it can
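The fix described above amounts to passing an explicit encoding to open(). A minimal, self-contained sketch follows; the read_text helper and the sample strings are hypothetical and are not taken from load_dataset.py, and cp1252 is only assumed here as a typical Windows default behind the 'charmap' error.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding instead of the platform default."""
    with open(path, "r", encoding=encoding) as fp:
        return fp.read()


# Round-trip a small UTF-8 file that contains non-ASCII characters.
sample = "naïve café, Łódź"
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(sample)
    restored = read_text(path)

# "Ł" is encoded in UTF-8 as the bytes C5 81; byte 0x81 has no mapping in cp1252.
try:
    "Ł".encode("utf-8").decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(restored == sample, cp1252_failed)
```

The RAM usage mentioned above is a separate matter: an explicit encoding only changes how bytes are decoded, not how much of the file is held in memory.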

    I can open a pull request for this if you'd like.

    bug 
    opened by bob80333 7
  • Question about loss calculation

    Hi all,

    I've been reading the code, trying to understand the implementation, but I have a question about the loss function that possibly is kinda dumb.

    Suppose a target text: target="The fox went to the forest fast". Also suppose an input sample context=[The, fox, went, to, the, forest].

    The loss is calculated as:

    loss = tf.reduce_mean(
            input_tensor=tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=context[:, 1:], logits=output['logits'][:, :-1]))
    

    As far as I know, the goal is to predict the nth token given all the previous ones (i.e. "fast", given the context). However, if we set labels as [fox, went, to, the, forest], and the logits as the [0, n-1] tokens, how can the model know the target for the last token (n)? Aren't we teaching the model to output the padded context and leaving as random the last token?
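One way to read the slicing in that loss is to list which prefix each label is paired with. The sketch below uses plain Python lists in place of tensors and is only an illustration of the shift, assuming a causal model in which the logits at position i are conditioned on tokens 0..i; it is not code from the package.

```python
tokens = ["The", "fox", "went", "to", "the", "forest"]

# logits[:, :-1] -> predictions made at positions 0..n-2, each conditioned on tokens[:i + 1]
prefixes = [tokens[: i + 1] for i in range(len(tokens) - 1)]
# context[:, 1:] -> the token that actually follows each of those prefixes
labels = tokens[1:]

for prefix, target in zip(prefixes, labels):
    print(" ".join(prefix), "->", target)
```

Under this reading, every position is trained to predict its successor, and the logits at the last position are dropped simply because the sample contains no following token to compare against. A word such as "fast" becomes a target only in samples whose context includes it.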

    Sorry for the inconvenience.

    opened by AIRLegend 6
  • Resolved most deprecation warnings

    I hate hiding warnings/errors but these ones were driving me crazy, so I went through and fixed up all the deprecation warnings I could find. There are still two (that I found) out there that I couldn't figure out how to fix, but this clears up most of them.

    opened by charliekmorris 6
  • How to train on text with fewer than 1,024 tokens?

    Is it possible to override gpt-2-simple and train a model on text containing fewer than 1,024 tokens? I understand the quality won't be Olympian, but for some of my data that's all I have.

    opened by Mennaruuk 0
  • Impossible to finetune model bigger than 124M or 125M parameters (colab)

    When I try to finetune GPT-2 355M (because GPT-Neo 350M is broken), no matter what I do I always get a CUDA OOM error. Nothing helps: not a T4 (and no, I'm not planning on getting Colab Pro just to play with funy stupeed text gen AI), not fp16 (by the way, that doesn't work either), and not even gradient_checkpointing=True. What the duck am I supposed to do? Create VRAM out of thin air? Can't there be something that limits the VRAM and empties it, to avoid not being able to train, losing half of the progress, or having to factory reset because the VRAM is full of the previous failed training attempt? (I find it astonishing that there's no way to empty the VRAM other than a factory reset of the VM. Like, it's VRAM, not disk storage.)

    Any potentially useful help is appreciated. Sorry if I was a little rude, I'm just tired of stuff on GitHub being more broken than a chair that has been set on fire.

    opened by Blazeolmo 1
  • Added UTF-8 support for special characters

    The script now imports codecs and uses it to open output files in UTF-8 format, circumventing exceptions when trying to print certain special characters.

    opened by sobisonator 0
  • New transformer architecture

    I see that, like GANs in their time, a lot of new transformer architectures came out. Is there a better one that we could use now than gpt-2 to generate for example, in my case, ABC music?

    opened by aletote 0
  • Finetuning multiple datasets

    Let's say I have 2 different datasets and I want to train them into one model. Would I need to combine the two files into one first and then finetune on it? Or can I run two separate runs? I would reckon that combining them first would cause memory problems when working with a lot of data, but I've tried running them separately one after another and that just messes it up. How would I go about doing this?
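For the "combine first" option, the merge itself does not need to hold either file in memory. The sketch below streams the inputs line by line into one output file; the combine_datasets helper and the file names are hypothetical, and using <|endoftext|> as the separator is an assumption borrowed from the delimiter convention in the README above. Whether a combined file trains better than two sequential runs is not something this sketch settles.

```python
import os
import tempfile

END_TOKEN = "<|endoftext|>"


def combine_datasets(paths, out_path, delimiter=END_TOKEN):
    """Stream each input file into out_path, writing a delimiter line after each one."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            ends_with_newline = True
            with open(path, "r", encoding="utf-8") as src:
                for line in src:               # one line at a time, never the whole file
                    out.write(line)
                    ends_with_newline = line.endswith("\n")
            if not ends_with_newline:
                out.write("\n")
            out.write(delimiter + "\n")


# Tiny demonstration with two temporary files standing in for real datasets.
with tempfile.TemporaryDirectory() as tmp_dir:
    first = os.path.join(tmp_dir, "first.txt")
    second = os.path.join(tmp_dir, "second.txt")
    merged = os.path.join(tmp_dir, "combined.txt")
    with open(first, "w", encoding="utf-8") as fp:
        fp.write("first dataset\n")
    with open(second, "w", encoding="utf-8") as fp:
        fp.write("second dataset")
    combine_datasets([first, second], merged)
    with open(merged, "r", encoding="utf-8") as fp:
        combined = fp.read()

print(combined)
```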

    opened by kobel240 1
  • Error when finetuning my model : ValueError: Variable model/wpe already exists

    Hi, I run into an issue when I try to finetune over my own model: I get ValueError: Variable model/wpe already exists, as mentioned in #167. I reset the session, but still get the error.

    I trained it from a google colab: here are the code blocks

    Load model and run from drive

    gpt2.download_gpt2(model_name="124M")
    
    gpt2.copy_checkpoint_from_gdrive(run_name='my_model1')
    

    reset session and load the run

    sess = gpt2.reset_session(sess=sess)
    sess = gpt2.start_tf_sess()
    
    gpt2.load_gpt2(sess, run_name='my_model1')
    

    finetune the model

    gpt2.finetune(sess,
                  dataset="source.txt",
                  model_name='124M',
                  steps=1000, 
                  #restore_from='fresh',
                  restore_from='latest',
                  run_name='my_model2',
                  print_every=10,
                  sample_every=200,
                  save_every=500
                  )
    

    Thanks in advance

    opened by Fqlox 0
Releases(v0.8.1)
  • v0.8.1(Oct 18, 2021)

    Thanks to https://github.com/YaleDHLab via https://github.com/minimaxir/gpt-2-simple/pull/275, gpt-2-simple now supports TensorFlow 2 by default, and the minimum TensorFlow version is now 2.5.1! The Colab Notebook has also been updated to no longer use TensorFlow 1.X.

    Note: Development on gpt-2-simple has mostly been superseded by aitextgen, which has similar AI text generation capabilities with more efficient training time and resource usage. If you do not require using TensorFlow, I recommend using aitextgen instead. Checkpoints trained using gpt-2-simple can be loaded using aitextgen as well.

  • v0.7.2(Feb 14, 2021)

  • v0.7.1(Dec 28, 2019)

  • v0.7(Dec 1, 2019)

  • v0.6(Aug 28, 2019)

    • 774M is explicitly blocked from being fine-tuned and will trigger an assert if attempted. If a way to finetune it without being super-painful is added, the ability to finetune it will be restored.
    • Allow ability to generate text from the default pretrained models by passing model_name to gpt2.load_gpt2() and gpt2.generate() (this will work with 774M).
    • Add sgd as an optimizer parameter to finetune (default: adam)
    • Support for changed model names, w/ changes more prominent in the README.
  • v0.5.4(Jul 29, 2019)

    Merged a few PRs:

    • Fixed generate cmd run name: #78
    • Resolved most deprecation warnings: #83
    • Optional model parameters: #90

    This does not make the package fully TF 2.0 compatible, but it's a big step!

  • v0.5.3(Jun 19, 2019)

  • v0.5.2(Jun 18, 2019)

  • v0.5.1(Jun 16, 2019)

  • v0.5(May 20, 2019)

    Adapted a few functions from Neil Shepperd's fork:

    • Nucleus Sampling (top_p) when generating text, which produces surprisingly different output (setting top_p=0.9 works well). Supersedes top_k when used. (#51)
    • An encode_dataset() function to preencode and compress a large dataset before loading it for finetuning. (#19, #54)

    Improvements to continuing model training:

    • overwrite argument for finetune: with restore_from="latest", this continues model training without creating a duplicate copy of the model, and is therefore good for transfer learning using multiple datasets (#20)
    • You can continue to finetune a model without having the original GPT-2 model present.

    Improvements with I/O involving Colaboratory

    • Checkpoint folders are now packaged into a .tar file when copying to Google Drive, and when copying from Google Drive, the '.tar' file is automatically unpackaged into the correct checkpoint format. (you can pass copy_folder=True to the copy_checkpoint function to revert to the old behavior). (#37: thanks @woctezuma !)
    • copy_checkpoint_to_gdrive and copy_checkpoint_from_gdrive now take a run_name argument instead of a checkpoint_folder argument.

    Miscellaneous

    • Added CLI arguments for top_k, top_p, overwrite.
    • Cleaned up redundant function parameters (#39)
  • v0.4.2(May 5, 2019)

    • load_gpt2() in a fresh session is much faster and uses much less memory when loaded. (for the 117M model, the system will stay under 2 GB RAM, which is the critical point for cloud services)
    • start_tf_sess() now accepts a threads parameter, which is useful if you know exactly how many threads will be used.
  • v0.4.1(May 5, 2019)

  • v0.4(May 5, 2019)

  • v0.3.1(Apr 23, 2019)

    • Fix one-off error where checkpoint saved a step early.
    • Fix issue where restore_from='fresh' uses the counter from a previously-trained checkpoint.
    • If restore_from='latest', steps will now train for the specified number of steps, instead of training until the specified number of steps. (#13, #14)
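The difference between training "for" and training "until" a number of steps is simple arithmetic on the restored step counter. The sketch below is illustrative only and is not the package's training loop: the function name and the relative flag are hypothetical.

```python
def last_training_step(restored_counter, steps, relative=True):
    """Return the step counter at which training would stop."""
    if relative:
        # Behaviour described for v0.3.1: train for `steps` additional steps.
        return restored_counter + steps
    # Earlier behaviour as described: train until the counter reaches `steps`.
    return max(restored_counter, steps)


# A checkpoint restored at step 1000, asked to train with steps=500.
print(last_training_step(1000, 500, relative=True))
print(last_training_step(1000, 500, relative=False))
```

In the "until" reading, a checkpoint already past the requested step count would not train at all, which is why the relative interpretation matters when continuing a run.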
  • v0.3(Apr 21, 2019)

  • v0.2(Apr 20, 2019)

Owner
Max Woolf
Data Scientist @buzzfeed. Plotter of pretty charts.