Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

Last update: Jan 07, 2023

Overview

gpt-2-simple

A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifically the "small" 124M and "medium" 355M hyperparameter versions). Additionally, this package allows easier generation of text, generating to a file for easy curation, allowing for prefixes to force the text to start with a given phrase.

This package incorporates and makes minimal low-level changes to:

Model management from OpenAI's official GPT-2 repo (MIT License)
Model finetuning from Neil Shepperd's fork of GPT-2 (MIT License)
Text generation output management from textgenrnn (MIT License / also created by me)

For finetuning, it is strongly recommended to use a GPU, although you can generate using a CPU (albeit much more slowly). If you are training in the cloud, using a Colaboratory notebook or a Google Compute Engine VM w/ the TensorFlow Deep Learning image is strongly recommended (as the GPT-2 model is hosted on GCP).

You can use gpt-2-simple to retrain a model using a GPU for free in this Colaboratory notebook, which also demos additional features of the package.

Install

gpt-2-simple can be installed via PyPI:

pip3 install gpt-2-simple

You will also need to install the corresponding TensorFlow for your system (e.g. tensorflow or tensorflow-gpu). As of v0.8.1, TensorFlow 2 is supported by default and the minimum TensorFlow version is 2.5.1. Earlier releases do not support TensorFlow 2.0 (v0.7 added a hard assert for it), so TensorFlow 1.14/1.15 is recommended for them.

Usage

An example for downloading the model to the local system, finetuning it on a dataset, and generating some text.

Warning: the pretrained 124M model, and thus any finetuned model, is 500 MB! (the pretrained 355M model is 1.5 GB)

import gpt_2_simple as gpt2
import os
import requests

model_name = "124M"
if not os.path.isdir(os.path.join("models", model_name)):
	print(f"Downloading {model_name} model...")
	gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/124M/


file_name = "shakespeare.txt"
if not os.path.isfile(file_name):
	url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
	data = requests.get(url)
	
	with open(file_name, 'w') as f:
		f.write(data.text)
    

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              file_name,
              model_name=model_name,
              steps=1000)   # steps is max number of training steps

gpt2.generate(sess)

The generated model checkpoints are by default in /checkpoint/run1. If you want to load a model from that folder and generate text from it:

import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)

gpt2.generate(sess)

As with textgenrnn, you can generate and save text for later use (e.g. an API or a bot) by using the return_as_list parameter.

single_text = gpt2.generate(sess, return_as_list=True)[0]
print(single_text)

You can pass a run_name parameter to finetune and load_gpt2 if you want to store/load multiple models in a checkpoint folder.

There is also a command-line interface for both finetuning and generation with strong defaults for just running on a Cloud VM w/ GPU. For finetuning (which will also download the model if not present):

gpt_2_simple finetune shakespeare.txt

And for generation, which generates texts to files in a gen folder:

gpt_2_simple generate

Most of the same parameters available in the functions are available as CLI arguments, e.g.:

gpt_2_simple generate --temperature 1.0 --nsamples 20 --batch_size 20 --length 50 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 5

See below for what some of the CLI arguments do.

NB: Restart the Python session first if you want to finetune on another dataset or load another model.

Differences Between gpt-2-simple And Other Text Generation Utilities

The method GPT-2 uses to generate text is slightly different from that of other packages like textgenrnn (specifically, it generates the full text sequence purely in the GPU and decodes it later), which cannot easily be fixed without hacking the underlying model code. As a result:

In general, GPT-2 is better at maintaining context over its entire generation length, making it good for generating conversational text. The text is also generally grammatically correct, with proper capitalization and few typos.
The original GPT-2 model was trained on a very large variety of sources, allowing the model to incorporate idioms not seen in the input text.
GPT-2 can only generate a maximum of 1024 tokens per request (about 3-4 paragraphs of English text).
GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the truncate parameter to a generate function to only collect text until a specified end token. You may want to reduce length appropriately.)
Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
When finetuning GPT-2, it has no sense of the beginning or end of a document within a larger text. You'll need to use a bespoke character sequence to indicate the beginning and end of a document. Then while generating, you can specify a prefix targeting the beginning token sequences, and a truncate targeting the end token sequence. You can also set include_prefix=False to discard the prefix token while generating (e.g. if it's something unwanted like <|startoftext|>).
If you pass a single-column .csv file to finetune(), it will automatically parse the CSV into a format ideal for training with GPT-2 (including prepending <|startoftext|> and suffixing <|endoftext|> to every text document, so the truncate tricks above are helpful when generating output). This is necessary to handle both quotes and newlines in each text document correctly.
GPT-2 allows you to generate texts in parallel by setting a batch_size that is divisible into nsamples, resulting in much faster generation. Works very well with a GPU (can set batch_size up to 20 on Colaboratory's K80)!
Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 124M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with batch_size=1, and about 88% of the V100 GPU.
If you have a partially-trained GPT-2 model and want to continue finetuning it, you can set overwrite=True to finetune, which will continue training and remove the previous iteration of the model without creating a duplicate copy. This can be especially useful for transfer learning (e.g. heavily finetune GPT-2 on one dataset, then finetune on other dataset to get a "merging" of both datasets).
If your input text dataset is massive (>100 MB), you may want to preencode and compress the dataset using gpt2.encode_dataset(file_path). The output is a compressed .npz file which will load much faster into the GPU for finetuning.
The 774M "large" model may not support finetuning because it will cause modern GPUs to go out-of-memory (you may get lucky if you use a P100 GPU on Colaboratory). However, you can still generate from the default pretrained model using gpt2.load_gpt2(sess, model_name='774M') and gpt2.generate(sess, model_name='774M').
The 1558M "extra large" model (the true model) may not work out-of-the-box with the GPU included with the Colaboratory Notebook. More testing is needed to identify optimal configurations for it.
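
The start/end marker convention described in the list above can be sketched in plain Python. This is only an illustration of the idea, not the package's own implementation: wrap_documents and extract_first_document are hypothetical helper names, and only the <|startoftext|> and <|endoftext|> markers and the prefix/truncate/include_prefix parameter names come from the text above.

```python
START_TOKEN = "<|startoftext|>"
END_TOKEN = "<|endoftext|>"


def wrap_documents(documents):
    """Join documents into one training text, delimiting each with the markers."""
    return "\n".join(f"{START_TOKEN}{doc}{END_TOKEN}" for doc in documents)


def extract_first_document(generated, prefix=START_TOKEN, truncate=END_TOKEN,
                           include_prefix=False):
    """Imitate the effect of prefix/truncate/include_prefix on raw generated text."""
    text = generated
    # Keep only the text before the first end marker, if one is present.
    end = text.find(truncate)
    if end != -1:
        text = text[:end]
    # Optionally drop the leading prefix marker.
    if not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    return text


training_text = wrap_documents(["First title", "Second title"])
# Hypothetical raw model output that runs past the end of the first document.
raw_output = "<|startoftext|>A generated title<|endoftext|>\n<|startoftext|>Another"

print(training_text)
print(extract_first_document(raw_output))
```

Because generation cannot stop early at the end marker, the raw output may contain the start of a second document; cutting at the first end marker is what recovers a single clean text.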

Interactive Apps Using gpt-2-simple

gpt2-small — App using the default GPT-2 124M pretrained model
gpt2-reddit — App to generate Reddit titles based on a specified subreddit and/or keyword(s)
gpt2-mtg — App to generate Magic: The Gathering cards

Text Generation Examples Using gpt-2-simple

ResetEra — Generated video game forum discussions (GitHub w/ dumps)
/r/legaladvice — Title generation (GitHub w/ dumps)
Hacker News — Tens of thousands of generated Hacker News submission titles

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Disclaimer

This repo has no affiliation or relationship with OpenAI.

Comments

Supporting Tensorflow 2

Hello! The folks at @yaledhlab are huge fans of your work @minimaxir. I wanted to send on the first draft of a pull request to make the library function in Tensorflow 1 or 2.

The only feature not ported to tf2 here is the memory saving gradients logic. The upstream library from which that logic is adopted still hasn't been ported to tf2, and some of the underlying graph traversal methods have been removed from tensorflow itself in 2.x, so there would be a good bit of work getting this running.

That said, the memory saving gradients are simply a performance enhancement. For interested parties, there's a thread going in @cybertronai/gradient-checkpointing and another thread in the @tensorflow repo that discusses some decorator-based approaches to gradient checkpointing in tf2...

In any event, thanks for this great work!

opened by duhaime 19
Finetune Json Issue

When running any of the text files I created the program complains about the following issue regardless of the text file.

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              'java_train_java.txt',
              model_name=model_name,
              steps=1000)   # steps is max number of training steps

JSONDecodeError Traceback (most recent call last) in
      6               'java_train_java.txt',
      7               model_name=model_name,
----> 8               steps=1000)   # steps is max number of training steps

~/.local/lib/python3.5/site-packages/gpt_2_simple/gpt_2.py in finetune(sess, dataset, steps, model_name, model_dir, combine, batch_size, learning_rate, accumulate_gradients, restore_from, run_name, checkpoint_dir, sample_every, sample_length, sample_num, save_every, print_every, max_checkpoints, use_memory_saving_gradients, only_train_transformer_layers, overwrite)
    153 raise(fnf_error)
    154
--> 155 enc = encoder.get_encoder(checkpoint_path)
    156 hparams = model.default_hparams()
    157 with open(os.path.join(checkpoint_path, 'hparams.json')) as f:

~/.local/lib/python3.5/site-packages/gpt_2_simple/src/encoder.py in get_encoder(checkpoint_path)
    108 def get_encoder(checkpoint_path):
    109     with open(os.path.join(checkpoint_path, 'encoder.json'), 'r') as f:
--> 110         encoder = json.load(f)
    111     with open(os.path.join(checkpoint_path, 'vocab.bpe'), 'r', encoding="utf-8") as f:
    112         bpe_data = f.read()

/usr/lib/python3.5/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    266 cls=cls, object_hook=object_hook,
    267 parse_float=parse_float, parse_int=parse_int,
--> 268 parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    269
    270

/usr/lib/python3.5/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    317 parse_int is None and parse_float is None and
    318 parse_constant is None and object_pairs_hook is None and not kw):
--> 319 return _default_decoder.decode(s)
    320 if cls is None:
    321 cls = JSONDecoder

/usr/lib/python3.5/json/decoder.py in decode(self, s, _w)
    337
    338 """
--> 339 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340 end = _w(s, end).end()
    341 if end != len(s):

/usr/lib/python3.5/json/decoder.py in raw_decode(self, s, idx)
    355 obj, end = self.scan_once(s, idx)
    356 except StopIteration as err:
--> 357 raise JSONDecodeError("Expecting value", s, err.value) from None
    358 return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

opened by Rafcin 12
Improved Sampling (Nucleus Sampling)

In a recent paper, they find that beam search actually does even worse than greedy search for GPT-2 117M, and they propose a new sampling method to improve text output.

Neural Text Degradation

From what I understand, it's top-k sampling, except instead of k tokens, it takes all tokens until probability of these tokens combined is > some parameter p.

Here's an example from the paper:

🗣Context: On days when he woke early, the president liked to have coffee in the oval office. There was something about watching the sky grow lighter and lighter as you sat your pajama'd behind in the most powerful chair in the free world, sipping marine strength coffee, that you just couldn't achieve anywhere else.

💰Greedy: He'd been in the oval office for over a year now, and he'd never been so happy to be there. He'd been in the oval office for almost a year, and he'd never been so happy to be there.

🚦BeamSearch (k=4): The President was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word. He was a man of his word.

🥄Sampling (t=1.0): You couldn't be sure if that's what you were really doing, and If you decided to take the day off. The president wanted you to take the day off, but he wanted to maintain a curfew and use his influence wisely.

🇰Top-k Sampling (k=8): "I'm sorry, sir." "No, it's okay, son," the president said. "I understand." "You're going to have to make a special trip down there to get that kid. He has no idea where he's going."

⚛️Nucleus Sampling (p=0.9): But that wasn't what drew the president's attention. He'd been seated for maybe a minute when he noticed the other man. What was the guy doing here?

🗣Gold: He was therefore disagreeably surprised to find a man in an understated grey suit sitting in that selfsame chair sipping tea. The president turned around and went looking for his chief of staff.
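
The description above (keep the most likely tokens until their combined probability reaches p, then sample among them) can be sketched in plain Python. This is an illustrative sketch only, not the paper's or this package's code: nucleus_filter and sample_nucleus are hypothetical names, and the toy probability list is made up.

```python
import random


def nucleus_filter(probs, p=0.9):
    """Return {token_index: renormalized probability} for the smallest set of
    most-likely tokens whose combined probability reaches p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


def sample_nucleus(probs, p=0.9, rng=random):
    """Draw one token index from the nucleus."""
    nucleus = nucleus_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy next-token distribution over a five-token vocabulary (made up).
probs = [0.4, 0.3, 0.15, 0.1, 0.05]
print(nucleus_filter(probs, p=0.9))
print(sample_nucleus(probs, p=0.9))
```

Unlike top-k, the number of tokens kept varies with the shape of the distribution: a peaked distribution keeps very few candidates, a flat one keeps many.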

opened by bob80333 11
text generation quality for Chinese

I use colab try to train a chinese novel , but result is not actually readable as below: ======== SAMPLE 1 ======== 是将吹雾挖出的将成功探测而出。令得她不过来了。在这句她却便是张了什么处。可地云岚宗的基地本就有猜测的唶回纳，若是可见开一些他们纗地图下的威风。有什么那些自抗成形功探也是被实亀约不运的失踪，这个家伙？” “按一东西。 “以后？” 第一千两百四纳乎容其收获双翼下午床偂静以及云岚宗家族时，现在穿过尮层地死死一死的一位完全自人。若是被这位似乎么好。不过这些层地曘众而速的缘故。先前云岚宗家族与家伙破碎，也知道。” “按这些年边按一东西。” 双危得枯落双成一些纸藏。现在云岚宗家族。则是有着更是珋地的纳戒。一名一名视线成功探而来。似此如同一股落地被月地位置身给在落地墓墓吼。将三色山峰。都是在她身处的族人而出。他们。能够如何丧门两个家族事。我没有丝毫。比较给云岚宗家族身族成功压渐了过来二人。” 落地最后对此刻低低的落地。这些人吼力地双更驰在云岚宗这般种有些做完的同局一段时间。就在山脉交手吸了一圈。他们仅仅是将会从丝毫地毒间。那家伙。拥有会难以过足有山脉路。想必地实力。可怕的毒间不会速助引。” 心中现在也算是连脸色。纳戒的一道道人影击杀着自指大会回底独血之人。一个了。能够击

My training parameters are:

gpt2.finetune(sess, dataset="train.txt", model_name='345M', steps=1000, restore_from='fresh', print_every=20, sample_every=200, save_every=500)

Since GPT-2 should be very powerful for text generation, I just want to make sure whether this result quality is normal or whether there is still something I have not figured out yet.

Thnx

opened by chiangandy 8
Multi-gpu support
Added automated gpu name gathering and model partitioning across multiple GPUs

Added boolean multi-gpu signature to finetune and load_gpt2 functions.

Added CLI option for multi_gpu

Note: if layer == 10: in the model was removed so all layers are checkpointed. This can be reverted.
opened by huntrontrakkr 7
GPT2 Chatbot

Hi, I'm new to this stuff, and I'm trying to make a chatbot out of gpt2 using finetune... My question is, can I make gpt remember stuff like this:

Q: Hey, I'm Joe.
A: Hey, Joe.
Q: What's my name?
A: Joe.

Right now, it can barely remember anything I write to it. Do I just need a better more dialogue-focused dataset to finetune it on or is there something else I can do to make it remember? I'm using the colab notebook.

Thanks, I'm rather new to ML, so...

opened by ZeroMaxinumXZ 7

Not able to load the dataset

I have been trying to train the 117M model with a 1.03 GB dataset on a machine with 64 GB of RAM. But while it loads the dataset, it remains stuck there, and after about 30 minutes it just terminates. Here is the log.

Fetching checkpoint: 1.00kit [00:00, 679kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 16.5Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 573kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001:  11%|#8               | 53.6M/498M [00:00<00:07, 62.2Mit/s]
Fetching model.ckpt.data-00000-of-00001:  28%|#####3             | 141M/498M [00:01<00:03, 105Mit/s]
Fetching model.ckpt.data-00000-of-00001:  46%|########7          | 230M/498M [00:02<00:02, 108Mit/s]
Fetching model.ckpt.data-00000-of-00001:  63%|###########4      | 316M/498M [00:03<00:02, 66.6Mit/s]
Fetching model.ckpt.data-00000-of-00001:  77%|#############8    | 384M/498M [00:04<00:01, 58.8Mit/s]
Fetching model.ckpt.data-00000-of-00001:  92%|################6 | 460M/498M [00:06<00:00, 44.8Mit/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:06, 72.4Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 3.39Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 9.86Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 9.54Mit/s]
2019-05-19 16:12:23.408514: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

  0%|          | 0/1 [00:00<?, ?it/s]

I also saw another issue that suggests cutting the text file. What is the ideal file size for training? If cutting is not an option, which model size would work with a 1 GB text file?

Help will be appreciated 👍

opened by nvnvashisth 7

0 tokens when attempting to finetune using .txt

Using the Colaboratory notebook (I have not tried locally), I tried loading a normal text file and it found 0 tokens.

I've found that splitting on whitespace and turning my source file into a csv was the only way to get past this.

All of the examples reference "shakespeare.txt" but that file isn't included in the repo so I have not been able to confirm what the tool is expecting from a plaintext file.

opened by lukegalea 7
Fails to load dataset on Windows due to text encoding

On Windows 10, when attempting to run this code on my own dataset, I run into this error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2243891: character maps to <undefined>

Adding the parameter encoding='utf8' to line 33 of load_dataset.py: with open(path, 'r') as fp: appears to fix this issue. I'm not 100% sure, because now instead of erroring out immediately, it's using as much RAM as it can
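
The effect of the explicit encoding parameter can be reproduced on any platform with a small sketch. It assumes the Windows default codec is cp1252 (the actual default depends on the system locale), and read_text is a hypothetical helper, not the function in load_dataset.py.

```python
import os
import tempfile


def read_text(path, encoding="utf8"):
    """Read a dataset file with an explicit encoding instead of the OS default."""
    with open(path, "r", encoding=encoding) as fp:
        return fp.read()


# "Á" is encoded in UTF-8 as the bytes 0xC3 0x81; byte 0x81 is undefined in cp1252.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.txt")
    with open(path, "w", encoding="utf8") as fp:
        fp.write("Ávila\n")

    text = read_text(path)  # explicit UTF-8: decodes correctly
    try:
        read_text(path, encoding="cp1252")  # stands in for the assumed Windows default
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(repr(text), cp1252_failed)
```

This only addresses the decoding error; the heavy RAM usage mentioned afterwards is a separate matter that the sketch does not touch.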

I can open a pull request for this if you'd like.
bug

opened by bob80333 7
Question about loss calculation
Hi all,

I've been reading the code, trying to understand the implementation, but I have a question about the loss function that possibly is kinda dumb.

Suppose a target text: target="The fox went to the forest fast". Also suppose an input sample context=[The, fox, went, to, the, forest].

The loss is calculated as:

loss = tf.reduce_mean(
    input_tensor=tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=context[:, 1:], logits=output['logits'][:, :-1]))

As far as I know, the goal is to predict the nth token given all the previous ones (i.e. "fast", given the context). However, if we set labels as [fox, went, to, the, forest], and the logits as the [0, n-1] tokens, how can the model know the target for the last token (n)? Aren't we teaching the model to output the padded context and leaving as random the last token?
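
One way to read the slicing in that loss is to list which label each position is scored against. The sketch below is plain Python with a hypothetical helper, next_token_pairs, and it assumes a causal model in which the output at position i depends only on tokens 0..i; it is not code from the package.

```python
def next_token_pairs(tokens):
    """Pair each input position with the token it is trained to predict.

    Mirrors logits[:, :-1] (predictions at positions 0..n-2) being scored
    against labels context[:, 1:] (tokens 1..n-1).
    """
    inputs_seen = [tokens[: i + 1] for i in range(len(tokens) - 1)]
    labels = tokens[1:]
    return list(zip(inputs_seen, labels))


context = ["The", "fox", "went", "to", "the", "forest"]
for seen, label in next_token_pairs(context):
    print(" ".join(seen), "->", label)
```

Under that reading, every position is trained to predict the following token, and only the logits at the final position (after "forest") have no label inside this window, so they are dropped. "fast" becomes a label whenever a sampled window extends far enough to include it.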

Sorry for the inconvenience.
opened by AIRLegend 6
Resolved most deprecation warnings

I hate hiding warnings/errors but these ones were driving me crazy, so I went through and fixed up all the deprecation warnings I could find. There's still two (that I found) out there that I couldn't figure out how to fix, but this clears up most of them.

opened by charliekmorris 6
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this project's lead researcher Kasimir Schulz.
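
The kind of check described above (verify that every member would land inside the target directory, otherwise raise) might look like the following. This is a minimal sketch under assumptions, not the actual patch from the pull request: is_within_directory, safe_extractall and make_tar are hypothetical names, and the sketch does not examine symlink or hardlink members.

```python
import io
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members, refusing archives that would escape path."""
    for member in tar.getmembers():
        if not is_within_directory(path, os.path.join(path, member.name)):
            raise ValueError(f"Unsafe path in tar file: {member.name}")
    tar.extractall(path)


def make_tar(member_name, payload=b"data"):
    """Build a one-file tar archive in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r")


# A member named "../evil.txt" would be written outside the target directory.
try:
    safe_extractall(make_tar("../evil.txt"), path="extract_here")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

The check runs over all members before anything is written, so a rejected archive leaves no partial output. Newer Python versions also provide extraction filters in tarfile, which may be preferable where available.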

opened by TrellixVulnTeam 0
Simple question

Does this gpt-2-simple Python package work like an API, or does it run on a server? Does it run locally on our PC?

I just started working with these and I was confused...so asked this question.

THANK YOU

opened by Dsanthosh2006 1
change the decode mode from Beam search to something else

Hi, if I want to change the decode mode from Beam search to something else when I use this model, where should I change it? Looking forward to your reply

opened by symebaline 1
Converting generated data to not tokenized default version

Hi there,

Thank you for sharing this repo. My problem is that I am training on sequential data, where each word is a unique code in my txt file. GPT does not seem to have any problem understanding and generating this data; however, the data it creates are only tokens, because it tokenizes the dataset before training. Now I have tokens as output, but I need to convert the data back to use it properly. I did not have this problem with textgenrnn because it is based on chars; however, I could not run it on Colab due to dependencies. How can I map real values to generated tokens?

Thanks a lot.

opened by erenarkangel 0
How to load pretrained model in a new notebook without retraining?

Hello. I trained a model using GPT-2, which is great by the way. I am having trouble getting it to run in a new notebook. Do you have any instructions for how to do so, without retraining? Thank you.

opened by Tylersuard 1

Releases(v0.8.1)

v0.8.1(Oct 18, 2021)

Thanks to https://github.com/YaleDHLab via https://github.com/minimaxir/gpt-2-simple/pull/275, gpt-2-simple now supports TensorFlow 2 by default, and the minimum TensorFlow version is now 2.5.1! The Colab Notebook has also been updated to no longer use TensorFlow 1.X.

Note: Development on gpt-2-simple has mostly been superseded by aitextgen, which has similar AI text generation capabilities with more efficient training time and resource usage. If you do not require using TensorFlow, I recommend using aitextgen instead. Checkpoints trained using gpt-2-simple can be loaded using aitextgen as well.
v0.7.2(Feb 14, 2021)
Switched the model URL from GCP to Azure. (#253)

Pin TensorFlow 1.15 (#200)

Add checkpoint loading from other checkpoints (#175)

v0.7.1(Dec 28, 2019)

Some have successfully finetuned 774M/1558M, so the assert has been removed.
v0.7(Dec 1, 2019)
Multi-GPU support (#127) (not fully tested; will add some docs when done)

Fixed checkpoint dir bug (#134)

Added a hard assert if a TensorFlow version >= 2.0 is used (#137)

v0.6(Aug 28, 2019)
774M is explicitly blocked from being fine-tuned and will trigger an assert if attempted. If a way to finetune it without being super-painful is added, the ability to finetune it will be restored.

Allow ability to generate text from the default pretrained models by passing model_name to gpt2.load_gpt2() and gpt2.generate() (this will work with 774M).

Add sgd as an optimizer parameter to finetune (default: adam)

Support for changed model names, w/ changes more prominent in the README.

v0.5.4(Jul 29, 2019)

Merged a few PRs:

Fixed generate cmd run name: #78
Resolved most deprecation warnings: #83
Optional model parameters: #90

This does not make the package fully TF 2.0 compatible, but it's a big step!
v0.5.3(Jun 19, 2019)

Assertion was triggering false positives, so removing it.
v0.5.2(Jun 18, 2019)

Minor fix to prevent issue hit with gpt-2-cloud-run.

A goal of the release was to allow a graph reset without resetting the parameters; that did not seem to work, so holding off on that release.
v0.5.1(Jun 16, 2019)

Merged PRs (including fix for prefix issue). (see commits for more info)
v0.5(May 20, 2019)
Adapted a few functions from Neil Shepperd's fork:

Nucleus Sampling (top_p) when generating text, which produces surprisingly different results (setting top_p=0.9 works well). Supersedes top_k when used. (#51)

An encode_dataset() function to preencode and compress a large dataset before loading it for finetuning. (#19, #54)

Improvements to continuing model training:

overwrite argument for finetune: with restore_from="latest", this continues model training without creating a duplicate copy of the model, and is therefore good for transfer learning using multiple datasets (#20)

You can continue to finetune a model without having the original GPT-2 model present.

Improvements with I/O involving Colaboratory

Checkpoint folders are now packaged into a .tar file when copying to Google Drive, and when copying from Google Drive, the '.tar' file is automatically unpackaged into the correct checkpoint format. (you can pass copy_folder=True to the copy_checkpoint function to revert to the old behavior). (#37: thanks @woctezuma !)

copy_checkpoint_to_gdrive and copy_checkpoint_from_gdrive now take a run_name argument instead of a checkpoint_folder argument.

Miscellaneous

Added CLI arguments for top_k, top_p, overwrite.

Cleaned up redundant function parameters (#39)

v0.4.2(May 5, 2019)
load_gpt2() in a fresh session is much faster and uses much less memory when loaded. (for the 117M model, the system will stay under 2 GB RAM, which is the critical point for cloud services)

start_tf_sess() now accepts a threads parameter, which is useful if you know exactly how many threads will be used.

v0.4.1(May 5, 2019)

Number of CSV tokens was inadvertently doubled. (#25)
v0.4(May 5, 2019)
Support the 345M model (thanks to Neil Shepperd for the gradient checkpointing implementation!)

Support model_name in the CLI for above support

Support run_name in the CLI

Support .csv files as an input dataset to finetune (will parse the CSV as if it was done via encode_csv()).

Fix one off issues (#21)

v0.3.1(Apr 23, 2019)
Fix one-off error where checkpoint saved a step early.

Fix issue where restore_from='fresh' uses the counter from a previously-trained checkpoint.

If restore_from='latest', steps will now train for the specified number of steps, instead of training until the specified number of steps is reached. (#13, #14)
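
The difference between the two step semantics in the last item can be shown with simple arithmetic. This is a hypothetical illustration, not code from the package: stopping_step is an invented helper, and treating the earlier behavior as "no further training once the counter has passed steps" is an assumption.

```python
def stopping_step(checkpoint_counter, steps, relative=True):
    """Return the global step at which finetuning would stop.

    relative=True  -> train for `steps` additional steps (behavior described above)
    relative=False -> train until the counter reaches `steps` (assumed earlier behavior)
    """
    if relative:
        return checkpoint_counter + steps
    return max(checkpoint_counter, steps)


# Resuming a checkpoint saved at step 1000 with steps=500.
print(stopping_step(1000, 500))
print(stopping_step(1000, 500, relative=False))
```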

v0.3(Apr 21, 2019)
Added a basic CLI.

Added an include_prefix parameter to give an option to exclude the input prefix.

Improved regex for truncation.

v0.2(Apr 20, 2019)
is_gpt2_downloaded: Check if the model is downloaded.

encode_csv: Convert a CSV to a format suitable for GPT-2.

v0.1(Apr 19, 2019)


Owner

Max Woolf

Data Scientist @buzzfeed. Plotter of pretty charts.
