Skip to content

L-Zhe/FasySeq

Repository files navigation

FasySeq

FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which can be trained efficiently and modified easily. This toolkit is based on Transformer(Vaswani et al.), and will add more seq2seq models in the future.

Dependency

PyTorch >= 1.4
NLTK

Result

Train Dataset Test Dataset BLEU checkpoint
WMT-14 newstest2013-en2de 25.83 url
WMT-14 newstest2014-en2de 27.07 url
WMT-14 newstest2013-de2en 31.14 url
WMT-14 newstest2014-de2en 30.80 url

Structure

...

To Be Updated

  • top-k and top-p sampling
  • multi-GPU inference
  • length penalty in beam search
  • ...

Preprocess

Build Vocabulary

createVocab.py

NamedArguments Description
-f/--file The files used to build the vocabulary.
Type: List
--vocab_num The maximum size of vocabulary, the excess word will be discard according to the frequency.
Type: Int Default: -1
--min_freq The minimum frequency of token in vocabulary. The word with frequency less than min_freq will be discard.
Type: Int Default: 0
--lower Whether to convert all words to lowercase
--save_path The path to save voacbulary.
Type: str

Process Data

preprocess.py

NamedArguments Description
--source The path of source file.
Type: str
[--target] The path of target file.
Type: str
--src_vocab The path of source vocabulary.
Type: str
[--tgt_vocab] The path of target vocabulary.
Type: str
--save_path The path to save the processed data.
Type: str

Train

train.py

NamedArguments Description
Model -
--share_embed Source and target share the same vocabulary and word embedding. The max position of embedding is max(max_src_position, max_tgt_position) if the model employ share embedding.
--max_src_position The maximum source position, all src-tgt pairs which source sentences' lenght are greater than max_src_position will be cut or discard. If max_src_position > max source length, it wil be set to max source length.
Type: Int Default: inf
--max_tgt_position The maximum target position, all src_tgt pairs which target sentences' length are greater than max_tgt_position will be cut or discard. If max_tgt_position > max target length, it wil be set to max target length.
Type: Int Default: inf
--position_method The method to introduce positional information.
Option: encoding/embedding
--normalize_before Leveraging before layer normalization. See Xiong et al.
Checkpoint -
--checkpoint_path The path to save checkpoint file.
Type: str Default: None
--restore_file The checkpoint file to be loaded.
Type: str Default: None
--checkpoint_num Save the nearest checkpoint_num breakpoint.
Type: Int Default: inf
Data -
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--file The training data file.
Type: str
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--discard_invalid_data The data which length of source or data is more than maximum position will be discard if use this option, otherwise the long sentences will be cut into max position.
Train -
--cuda_num The device's ID of GPU.
Type: List
--grad_accumulate The num of gradient accumulate.
Type: Int Default: 1
--epoch The total epoch to train.
Type: Int Default: inf
--batch_print_info The number of batch to print training information.
Type: Int Default: 1000

Inference

generator.py

NamedArguments Description
--cuda_num The device's ID of GPU.
Type: List
--file The inference data file which has been processed.
Type: str
--raw_file The raw inference data file, and will be preprocessed before generated.
Type: str
--ref_file The reference file.
Type: str
--max_length
--max_alpha
--max_add_token
Maximum generated length = min(max_length, max_alpha * max_src_len, max_add_token + max_src_token)
Type: Int Default: inf
--max_tokens The maximum tokens in each batch.
Type: Int Default: 1000
--src_vocab Source vocabulary path.
Type: str Default: None
--tgt_vocab Target vocabulary path.
Type: str Default: None
--vocab Vocabulary path. If you use share embedding, the vocabulary will be loaded from this path.
Type: str Default: None
--model_path The path of pre-trained model.
Type: str
--output_path The path of output. the result will be saved into output_path/result.txt.
Type: str
--decode_method The decode method.
Option:greedy/beam
--beam Beam size.
Type: Int Default: 5

Postpreposs

avg_param.py

The average parameter code we employed is the same as fairseq.

License

FasySeq(-py) is Apache-2.0 License. The license applies to the pre-trained models as well.