LightSeq is a high performance training and inference library for sequence processing and generation implemented in CUDA

Overview

LightSeq: A High Performance Library for Sequence Processing and Generation

logo

[2021/06/18] πŸŽ‰ πŸŽ‰ πŸŽ‰ LightSeq now supports fast training for models in the Transformer family; please check out here for details.


LightSeq is a high performance training and inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT, Transformer, etc. It is therefore well suited for Machine Translation, Text Generation, Dialogue, Language Modeling, Sentiment Analysis, and other tasks on sequence data.

The library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel functions which are specially fused and optimized for the Transformer model family. In addition to model components, the inference library also provides an easy-to-deploy model management and serving backend based on the TensorRT Inference Server. With LightSeq, one can easily develop a modified Transformer architecture with little additional code.
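For illustration, here is a minimal sketch (not an official example) of building a single LightSeq Transformer encoder layer with the training ops. The configuration fields and forward call mirror the LSTransformerEncoderLayer usage shown in the community benchmark snippet further down this page; the concrete sizes are placeholder values:

import torch
from lightseq.training.ops.pytorch.transformer_encoder_layer import LSTransformerEncoderLayer

# Placeholder sizes; a CUDA-capable GPU is assumed.
config = LSTransformerEncoderLayer.get_config(
    max_batch_tokens=4096,          # upper bound on batch_size * seq_len
    max_seq_len=256,
    hidden_size=512,
    intermediate_size=2048,
    nhead=8,
    attn_prob_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    activation_dropout_ratio=0.1,
    pre_layer_norm=True,
    fp16=False,
    local_rank=0,
    activation_fn="relu",
)
layer = LSTransformerEncoderLayer(config, initial_weights=None, initial_biases=None).cuda()

x = torch.randn(8, 256, 512, device="cuda")     # (batch, seq_len, hidden)
mask = torch.zeros(8, 256, 256, device="cuda")  # all-zero mask = no padding, following the snippet below
out = layer(x, mask)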

Features

>>> Training

The following is a support matrix of LightSeq training library compared with DeepSpeed.

(feature support matrix image)

>>> Inference

The following is a support matrix of LightSeq inference library compared with TurboTransformers and FasterTransformer.

(inference support matrix image)

Performance

>>> Training

Here we present experimental results on the WMT14 English-to-German translation task based on Transformer-big models. We train Transformer models of different sizes on eight NVIDIA Tesla V100 or NVIDIA Ampere A100 GPUs with data parallelism and fp16 mixed precision. Fairseq with Apex is chosen as our baseline.

We compute the speedup at different batch sizes using the WPS (real words per second) metric.
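For clarity, the speedup is simply the ratio of the two throughputs; a trivial sketch with made-up WPS numbers (not measured results):

# Illustrative only: speedup = LightSeq WPS / baseline WPS.
def speedup(lightseq_wps: float, baseline_wps: float) -> float:
    return lightseq_wps / baseline_wps

print(f"{speedup(150000, 100000):.2f}x")  # placeholder values -> 1.50x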

More results are available here.

>>> Inference

Here we present experimental results on neural machine translation based on Transformer-base models using beam search. We choose TensorFlow and FasterTransformer for comparison. The implementation from tensor2tensor was used as the TensorFlow benchmark.

More results are available here.

Quick Start

Fast training from Fairseq

You can experience lightning-fast training by running the following commands. First, install these requirements:

pip install lightseq fairseq sacremoses

Then you can train on the WMT14 en2de translation task by running the following script:

sh examples/training/fairseq/ls_fairseq_wmt14en2de.sh

To compare LightSeq with Fairseq, delete the arguments with the ls_ prefix to use the original Fairseq implementation.
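For intuition, the ls_-prefixed arguments swap Fairseq modules for their LightSeq counterparts (model architecture, optimizer, and criterion). The flags below are only an illustrative guess; check ls_fairseq_wmt14en2de.sh for the exact names the script passes:

# Hypothetical excerpt -- see examples/training/fairseq/ls_fairseq_wmt14en2de.sh for the real flags
lightseq-train $DATA_DIR \
    --arch ls_transformer_wmt_en_de_big_t2t \
    --optimizer ls_adam \
    --criterion ls_label_smoothed_cross_entropy

# Dropping the ls_ prefixes falls back to the native Fairseq implementation:
fairseq-train $DATA_DIR \
    --arch transformer_wmt_en_de_big_t2t \
    --optimizer adam \
    --criterion label_smoothed_cross_entropy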

More usage is available here.

Fast inference from HuggingFace BART

We provide an end-to-end bart-base example to show how fast LightSeq is compared to HuggingFace. First, install these requirements:

pip install torch tensorflow transformers lightseq
cd examples/inference/python

Then you can check the performance by simply running the following commands. hf_bart_export.py is used to convert the PyTorch weights to the LightSeq protobuf format.

python hf_bart_export.py
python ls_bart.py

LightSeq installation from PyPI only supports Python 3.6 to 3.8 on Linux for now. Consider compiling from source if you use another environment.
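As a rough sketch of what ls_bart.py does (not the script itself), the exported file can be loaded with the lightseq.inference API mentioned in the v2.0.0 release notes below; the tokenizer choice and the infer call here are assumptions that should be checked against examples/inference/python:

import lightseq.inference as lsi
from transformers import BartTokenizer

# Assumes hf_bart_export.py has already written lightseq_bart_base.hdf5 to the working directory.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = lsi.Transformer("lightseq_bart_base.hdf5", 128)  # second argument: maximum batch size

ids = tokenizer(["I love that girl, but <mask> does not <mask> me."], return_tensors="np")["input_ids"]
outputs = model.infer(ids)  # beam search on the GPU; see ls_bart.py for decoding the output ids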

More usage is available here.

Cite Us

If you use LightSeq in your research, please cite the following paper.

@InProceedings{wang2021lightseq,
    title = "{L}ight{S}eq: A High Performance Inference Library for Transformers",
    author = "Wang, Xiaohui and Xiong, Ying and Wei, Yang and Wang, Mingxuan and Li, Lei",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers (NAACL-HLT)",
    month = jun,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    pages = "113--120",
}

Contact

For any questions or suggestions, please feel free to contact us at [email protected], [email protected], [email protected], [email protected], [email protected], [email protected].

Comments
  • RuntimeError: Parse weights from [lightseq_bart_base.hdf5] failed

    When I tried to run the example case like this

    python hf_bart_export.py
    python ls_bart.py
    

    It raised some errors:

    initializing bart tokenizer...
    creating lightseq model...
    Traceback (most recent call last):
      File "ls_bart.py", line 102, in <module>
        main()
      File "ls_bart.py", line 69, in main
        ls_model = lsi.Transformer("lightseq_bart_base.hdf5", 128)
    RuntimeError: Parse weights from [lightseq_bart_base.hdf5] failed.
    

    Alright, I tried to run another case, the huggingface gpt2 example:

    python hf_gpt2_export.py
    python ls_gpt.py
    

    It raised an error again:

    initializing gpt tokenizer...
    Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.04M/1.04M [00:00<00:00, 1.81MB/s]
    Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 456k/456k [00:00<00:00, 1.36MB/s]
    Downloading: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.36M/1.36M [00:00<00:00, 2.29MB/s]
    lightseq tokenizer pad token id: 50257
    huggingface tokenizer pad token id: 50256
    creating lightseq model...
    Traceback (most recent call last):
      File "ls_gpt.py", line 119, in <module>
        main()
      File "ls_gpt.py", line 79, in main
        ls_model = lsi.Gpt("lightseq_gpt2_base.hdf5", max_batch_size=16)
    TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
        1. lightseq.inference.Gpt(weight_path: str, max_batch_size: int, max_step: int)
    
    Invoked with: 'lightseq_gpt2_base.hdf5'; kwargs: max_batch_size=16
    

    I don't know how to fix them. Can you give me some advice? Thank you very much.

    opened by juha0 21
  • lightseq inference abnormal using ls_fs_transformer_export.py exported model

    Hi, I used python export/ls_fs_transformer_export.py to export a LightSeq-trained NMT model for inference, but I found the result is quite abnormal. Here are some details output by the test part of ls_fs_transformer_export.py.

    generator config beam size: 4 extra decode length(max decode length - src input length): 50 length penalty: 0.6 diverse lambda: 0 sampling method: beam_search topk: 1 topp: 0.75 Allocated 882MB GPU buffer for transformer decoder buffer init start decoder buffer init succeed pb results: (array([[[ 4, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 6]]], dtype=int32), array([[0.]], dtype=float32)) hdf5 results: (array([[[ 4, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 6]]], dtype=int32), array([[0.]], dtype=float32))

    I also tested more examples, and it continued to generate some repeated logits, and when I decoded the array with my tgt_dict, it generated something like this:

    thesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesame.

    I used fairseq 0.10.2 and lightseq 2.1.4, and the lightseq-generate result seems normal. I think maybe something went wrong in the export procedure. Looking forward to your reply.

    opened by dearchill 18
  • fix pos embedding index bug

    Fixed the implementation of position embedding:

    • the size of position matrix is determined by max_positions parameter
    • ignore all the padding tokens when calculating the token position
    • the position index begin from padding_idx + 1, consistent with fairseq implementation
    opened by nomadlx 13
  • No acceleration compared with timm vit block

    I used the code below to test the ViT block speed. The output shows that the speed is almost the same between PyTorch (timm) and LightSeq.

    Did I miss something?

    Output for forward only:

    timm finished 500 running, avg_time: 76.379987 ms
    light_seq finished 500 running, avg_time: 75.543549 ms

    The output for forward + backward:

    timm finished 500 running, avg_time: 228.803998 ms
    light_seq finished 500 running, avg_time: 227.007331 ms

    from timm.models.vision_transformer import Block
    from lightseq.training.ops.pytorch.transformer_encoder_layer import LSTransformerEncoderLayer
    from easydict import EasyDict as edict
    import torch.nn as nn
    import torch
    import time
    import sys
    sys.path.append('./')
    
    
    torch.backends.cudnn.benchmark = True
    
    
    def generate_dummy_data(args):
        inputs = torch.randn([args.bs, args.num_token, args.dim]).cuda()
        return (inputs, )
    
    
    def get_timm_block(args):
        return Block(
            dim=args.dim,
            num_heads=args.num_heads,
            mlp_ratio=args.mlp_ratio,
            qkv_bias=False,
            drop=False,
            attn_drop=False,
            init_values=None,
            drop_path=0,
            act_layer=nn.GELU,
            norm_layer=nn.LayerNorm
        )
    
    class LSBlockWrapper(LSTransformerEncoderLayer):
        def forward(self, x):
            B, N, C = x.shape
            mask = torch.zeros([B, N, N], device=x.device, dtype=x.dtype)
            return super().forward(x, mask)
    
    def get_ls_block(args):
        config = LSBlockWrapper.get_config(
            max_batch_tokens=args.num_token * args.bs,
            max_seq_len=args.num_token,
            hidden_size=args.dim,
            intermediate_size=int(args.mlp_ratio * args.dim),
            nhead=args.num_heads,
            attn_prob_dropout_ratio=0,
            hidden_dropout_ratio=0,
            activation_dropout_ratio=0,
            pre_layer_norm=True,
            fp16=False,
            local_rank=0,
            activation_fn='gelu')
        return LSBlockWrapper(
                config=config,
                initial_weights=None,
                initial_biases=None
            )
    
    
    def run(module, args, name='Unknown'):
        inputs = generate_dummy_data(args)
    
        # cudnn warmup
        for _ in range(50):
            if args.backward:
                module(*inputs).sum().backward()
            else:
                module(*inputs)
    
        torch.cuda.synchronize()
        t0 = time.time()
    
        for _ in range(args.num_iter):
            if args.backward:
                module(*inputs).sum().backward()
            else:
                module(*inputs)
    
        torch.cuda.synchronize()
        t1 = time.time()
    
        avg_time = (t1 - t0) * 1000 / args.num_iter
        print(
            f'>>> {name} finished {args.num_iter} running, avg_time: {avg_time:.6f} ms')
        return avg_time
    
    
    def main():
        args = edict()
        args.num_iter = 500
        args.backward = False
    
        args.bs = 128
        args.dim = 1280
        args.num_heads = 16
        args.mlp_ratio = 4.0
        args.num_token = 256
    
        timm_block = get_timm_block(args).cuda()
        ls_block = get_ls_block(args).cuda()
    
        run(timm_block, args, name='timm')
        run(ls_block, args, name='light_seq')
    
        print('Finished.')
    
    if __name__ == '__main__':
        main()
    
    opened by woolpeeker 11
  • Gpt exceeds maximum protobuf size of 2GB: 3096122166

    When I use lightseq (2.0) to export gpt2-large, it raises an error: ValueError: Message Gpt exceeds maximum protobuf size of 2GB: 3096122166

    hf_gpt2_export.py is as follows:

    
    if __name__ == "__main__":
        output_lightseq_model_name = "lightseq_gpt2_large.pb"
        input_huggingface_gpt_model = "gpt2-large"
        head_number = 36
        # generation_method should be "topk" or "topp"
        generation_method = "topk"
        topk = 1
        topp = 0.75
        # default eos_id from https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel
        eos_id = 50256
        pad_id = 50257
        extract_gpt_weights(
            output_lightseq_model_name,
            input_huggingface_gpt_model,
            head_num=head_number,  # layer number
            generation_method=generation_method,
            topk=topk,
            topp=topp,
            eos_id=eos_id,
            pad_id=pad_id,
        )
    
    
    ['transformer.h.34.mlp.c_proj.bias'] -> ffn_second_bias, shape: (1280,), convert finished.
    ['transformer.h.35.ln_1.weight'] -> multihead_norm_scale, shape: (1280,), convert finished.
    ['transformer.h.35.ln_1.bias'] -> multihead_norm_bias, shape: (1280,), convert finished.
    ['transformer.h.35.attn.c_attn.weight'] -> multihead_project_kernel_qkv, shape: (1280, 3840), convert finished.
    ['transformer.h.35.attn.c_attn.bias'] -> multihead_project_bias_qkv, shape: (3840,), convert finished.
    ['transformer.h.35.attn.c_proj.weight'] -> multihead_project_kernel_output, shape: (1280, 1280), convert finished.
    ['transformer.h.35.attn.c_proj.bias'] -> multihead_project_bias_output, shape: (1280,), convert finished.
    ['transformer.h.35.ln_2.weight'] -> ffn_norm_scale, shape: (1280,), convert finished.
    ['transformer.h.35.ln_2.bias'] -> ffn_norm_bias, shape: (1280,), convert finished.
    ['transformer.h.35.mlp.c_fc.weight'] -> ffn_first_kernel, shape: (1280, 5120), convert finished.
    ['transformer.h.35.mlp.c_fc.bias'] -> ffn_first_bias, shape: (5120,), convert finished.
    ['transformer.h.35.mlp.c_proj.weight'] -> ffn_second_kernel, shape: (5120, 1280), convert finished.
    ['transformer.h.35.mlp.c_proj.bias'] -> ffn_second_bias, shape: (1280,), convert finished.
    ['transformer.ln_f.weight'] -> norm_scale, shape: (1280,), convert finished.
    ['transformer.ln_f.bias'] -> norm_bias, shape: (1280,), convert finished.
    ['transformer.wte.weight'] -> token_embedding, shape: (50257, 1280), convert finished.
    ['transformer.wpe.weight'] -> position_embedding, shape: (1024, 1280), convert finished.
    Wrting to lightseq_gpt2_large.pb
    Traceback (most recent call last):
      File "hf_gpt2_export.py", line 127, in <module>
        pad_id=pad_id,
      File "hf_gpt2_export.py", line 100, in extract_gpt_weights
        fout.write(gpt.SerializeToString())
    ValueError: Message Gpt exceeds maximum protobuf size of 2GB: 3096122166
    
    opened by zmingshi 8
  • [CUDA][ERROR]: misaligned address

    Hi, I have one question. When a large amount of text is sent to the model, it runs properly at first. After the model has run for a period of time, the program reports an error: [CUDA][ERROR] /tmp/build-via-sdist-uagdfpbf/lightseq-2.2.1/lightseq/inference/pywrapper/gpt.cc.cu(160): misaligned address.

    opened by fc20567 6
  • Questions about beam search

    Hi guys,

    Two questions related to beam search confused me, and I am looking forward to your reply 😊.

    1. Is your beam search the same as in T2T?
    2. Does length_penalty == 1.0 mean no length penalty?

    Thx

    opened by gongel 6
  • Can you provide a docker file that can test training and inference code the lightseq?

    I tried to set up LightSeq on a Docker system (4-way RTX 2080 Ti or 2-way A100) but failed to get it working after 8 hours.

    Therefore, please upload a Dockerfile or images for testing the LightSeq system.

    (I tested based on the images nvcr.io/nvidia/pytorch:21.08, 20.12, 20.10, taka23/lightseq, etc., but didn't succeed.)

    opened by pdh930105 6
  • Support for VIT-small (hidden_dim=384)

    Hello, thank you for your contribution. I want to replace the encoders in ViT-small with LSHFTransformerEncoderLayer. For each encoder, num_attention_heads = 6 and hidden_dim = 384. However, there is an error saying that hidden_dim must be an integer multiple of 256. Why does LSHFTransformerEncoderLayer have this restriction? Is there any way to use LSHFTransformerEncoderLayer in ViT-small? Correct me if I am wrong. Thanks!

    opened by woskii 6
  • Lightseq model inference for fairseq task after training

    Hi, I could not find any details about LightSeq model inference for a Fairseq task after training. Did I miss something? I mean, after training the model arch is ls_transformer, so I can't use the native fairseq-generate command for inference, and I don't find anything like lightseq-generate. The inference examples I found are for HuggingFace models such as BART and GPT-2, and no documentation is provided for inference with Fairseq models after training. Could someone tell me how to do this?

    opened by dearchill 6
  • Example/Support of converting Fairseq Model to run in LightSeq

    I am curious about trying LightSeq to speed up inference for a vanilla Transformer encoder-decoder (Vaswani et al., 2017) model. My original model was trained with Fairseq (or OpenNMT-py). Is there any example or reference that can help me convert my Transformer model into a format compatible with LightSeq?

    opened by pttzty 6
  • [Question]: How to compile lightseq

    I tried to compile LightSeq using build.sh, but ran into the following problems:

    lightseq/csrc/proto/bert_weight.cc:451:15: error: β€˜class Bert’ has no member named β€˜ParseFromIstream’; did you mean β€˜ParseFromString’?
         if (!bert.ParseFromIstream(&raw_input)) {
                   ^~~~~~~~~~~~~~~~
                   ParseFromString
    
    lightseq/csrc/proto/bert_crf_weight.cc:38:37: error: no match for β€˜operator[]’ (operand types are β€˜const google::protobuf::RepeatedPtrField<BertCrfEncoderLayer>’ and β€˜int’)
       _inner_size = bert.encoder_stack()[0].ffn_first_kernel_size() / _hidden_size;
    

    Both the master branch and the v3.0.1 tag failed.

    Did I miss something? How can I manage to compile this project?

    opened by FrostML 0
  • Possible memory leak in DecSelfAttentionLayer

    The constructor creates new objects without shared_ptrs, but the destructor is empty.

    In cpp:

    DecSelfAttentionLayer<T1, T2>::DecSelfAttentionLayer(
        int layer_id, int max_batch_tokens, int max_seq_len, int hidden_size,
        int num_heads, float attn_prob_dropout_ratio,
        float hidden_output_dropout_ratio, bool pre_or_postLayerNorm,
        bool is_post_ln, bool is_continuous_cache)
        : Layer("DecSelfAttentionLayer"),  // necessary
          _layer_id(layer_id),
          _max_batch_tokens(max_batch_tokens),
    
         ..............................
          // operators
          _attn_ln(
              new LayerNormalizeOp<T1, T2>(max_batch_tokens, hidden_size, false)),
    

    In header: virtual ~DecSelfAttentionLayer() {}

    Not sure if this is by design or if delete calls are missing from the destructor.

    opened by Kangmo 1
  • Question : About construction of total_cache_k, total_cache_v in Transformer

    In lightseq/csrc/models/transformer.cu, should cache_k_out and cache_v_out call set_ancestor? Otherwise, why not remove the unused variables cache_k_out and cache_v_out?

    Transformer::Transformer {
      ...
      for (auto iter : dec_layer_vec) {
        Variable *cache_k = new Variable("cache_k");
        Variable *cache_v = new Variable("cache_v");
        std::tuple<Variable *, Variable *, Variable *> dec_outs =
            (*iter)(dec_emb, total_enc_kv, pad_mask, cache_k, cache_v);
        dec_emb = std::get<0>(dec_outs);
        Variable *cache_k_out = std::get<1>(dec_outs);
        Variable *cache_v_out = std::get<2>(dec_outs);
    
        cache_k->set_ancestor(total_cache_k, cache_size * dec_layer_idx);
        cache_v->set_ancestor(total_cache_v, cache_size * dec_layer_idx);
        dec_layer_idx++;
      }
    

    https://github.com/bytedance/lightseq/blob/2b5592fa658a39a914a5036e665647084d777903/lightseq/csrc/models/transformer.cu#L135

    opened by Kangmo 3
  • LinearOp::forward is getting cublashandle before checking if the context is built.

    Problem: LinearOp::forward gets the cublashandle without checking whether the context is built, while LinearOp::backward checks whether the context is built before getting the cublashandle.

    Solution: modify LinearOp::forward to check whether the context is built before getting the cublashandle.

    opened by Kangmo 0
  • How to ensemble lightseq models? & the memory usage is too big when generating

    I ran into the following two problems when using LightSeq 3.0.

    1. I pass --path model1:model2 to ensemble model1 and model2 for generation just like fairseq-generate:
    lightseq-generate $DATA_PATH \
        --path part_1/checkpoint_4_267500.pt:part_1/checkpoint_4_265000.pt \
        --batch-size 4 --beam 4 --remove-bpe \
        --gen-subset ${name} \
        --source-lang en \
        --target-lang zh \
        --max-len-a 1 \
        --max-len-b 50 \
        --lenpen 0.6 --fp16
    

    but the operation fails partway through with the following error (the checkpoints are from the same model): (screenshot)

    Could you please provide an example of ensembling?

    2. When I use lightseq-generate for generation, I found that 10GB of memory is required to load a transformer_big model with LightSeq, while only 2GB is required to load the same model with Fairseq. I wonder if this is expected?

    This is loading a lightseq transformer_big model: (screenshot)

    This is loading a fairseq transformer_big model: (screenshot)

    Environment

    • Python 3.7
    • pytorch 1.12
    • fairseq 0.10.2
    • lightseq 3.0
    opened by baoguo1995 0
Releases (v2.2.1)
  • v2.2.1(Dec 6, 2022)

    In the hip_dev branch, LightSeq supports both the CUDA backend and the HIP backend (the HIP backend currently supports training only). The LightSeq transformer has a speedup of about 7% compared with the Fairseq transformer under the HIP backend. LightSeq HIP supports multiple NLP models, such as transformer, bert, gpt, etc. Users need no modification to their Python training code. More information about LightSeq HIP can be found here: https://github.com/bytedance/lightseq/blob/hip_dev/README_HIP.md

    Source code(tar.gz)
    Source code(zip)
  • v3.0.1(Nov 2, 2022)

    What's Changed

    • compatible gcq params by @HandH1998 in https://github.com/bytedance/lightseq/pull/409
    • Fix gpu name by @godweiyang in https://github.com/bytedance/lightseq/pull/415

    Full Changelog: https://github.com/bytedance/lightseq/compare/v3.0.0...v3.0.1

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0(Oct 25, 2022)

    It's been a long time since our last release (v2.2.0). For the past year, we have focused on int8 quantization.

    In this release, LightSeq supports int8 quantized training and inference. Compared with PyTorch QAT, LightSeq int8 training achieves a speedup of 3x without any performance loss. Compared with the previous LightSeq fp16 inference, the int8 engine achieves a speedup of up to 1.7x.

    The LightSeq int8 engine supports multiple models, such as Transformer, BERT, GPT, etc. For int8 training, users only need to apply quantization mode to the model using model.apply(enable_quant). For int8 inference, users only need to use QuantTransformer instead of the fp16 Transformer.

    Other changes in this release include support for models like MoE, bug fixes, performance improvements, etc.

    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Oct 26, 2021)

    Inference

    Support more multi-language models #209

    Fixes

    Fix inference error on HDF5 #208
    Fix training error when batch_size=1 #192
    Other minor fixes: #205 #202 #193

    Source code(tar.gz)
    Source code(zip)
  • v2.1.3(Aug 19, 2021)

    This version contains several features and bug fixes.

    Training

    relax restriction of layer norm hidden size #137 #161
    support inference during training for transformer #141 #146 #147

    Inference

    Add inference support and examples for BERT #145

    Fixes

    fix save/load for training with pytorch #139
    fix pos embedding index bug #144

    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(Jul 19, 2021)

    This version contains several features and bug fixes.

    Training

    support BertEncoder #116
    support torch amp and apex amp #100

    Inference

    support big models like gpt2-large and bart-large #82

    Fixes

    fix adam bug when param size < 1024 #98
    fix training compiling fail in cuda < 11 #80

    Source code(tar.gz)
    Source code(zip)
  • v2.0.2(Jun 25, 2021)

  • v2.0.1(Jun 24, 2021)

  • v2.0.0(Jun 20, 2021)

    It's been a long time since our last release (v1.2.0). For the past six months, we have focused on training efficiency.

    In this release, LightSeq supports fast training for models in the Transformer family!

    We provide highly optimized custom operators for PyTorch and TensorFlow, which cover the entire training process for Transformer-based models. Users of LightSeq can use these operators to build their own models with efficient computation.

    In addition, we integrate our custom operators into popular training libraries like Fairseq, Hugging Face, and NeurST, which enables a 1.5x-3x end-to-end speedup compared to the native version.

    With only a small amount of code, you can enjoy the excellent performance provided by LightSeq. Try it now!

    Training

    • support lightseq-train to accelerate fairseq training, including optimized transformer model, adam, and label smoothed loss
    • huggingface bert training example
    • neurst transformer training example for Tensorflow users

    Inference

    • support GPT python wrapper
    • inference APIs are moved to lightseq.inference

    This release includes an API change for inference: all inference APIs have moved to lightseq.inference. For example, use import lightseq.inference and model = lightseq.inference.Transformer("$PB_PATH", max_batch_size).

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Dec 24, 2020)

  • v1.1.0(Oct 29, 2020)

  • v1.0.0(Dec 6, 2019)

Owner
Bytedance Inc.