Sentence Embeddings with BERT & XLNet

Overview

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and are tuned specifically to produce meaningful sentence embeddings, such that sentences with similar meanings are close in vector space.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net, as well as our publications listed under Citing & Authors below.

Installation

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher and transformers v3.1.0 or higher. The code does not work with Python 2.7.

Install with pip

Install the sentence-transformers with pip:

pip install -U sentence-transformers

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA: If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
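If you have a CUDA-enabled PyTorch installation, you can pass the device explicitly when loading a model. A minimal sketch (the model name matches the Getting Started example below):

import torch
from sentence_transformers import SentenceTransformer

# Use the GPU if PyTorch was installed with a matching CUDA version, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer('paraphrase-distilroberta-base-v1', device=device)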

Getting Started

See the Quickstart in our documentation.

This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of strings.',
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

And that's it. We now have a list of numpy arrays with the embeddings.

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').

» Full list of pretrained models

Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get the best sentence embeddings for your specific task.

See the Training Overview for an introduction to training your own embedding models. We provide various examples of how to train models on a wide range of datasets. A minimal training sketch follows the highlights below.

Some highlights are:

  • Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
  • Multi-Lingual and multi-task learning
  • Evaluation during training to find optimal model
  • 10+ loss functions that allow tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, and more.
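As a minimal sketch of the training API (the training pairs below are made up for illustration), fine-tuning with a similarity-score loss looks roughly like this:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Toy training pairs with similarity labels in [0, 1] (illustrative only)
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A plane is taking off.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# One pass over the data with a short warmup phase
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)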

Performance

Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed.

Model                                STS benchmark   SentEval
Avg. GloVe embeddings                58.02           81.52
BERT-as-a-service avg. embeddings    46.35           84.04
BERT-as-a-service CLS-vector         16.50           84.66
InferSent - GloVe                    68.03           85.59
Universal Sentence Encoder           74.92           85.10
Sentence Transformer Models
nli-bert-base                        77.12           86.37
nli-bert-large                       79.19           87.78
stsb-bert-base                       85.14           86.07
stsb-bert-large                      85.29           86.66
stsb-roberta-base                    85.44           -
stsb-roberta-large                   86.39           -
stsb-distilbert-base                 85.16           -
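As a rough sketch of how such STS-style scores can be computed for your own model (the sentence pairs and gold scores below are placeholders), you can use the EmbeddingSimilarityEvaluator:

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('stsb-roberta-base')

# Placeholder STS-style pairs with gold similarity scores in [0, 1]
dev_examples = [
    InputExample(texts=['A man is playing guitar.', 'A person plays an instrument.'], label=0.8),
    InputExample(texts=['A man is playing guitar.', 'A dog runs in the park.'], label=0.1),
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name='sts-dev')
print(evaluator(model))  # correlation between the model's cosine similarities and the gold scores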

Application Examples

You can use this framework for many use cases, for example semantic textual similarity, semantic search, clustering, and paraphrase mining.

For all examples, see examples/applications.
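For instance, a minimal semantic search sketch with util.semantic_search (the corpus and query are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

corpus = ['A man is eating food.', 'A woman is playing violin.', 'Two men pushed carts through the woods.']
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode('Somebody is making music.', convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:  # hits[0] holds the results for the first (and only) query
    print(corpus[hit['corpus_id']], hit['score'])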

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@article{thakur-2020-AugSBERT,
    title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2010.08240",
    month = "10",
    year = "2020",
    url = "https://arxiv.org/abs/2010.08240",
}

The main contributors of this repository are:

Contact person: Nils Reimers, [email protected]

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Issues
  • Is it Multilingual?

    Hello,

    This might be a stupid question, but I wanted to know whether I can use the clustering on German sentences. Will it work with a pre-trained model, or do I need to train it on German data first?

    Thanks.

    opened by SouravDutta91 41
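    A rough sketch of the multilingual route (the model name is one of the multilingual models mentioned in this repository; the German sentences and the k-means clustering with scikit-learn are illustrative):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Multilingual model that also covers German
    model = SentenceTransformer('distiluse-base-multilingual-cased-v1')

    german_sentences = ['Der Hund spielt im Garten.',
                        'Eine Katze schläft auf dem Sofa.',
                        'Ich koche heute Abend Nudeln.']
    embeddings = model.encode(german_sentences)

    # Cluster the embeddings, e.g. with k-means
    labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
    print(labels)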
  • Fine-tune multilingual model for domain specific vocab

    Thanks for the repository and for continuous updates.

    Wanted to check if I understood it correctly: is it possible to continue fine-tuning one of the multilingual models for a specific domain? For example, can I take 'xlm-r-distilroberta-base-paraphrase-v1' and fine-tune it on domain-related parallel data (English-other languages) with MultipleNegativesRankingLoss?

    opened by langineer 30
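    A minimal sketch of that idea (the parallel pairs are placeholders; with MultipleNegativesRankingLoss, each pair acts as a positive and the other in-batch sentences act as negatives):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer('xlm-r-distilroberta-base-paraphrase-v1')

    # Placeholder domain-specific parallel (English, other-language) pairs
    train_examples = [
        InputExample(texts=['The valve controls the coolant flow.', 'Das Ventil steuert den Kühlmittelfluss.']),
        InputExample(texts=['Replace the filter every month.', 'Ersetzen Sie den Filter jeden Monat.']),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)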
  • Is it possible to encode by using multi-GPU?

    Thanks for this beautiful package; it saves a lot of work when doing semantic embedding. I am processing a large database, trying to transform docs into an embedding matrix. When running the code, it seemed to use only a single GPU to encode the sentences. Is there any way I could do this with multiple GPUs?

    opened by z307287280 28
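    A sketch of the multi-process / multi-GPU encoding helpers that ship with the library (the sentence list is illustrative):

    from sentence_transformers import SentenceTransformer

    if __name__ == '__main__':  # required because worker processes are spawned
        model = SentenceTransformer('paraphrase-distilroberta-base-v1')
        sentences = ['sentence {}'.format(i) for i in range(100000)]

        # Starts one worker process per available GPU (or pass target_devices explicitly)
        pool = model.start_multi_process_pool()
        embeddings = model.encode_multi_process(sentences, pool)
        model.stop_multi_process_pool(pool)
        print(embeddings.shape)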
  • public.ukp.informatik.tu-darmstadt.de Unreachable

    It looks like the server which hosts the pre-trained models (https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/) has been unavailable for a few hours now.

    opened by Ganners 20
  • HTTPError: 403 Client Error:

    I get a request error and I do not know why.

    
    [W 2021-02-02 18:43:15,951] Trial 0 failed because of the following error: HTTPError('403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip',)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 211, in _run_trial
        value_or_values = func(trial)
      File "<ipython-input-6-af5cb77f5b44>", line 40, in objective
        model = SentenceTransformer(model_name)  # distiluse-base-multilingual-cased-v2  distilbert-multilingual-nli-stsb-quora-ranking
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 92, in __init__
        raise e
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
        http_get(model_url, zip_save_path)
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/util.py", line 201, in http_get
        req.raise_for_status()
      File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    opened by tide90 15
  • Fine-tune underlying language model for SBERT

    Hi,

    I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my data corpus is fairly domain-specific, and I suspect that SBERT will underperform as it was trained on generic Wiki/library corpora. So, I wonder if there are any recommendations around fine-tuning the underlying language model for SBERT.

    I envision that the overall process will be the following:

    1. Take a pre-trained BERT model
    2. Fine-tune the language model on the domain-specific corpus
    3. Then retrain the SBERT model architecture on specific tasks (e.g. the SNLI dataset/task)

    Curious to hear thoughts on the approach and problem definition.

    opened by vdabravolski 14
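    A compact sketch of step 2 with Hugging Face transformers (the domain_texts list, model name, and hyperparameters are placeholders); the resulting checkpoint can then be loaded with models.Transformer and trained as in step 3:

    import torch
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    class TextDataset(torch.utils.data.Dataset):
        def __init__(self, texts, tokenizer):
            self.enc = tokenizer(texts, truncation=True, max_length=256)
        def __len__(self):
            return len(self.enc['input_ids'])
        def __getitem__(self, idx):
            return {key: values[idx] for key, values in self.enc.items()}

    domain_texts = ['first domain sentence ...', 'second domain sentence ...']  # placeholder corpus
    tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    mlm_model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

    trainer = Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir='bert-domain-mlm', num_train_epochs=1, per_device_train_batch_size=16),
        train_dataset=TextDataset(domain_texts, tokenizer),
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
    )
    trainer.train()
    mlm_model.save_pretrained('bert-domain-mlm')
    tokenizer.save_pretrained('bert-domain-mlm')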
  • 'torch._C.PyTorchFileReader' object has no attribute 'seek'

    Hello,

    I am using the following model for sentence similarity

    https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual/tree/main

    word_embedding_model = models.Transformer(bert_model_dir)  # , max_seq_length=512
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device=device_str)
    

    But, I get this error:

    Traceback (most recent call last):
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 306, in _check_seekable
    
        f.seek(f.tell())
    
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1205, in from_pretrained
    
        state_dict = torch.load(resolved_archive_file, map_location="cpu")
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
    
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/moxing/framework/file/file_io_patch.py", line 200, in _load
    
        _check_seekable(f)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 309, in _check_seekable
    
        raise_err_msg(["seek", "tell"], e)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 302, in raise_err_msg
    
        raise type(e)(msg)
    
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "code/similarity.py", line 118, in <module>
    
        word_embedding_model = models.Transformer(bert_model_dir) #, max_seq_length=512
    
      File "/home/work/anaconda/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 30, in __init__
    
        self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
    
        return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
    
        f"Unable to load weights from pytorch checkpoint file for'{pretrained_model_name_or_path}' "
    
    OSError: Unable to load weights from pytorch checkpoint file for '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/' at '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
    

    I checked on web but could not find any solution. What could be the problem? Thank you.

    opened by deadsoul44 13
  • Getting SSL Error in downloading "distilroberta-base-paraphrase-v1" model embeddings

    I am using Google Colab with PyTorch version 1.7.0+cu101. I am getting an SSL error when trying to download the "distilroberta-base-paraphrase-v1" model.

    Code

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('distilroberta-base-paraphrase-v1')

    Error

    SSLError                                  Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
        599                 body=body, headers=headers,
    --> 600                 chunked=chunked)
        601

    24 frames
    SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)

    During handling of the above exception, another exception occurred:

    MaxRetryError                             Traceback (most recent call last)
    MaxRetryError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    SSLError                                  Traceback (most recent call last)
    SSLError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    FileNotFoundError                         Traceback (most recent call last)
    /usr/lib/python3.6/shutil.py in rmtree(path, ignore_errors, onerror)
        473     # lstat()/open()/fstat() trick.
        474     try:
    --> 475         orig_st = os.lstat(path)
        476     except Exception:
        477         onerror(os.lstat, path, sys.exc_info())

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/sentence_transformers/sbert.net_models_distilroberta-base-paraphrase-v1'

    opened by rahuliitkgp31 13
  • getting Wrong shape for input_ids, while trying to replicate example

    error

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-8-1e9a71fa956b> in <module>
    ----> 1 all_common_phase_vec = model_sent.encode(all_common_phase)
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, is_pretokenized)
        185 
        186             with torch.no_grad():
    --> 187                 out_features = self.forward(features)
        188                 embeddings = out_features[output_value]
        189 
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/torch/nn/modules/container.py in forward(self, input)
        115     def forward(self, input):
        116         for module in self:
    --> 117             input = module(input)
        118         return input
        119 
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/sentence_transformers/models/RoBERTa.py in forward(self, features)
         32     def forward(self, features):
         33         """Returns token_embeddings, cls_token"""
    ---> 34         output_states = self.roberta(**features)
         35         output_tokens = output_states[0]
         36         cls_tokens = output_tokens[:, 0, :]  # CLS token is first token
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
        802         # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        803         # ourselves in which case we just need to make it broadcastable to all heads.
    --> 804         extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
        805 
        806         # If a 2D ou 3D attention mask is provided for the cross-attention
    
    /flucast/anaconda3/envs/e2/lib/python3.8/site-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
        258                 extended_attention_mask = attention_mask[:, None, None, :]
        259         else:
    --> 260             raise ValueError(
        261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
        262                     input_shape, attention_mask.shape
    
    ValueError: Wrong shape for input_ids (shape torch.Size([40])) or attention_mask (shape torch.Size([40]))
    
    • The error is generated when using the pretrained model (model_name: roberta-large-nli-stsb-mean-tokens)

    system running on

    Python 3.8.5 (default, Aug  5 2020, 08:36:46) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    

    package installed in env

    Package               Version
    --------------------- -------------------
    argon2-cffi           20.1.0
    attrs                 20.1.0
    backcall              0.2.0
    bleach                3.1.5
    blis                  0.4.1
    catalogue             1.0.0
    certifi               2020.6.20
    cffi                  1.14.2
    chardet               3.0.4
    click                 7.1.2
    cymem                 2.0.3
    decorator             4.4.2
    defusedxml            0.6.0
    entrypoints           0.3
    filelock              3.0.12
    future                0.18.2
    idna                  2.10
    ipykernel             5.3.4
    ipython               7.17.0
    ipython-genutils      0.2.0
    jedi                  0.17.2
    Jinja2                2.11.2
    joblib                0.16.0
    json5                 0.9.5
    jsonschema            3.2.0
    jupyter-client        6.1.7
    jupyter-core          4.6.3
    jupyterlab            2.2.6
    jupyterlab-server     1.2.0
    MarkupSafe            1.1.1
    mistune               0.8.4
    mkl-fft               1.1.0
    mkl-random            1.1.1
    mkl-service           2.3.0
    murmurhash            1.0.2
    nbconvert             5.6.1
    nbformat              5.0.7
    nltk                  3.5
    notebook              6.1.3
    numpy                 1.19.1
    olefile               0.46
    packaging             20.4
    pandas                1.1.1
    pandocfilters         1.4.2
    parso                 0.7.1
    pexpect               4.8.0
    pickleshare           0.7.5
    Pillow                7.2.0
    pip                   20.2.2
    plac                  1.1.3
    preshed               3.0.2
    prometheus-client     0.8.0
    prompt-toolkit        3.0.6
    ptyprocess            0.6.0
    pycparser             2.20
    Pygments              2.6.1
    pyparsing             2.4.7
    pyrsistent            0.16.0
    python-dateutil       2.8.1
    pytz                  2020.1
    pyzmq                 19.0.2
    regex                 2020.7.14
    requests              2.24.0
    sacremoses            0.0.43
    scikit-learn          0.23.2
    scipy                 1.5.2
    Send2Trash            1.5.0
    sentence-transformers 0.3.3
    sentencepiece         0.1.91
    setuptools            49.6.0.post20200814
    six                   1.15.0
    spacy                 2.3.2
    srsly                 1.0.2
    terminado             0.8.3
    testpath              0.4.4
    thinc                 7.4.1
    threadpoolctl         2.1.0
    tokenizers            0.8.1rc2
    torch                 1.6.0
    torchvision           0.7.0
    tornado               6.0.4
    tqdm                  4.48.2
    traitlets             4.3.3
    transformers          3.0.2
    urllib3               1.25.10
    wasabi                0.8.0
    wcwidth               0.2.5
    webencodings          0.5.1
    wheel                 0.35.1
    xlrd                  1.2.0
    
    opened by amiyamandal-dev 13
  • T5 models for sentence transformers

    Thank you for a terrific project! Are you planning on integrating T5 models into sentence-transformers (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/README.md#released-model-checkpoints)?

    opened by e-ndrus 13
  • ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    After pip installing and trying to import SentenceTransformer I get this error: ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    When I look into the source code, the only folder I have is models. I am missing evaluation, etc. Any idea why?

    opened by DavidBegert 13
  • RerankingEvaluator taking too much time

    Hi,

    I am currently working on fine-tuning "distiluse-base-multilingual-cased-v1", using MultipleNegativesRankingLoss and RerankingEvaluator, over a dataset of 700k (query, sentence) pairs. I'm currently facing a problem with the evaluator, as it takes too much time for an evaluation over approximately 8000 unique evaluation pairs. I am using a GPU for the task. Is this normal behaviour?

    Thank you for your help !

    opened by chaalic 3
  • ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found

    I am getting error at line:

    from sentence_transformers import SentenceTransformer, util

    Error Log: https://gist.github.com/AnkS4/10be2ea6ecc226cb309e9d4f587bc994

    opened by AnkS4 0
  • Can not see the Loss values during training

    Hi,

    I am training a model according to the Training Overview.

    And I found that there is no output of the loss value for each iteration.

    Could you please tell me how to print the loss values? (the score of the loss function)

    opened by ralgond 1
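    One workaround (a hypothetical wrapper, not part of the library) is to wrap the loss module and print its value at every training step:

    import torch.nn as nn
    from sentence_transformers import losses

    class PrintLoss(nn.Module):
        # Hypothetical wrapper that prints the loss value of every training step
        def __init__(self, loss):
            super().__init__()
            self.loss = loss

        def forward(self, sentence_features, labels):
            value = self.loss(sentence_features, labels)
            print('loss:', value.item())
            return value

    # train_loss = PrintLoss(losses.CosineSimilarityLoss(model))
    # model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)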
  • How to reproduce the model "all-MiniLM-L6-v2"

    Hi,

    In the page, it says "With SentenceTransformer('all-MiniLM-L6-v2') we define which sentence transformer model we like to load. In this example, we load all-MiniLM-L6-v2, which is a MiniLM model fine tuned on a large dataset of over 1 billion training pairs."

    If I want to reproduce the model "all-MiniLM-L6-v2", how should I do it?

    Could you please make the code that produces the model available to us?

    Thanks.

    opened by ralgond 1
  • Bugfix - missing HF token to retrieve model information

    The SentenceTransformer class supports the use_auth_token parameter, but the snapshot_download function does not use the huggingface token to retrieve the model information before downloading the model. Therefore, the huggingface api returns a 404 if you want to use private models (_api.model_info).

    opened by ArzelaAscoIi 0
  • embed on gpu, load on cpu

    Is it possible to create embeddings on a GPU, but then load them on a CPU?

    When I try to load the pickled embeddings, I receive the error:

    Unpickling error: pickle files truncated

    opened by meyerjoe-R 5
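    One way to avoid device issues (a sketch; the file names and sentences are illustrative) is to keep the embeddings as numpy arrays before pickling, and to map any saved torch tensors to the CPU when loading:

    import pickle
    import torch
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('paraphrase-distilroberta-base-v1')
    sentences = ['first sentence', 'second sentence']

    # numpy arrays always live in CPU memory and can be unpickled on any machine
    embeddings = model.encode(sentences, convert_to_numpy=True)
    with open('embeddings.pkl', 'wb') as f_out:
        pickle.dump(embeddings, f_out)

    # If GPU torch tensors were saved with torch.save instead, map them to the CPU when loading:
    # embeddings = torch.load('embeddings.pt', map_location='cpu')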
Releases(v2.2.0)
  • v2.2.0(Feb 10, 2022)

    T5

    You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:

    from sentence_transformers import SentenceTransformer, models
    word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    

    See the T5 benchmark results - the T5 encoder is not the best model for learning text embeddings. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

    New Models

    The models from the papers Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models and Large Dual Encoders Are Generalizable Retrievers have been added:

    For benchmark results, see https://seb.sbert.net

    Private Models

    Thanks to #1406 you can now load private models from the hub:

    model = SentenceTransformer("your-username/your-model", use_auth_token=True)
    
    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(Oct 1, 2021)

    This is a smaller release with some new features

    MarginMSELoss

    MarginMSELoss is a great method to train embedding models with the help of a cross-encoder model. The details are explained here: MSMARCO - MarginMSE Training

    You pass your training data in the format:

    InputExample(texts=[query, positive, negative], label=cross_encoder.predict([query, positive]) - cross_encoder.predict([query, negative]))
    

    MultipleNegativesSymmetricRankingLoss

    MultipleNegativesRankingLoss computes the loss in just one direction: find the correct answer for a given question.

    MultipleNegativesSymmetricRankingLoss also computes the loss in the other direction: find the correct question for a given answer.

    Breaking Change: CLIPModel

    The CLIPModel is now based on the transformers model.

    You can still load it like this:

    model = SentenceTransformer('clip-ViT-B-32')
    

    Older SentenceTransformers versions are no longer able to load and use the 'clip-ViT-B-32' model.

    Added files on the hub are automatically downloaded

    PR #1116 checks if you have all files in your local cache or if there are added files on the hub. If this is the case, it will automatically download them.

    SentenceTransformers.encode() can return all values

    When you set output_value=None for the encode method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.
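    A small sketch (the model name is taken from the examples above):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    out = model.encode(['An example sentence'], output_value=None)
    print(out[0].keys())  # token ids, token embeddings, sentence embedding, ...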

    Source code(tar.gz)
    Source code(zip)
  • v2.0.0(Jun 24, 2021)

    Models hosted on the hub

    All pre-trained models are now hosted on the Huggingface Models hub.

    Our pre-trained models can be found here: https://huggingface.co/sentence-transformers

    But you can also easily share your own sentence-transformers model on the hub and have other people easily access it. Simply upload the folder and have people load it via:

    model = SentenceTransformer('[your_username]/[model_name]')
    

    For more information, see: Sentence Transformers in the Hugging Face Hub

    Breaking changes

    There should be no breaking changes. Old models can still be loaded from disk. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence-transformers, as the cache path has slightly changed.

    Find sentence-transformer models on the Hub

    You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers

    Add the sentence-transformers tag to your model card so that others can find your model.

    Widget & Inference API

    A widget was added to sentence-transformers models on the hub that lets you interact directly on the models website: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

    Further, models can now be used with the Accelerated Inference API: send your sentences to the API and get back the embeddings from the respective model.

    Save Model to Hub

    A new method was added to the SentenceTransformer class: save_to_hub.

    Provide the model name and the model is saved on the hub.

    Here you find the explanation from transformers how the hub works: Model sharing and uploading

    Automatic Model Card

    When you save a model with save or save_to_hub, a README.md (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.

    New Models

    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jun 24, 2021)

  • v1.2.0(May 12, 2021)

    Unsupervised Sentence Embedding Learning

    New methods integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existent methods.

    New methods:

    Pre-Training Methods

    • MLM: An example script to run Masked Language Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve performance. Further, MLM also works well for domain transfer: you first train on your custom data, and then train with e.g. NLI or STS data.

    Training Examples

    New models

    New Functions

    • SentenceTransformer.fit() Checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps. More info
    • Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
      pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
      

      Valid values are mean/max/cls.

    • NoDuplicatesDataLoader: When using MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Apr 21, 2021)

    Unsupervised Sentence Embedding Learning

    This release integrates methods that allow learning sentence embeddings without labeled data:

    • TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks.
    • GenQ: GenQ uses a pre-trained T5 model to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for semantic search: https://www.sbert.net/examples/applications/semantic-search/README.html

    New Models - SentenceTransformer

    • MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.
    • MSMARCO MiniLM Models: We uploaded some models based on MiniLM: they use just 384 dimensions, are faster than previous models, and achieve nearly the same performance

    New Models - CrossEncoder

    New Features

    • You can now pass to the CrossEncoder class a default_activation_function, that is applied on-top of the output logits generated by the class.
    • You can now pre-process images for the CLIP Model. Soon I will release a tutorial on how to fine-tune the CLIP Model with your data.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.4(Apr 1, 2021)

    It was not possible to fine-tune and save the CLIPModel. This release fixes it. CLIPModel can now be saved like any other model by calling model.save(path).

    Source code(tar.gz)
    Source code(zip)
  • v1.0.3(Mar 22, 2021)

  • v1.0.2(Mar 19, 2021)

    v1.0.2 - Patch for CLIPModel, new Image Examples

    • Bugfix in CLIPModel: Too long inputs raised a RuntimeError. Now they are truncated.
    • New util function: util.paraphrase_mining_embeddings, to find most similar embeddings in a matrix
    • Image Clustering and Duplicate Image Detection examples added: more info
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Mar 18, 2021)

    This release brings many improvements and new features. Also, the version number scheme is updated. We now use the format x.y.z, with x for major releases, y for smaller releases with new features, and z for bugfixes.

    Text-Image-Model CLIP

    You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    #Load CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    
    #Encode an image:
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    #Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    #Compute cosine similarities 
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    

    More Information IPython Demo Colab Demo

    Examples of how to train the CLIP model on your data will be added soon.

    New Models

    New Features

    • The Asym Model can now be used as the first model in a SentenceTransformer modules list.
    • Sorting when encoding changes: Previously, we encoded from short to long sentences. Now we encode from long to short sentences, so out-of-memory errors will happen at the start. Also, the estimate of the encoding duration is more precise.
    • Improvement of the util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used
    • New util methods: util.dot_score computes the dot product of two embedding matrices. util.normalize_embeddings will normalize embeddings to unit length
    • New parameter for the SentenceTransformer.encode method: normalize_embeddings. If set to True, it will normalize embeddings to unit length; in that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
    • If you specify do_lower_case=True in models.Transformer() when creating a new SentenceTransformer, all input will be lower-cased.

    New Examples

    Bugfixes

    • The encode method now correctly returns token_embeddings if output_value='token_embeddings' is defined
    • Bugfix of the LabelAccuracyEvaluator
    • Bugfix: tensors are no longer moved off the GPU when you specify encode(sent, convert_to_tensor=True); they now stay on the GPU

    Breaking changes:

    • SentenceTransformer.encode method: Removed the deprecated parameters is_pretokenized and num_workers
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Jan 4, 2021)

    Refactored Tokenization

    • Faster tokenization speed: Using batched tokenization for training & inference - now, all sentences in a batch are tokenized simultaneously.
    • Usage of the SentencesDataset no longer needed for training. You can pass your train examples directly to the DataLoader:
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample

    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
        InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    • If you use a custom torch Dataset class: The dataset class must now return InputExample objects instead of tokenized texts
    • The SentenceLabelDataset class has been updated to the new tokenization flow: it always returns two or more InputExamples with the same label

    Asymmetric Models: Added the new models.Asym class that allows different encoding of sentences based on a tag (e.g. query vs. paragraph). Minimal example:

    word_embedding_model = models.Transformer(base_model, max_seq_length=250)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
    
    ##Your input examples have to look like this:
    inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
    
    ##Encoding (Note: Mixed inputs are not allowed)
    model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
    

    Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' go through the d2 dense layer. More documentation on how to design asymmetric models will follow soon.

    New Namespace & Models for Cross-Encoder: Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI.

    Logging: Log messages now use a custom logger from logging, thanks to PR #623. This allows you to choose which log messages you want to see from which components.

    Unit tests: A lot more unit tests have been added, testing the different components of the framework.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Dec 22, 2020)

    • Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
    • New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.
    • New application example for information retrieval and question answering retrieval. Together with respective pre-trained models
    Source code(tar.gz)
    Source code(zip)
  • v0.3.9(Nov 18, 2020)

    This release only includes some smaller updates:

    • Code was tested with transformers 3.5.1, requirement was updated so that it works with transformers 3.5.1
    • As some parts and models require Pytorch >= 1.6.0, requirement was updated to require at least pytorch 1.6.0. Most of the code and models will work with older pytorch versions.
    • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
    • The CrossEncoder-Class now accepts a max_length parameter to control the truncation of inputs
    • The CrossEncoder predict method now has an apply_softmax parameter that allows applying softmax on top of a multi-class output.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.8(Oct 19, 2020)

    • Added support for training and using CrossEncoder models
    • Data Augmentation method AugSBERT added
    • New models trained on large-scale paraphrase data. The models work much better on internal benchmarks than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1
    • New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
    • Improved MultipleNegativesRankingLoss loss function: the similarity function can be changed and is now cosine similarity (was dot-product before); further, similarity scores can be multiplied by a scaling factor. This allows the usage of the NTXentLoss / InfoNCE loss.
    • New MegaBatchMarginLoss, inspired by the ParaNMT paper.

    Smaller changes:

    • Update InformationRetrievalEvaluator, so that it can work with large corpora (Millions of entries). Removed the query_chunk_size parameter from the evaluator
    • SentenceTransformer.encode method detaches tensors from compute graph
    • SentenceTransformer.fit() method - Parameter output_path_ignore_not_empty deprecated. No longer checks that target folder must be empty
    Source code(tar.gz)
    Source code(zip)
  • v0.3.7(Sep 29, 2020)

    • Upgrade transformers dependency, transformers 3.1.0, 3.2.0 and 3.3.1 are working
    • Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
    • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

    Minor changes:

    • Tokenization in the multi-process encoding setup now happens in the child processes, not in the parent process.
    • Added models.Normalize() to allow the normalization of embeddings to unit length
    Source code(tar.gz)
    Source code(zip)
  • v0.3.6(Sep 11, 2020)

    Huggingface Transformers version 3.1.0 introduced a breaking change compared to the previous version 3.0.2.

    This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.5(Sep 1, 2020)

    • The old FP16 training code in model.fit() was replaced by PyTorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used. On suitable GPUs, this leads to a significant speed-up while requiring less memory.
    • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
    • If a sentence-transformers model is not found, it will fall back to the huggingface transformers repository and create it with mean pooling.
    • Fixed huggingface transformers to version 3.0.2. The next release will make it compatible with huggingface transformers 3.1.0
    • Several bugfixes: downloading of files, multi-GPU encoding
    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Aug 24, 2020)

    • The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
    • The dataset to hold training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background thread. This substantially reduces the start-up time for training.
    • model.encode() uses also a PyTorch DataSet + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background leading to faster encoding speed for large corpora.
    • Added functions and an example for multi-GPU encoding - this method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.
    • Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.
    • Smaller bugfixes

    Breaking changes:

    • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator
    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Aug 6, 2020)

    New Functions

    • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
    • Tokenization of datasets for training can now run in parallel (Linux Only)
    • New example for Quora Duplicate Questions Retrieval: See examples-folder
    • Many small improvements for training better models for Information Retrieval
    • Fixed LabelSampler (can be used to get batches with a certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
    • Added new Evaluators for ParaphraseMining and InformationRetrieval
    • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
    • model.encode - when the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
    • New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
    • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/

    Breaking Changes

    • The evaluators (like the EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Jul 23, 2020)

    This is a minor release. There should be no breaking changes.

    • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
    • util.pytorch_cos_sim - new method to compute cosine similarity with PyTorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
    • SentenceTransformer.encode: new parameter convert_to_tensor. If set to True, encode returns one large PyTorch tensor with your embeddings
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jul 22, 2020)

    This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

    The examples for training multi-lingual sentence embedding models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

    The following classes/files have been changed:

    • datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.

    New evaluation files:

    • evaluation/MSEEvaluator.py - breaking change. Now, this class expects lists of strings with parallel (translated) sentences. The old class has been renamed to MSEEvaluatorFromDataLoader.py
    • evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
    • evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
    • evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader

    Bugfixes:

    • model.encode() failed to sort sentences by length. This function has been fixed to boost encoding speed by reducing overhead of padding tokens.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jul 9, 2020)

    This release updates HuggingFace transformers to v3.0.2. Transformers made some breaking changes to the tokenization API. This (and future) versions will not be compatible with HuggingFace transformers v2.

    There are no known breaking changes for existing models or existing code. Models trained with version 2 can be loaded without issues.

    New Loss Functions

    Thanks to PR #299 and #176, several new loss functions were added: different triplet loss functions and ContrastiveLoss.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.6(Apr 16, 2020)

    This release updates huggingface/transformers to release v2.8.0.

    New Features

    • models.Transformer: The Transformer model can now load any huggingface transformers model, like BERT, RoBERTa, XLNet, XLM-R, Electra... It is based on the AutoModel from HuggingFace. You no longer need the architecture-specific models (like models.BERT, models.RoBERTa). It also works with the community models.
    • Multilingual Training: Code is released for making mono-lingual sentence embedding models multi-lingual. See training_multilingual.py for an example. More documentation and details will follow soon.
    • WKPooling: Added a PyTorch implementation of SBERT-WK. Note: due to an inefficient implementation of QR decomposition in PyTorch, WKPooling can only be run on the CPU, which makes it about 40 times slower than mean pooling. For some models WKPooling improves the performance, for others it doesn't.
    • WeightedLayerPooling: A new pooling layer that uses representations from all transformer layers and learns a weighted sum of them. So far no improvement compared to only averaging the last layer.
    • New pre-trained models released. Every available model is documented in a Google Spreadsheet for an easier overview.

    Minor changes

    • Clean-up of the examples folder.
    • Model and tokenizer arguments can now be passed to the corresponding transformers models.
    • Previous versions had some issues with RoBERTa and XLM-RoBERTa, where the wrong special characters were added. Everything is fixed now and relies on huggingface transformers for the correct addition of special characters to the input sentences.

    Breaking changes

    • STSDataReader: The default parameter values have been changed so that it expects the sentences in the first two columns and the score in the third column. If you want to load the STS benchmark dataset, you can use the STSBenchmarkDataReader.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(Jan 10, 2020)

    huggingface/transformers was updated to version 2.3.0

    Changes:

    • ALBERT works (the bug was fixed in transformers). Does not yield improvements compared to BERT / RoBERTa
    • T5 added (does not run on GPU due to a bug in transformers). Does not yield improvements compared to BERT / RoBERTa
    • CamemBERT added
    • XLM-RoBERTa added
    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Dec 6, 2019)

    This version updates the underlying HuggingFace Transformers package to v2.2.1.

    Changes:

    • DistilBERT and ALBERT modules added
    • Pre-trained models for RoBERTa and DistilBERT uploaded
    • Some smaller bug-fixes
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Aug 20, 2019)

    No breaking changes. Just update with pip install -U sentence-transformers

    Bugfixes:

    • SentenceTransformers can now be used on Windows (previously threw an exception about invalid tensor types)
    • Outputs a warning if the sequence length for BERT / RoBERTa is too long

    Improvements:

    • A flag can be set to hide the progress bar when a dataset is converted or an evaluator is executed
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Aug 19, 2019)

    Updated pytorch-transformers to v1.1.0. Adding support for RoBERTa model.

    Bugfixes:

    • Critical bugfix for SoftmaxLoss: Classifier weights were not optimized in previous version
    • Minor fix for including the timestamp of the output folders
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Aug 16, 2019)

  • v0.2.0(Aug 16, 2019)

    v0.2.0 completely changes the architecture of sentence transformers.

    The new architecture is sequential: you define individual models that transform a sentence step by step into a fixed-size sentence embedding.

    The modular architecture allows you to easily swap different components. You can choose between different embedding methods (BERT, XLNet, word embeddings), transformations (LSTM, CNN), and weighting & pooling methods, as well as adding deep averaging networks.

    New models in this release:

    • Word Embeddings (like GloVe) for computation of average word embeddings
    • Word weighting, for example, with tf-idf values
    • BiLSTM and CNN encoder, for example, to re-create the InferSent model
    • Bag-of-Words (BoW) sentence representation. Optionally also with tf-idf weighting.

    This release has many breaking changes compared with the previous release. If you need help with the migration, open a new issue.

    New model storing procedure: Each sub-module is stored in its own subfolder. If you need to migrate old models, it is best to let the system create the subfolder structure (model.save()) and then copy the pytorch_model.bin into the correct subfolder.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Jul 25, 2019)
