Active learning for text classification in Python

Overview

PyPI codecov Documentation Status GitHub

small-text logo

Active Learning for Text Classifcation in Python.


Installation | Quick Start | Docs


Active Learning allows you to efficiently label training data in a small-data scenario.

This library provides state-of-the-art active learning for text classification which allows to easily mix and match many classifiers and query strategies to build active learning experiments or applications.

Features

  • Provides unified interfaces for Active Learning so that you can easily use any classifier provided by sklearn.
  • (Optionally) As an optional feature, you can also use pytorch classifiers, including transformers models.
  • Multiple scientifically-proven strategies re-implemented: Query Strategies, Initialization Strategies

Installation

Small-text can be easily installed via pip:

pip install small-text

For a full installation include the transformers extra requirement:

pip install small-text[transformers]

Requires Python 3.7 or newer. For using the GPU, CUDA 10.1 or newer is required. More information regarding the installation can be found in the documentation.

Quick Start

For a quick start, see the provided examples for binary classification, pytorch multi-class classification, or transformer-based multi-class classification

Documentation

Read the latest documentation (currently work in progress) here.

Alternatives

Contribution

Contributions are welcome. Details can be found in CONTRIBUTING.md.

Acknowledgments

This software was created by @chschroeder at Leipzig University's NLP group which is a part of the Webis research network. The encompassing project was funded by the Development Bank of Saxony (SAB) under project number 100335729.

Citation

A preprint which introduces small-text is available here:
Small-text: Active Learning for Text Classification in Python.

@misc{schroeder2021smalltext,
    title={Small-text: Active Learning for Text Classification in Python}, 
    author={Christopher Schröder and Lydia Müller and Andreas Niekler and Martin Potthast},
    year={2021},
    eprint={2107.10314},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

License

MIT License

Comments
  • initialize_active_learner error

    initialize_active_learner error

    I am trying to initialize a active learner for text classification using transformer. I have 11014 classes which need to be trained by the classification model. My data set is highly imbalanced. While doing the initialize_active_learner( active_learner, y_train) I have used

    def initialize_active_learner(active_learner, y_train):
    
        x_indices_initial = random_initialization(y_train)
        #random_initialization_stratified(y_train, n_samples=11015)
        #random_initialization_balanced(y_train)
        
        y_initial = np.array([y_train[i] for i in x_indices_initial])
    
        active_learner.initialize_data(x_indices_initial, y_initial)
    
        return x_indices_initial
    

    But I get this error always:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-23-d0348c5b7547> in <module>
          1 # Active learner
          2 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, x_train)
    ----> 3 labeled_indices = initialize_active_learner(active_learner, y_train)
          4 #
    
    <ipython-input-22-ed58e0714c48> in initialize_active_learner(active_learner, y_train)
         17     y_initial = np.array([y_train[i] for i in x_indices_initial])
         18 
    ---> 19     active_learner.initialize_data(x_indices_initial, y_initial)
         20 
         21     return x_indices_initial
    
    ~/.local/lib/python3.7/site-packages/small_text/active_learner.py in initialize_data(self, x_indices_initial, y_initial, x_indices_ignored, x_indices_validation, retrain)
        139 
        140         if retrain:
    --> 141             self._retrain(x_indices_validation=x_indices_validation)
        142 
        143     def query(self, num_samples=10, x=None, query_strategy_kwargs=None):
    
    ~/.local/lib/python3.7/site-packages/small_text/active_learner.py in _retrain(self, x_indices_validation)
        380 
        381         if x_indices_validation is None:
    --> 382             self._clf.fit(x)
        383         else:
        384             indices = np.arange(self.x_indices_labeled.shape[0])
    
    ~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in fit(self, train_set, validation_set, optimizer, scheduler)
        332         self.class_weights_ = self.initialize_class_weights(sub_train)
        333 
    --> 334         return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
        335 
        336     def initialize_class_weights(self, sub_train):
    
    ~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in _fit_main(self, sub_train, sub_valid, optimizer, scheduler)
        351                 raise ValueError('Conflicting information about the number of classes: '
        352                                  'expected: {}, encountered: {}'.format(self.num_classes,
    --> 353                                                                         np.max(y) + 1))
        354 
        355             self.initialize_transformer(self.cache_dir)
    
    ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 8530
    

    Please help here.

    Thanks in advance

    opened by neel17 8
  • Getting error 'RuntimeError: expected scalar type Long but found Int' while running the starting code

    Getting error 'RuntimeError: expected scalar type Long but found Int' while running the starting code

    Bug description

    I am getting the following error

    RuntimeError: expected scalar type Long but found Int

    related to the line

    indices_labeled = initialize_active_learner(active_learner, train.y)

    in the code provided here

    https://github.com/webis-de/small-text/blob/v1.1.1/examples/notebooks/02-active-learning-with-stopping-criteria.ipynb

    I am using the latest version.

    Python version: 3.8.8 small-text version: 1.1.1 torch version (if applicable): 1.13.0+cpu

    Full error:

    RuntimeError Traceback (most recent call last) in 28 29 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train) ---> 30 indices_labeled = initialize_active_learner(active_learner, train.y) 31

    in initialize_active_learner(active_learner, y_train) 12 13 indices_initial = random_initialization_balanced(y_train, n_samples=20) ---> 14 active_learner.initialize_data(indices_initial, y_train[indices_initial]) 15 16 return indices_initial

    ~\Anaconda3\lib\site-packages\small_text\active_learner.py in initialize_data(self, indices_initial, y_initial, indices_ignored, indices_validation, retrain) 149 150 if retrain: --> 151 self._retrain(indices_validation=indices_validation) 152 153 def query(self, num_samples=10, representation=None, query_strategy_kwargs=dict()):

    ~\Anaconda3\lib\site-packages\small_text\active_learner.py in _retrain(self, indices_validation) 388 389 if indices_validation is None: --> 390 self._clf.fit(dataset) 391 else: 392 indices = np.arange(self.indices_labeled.shape[0])

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in fit(self, train_set, validation_set, weights, early_stopping, model_selection, optimizer, scheduler) 366 use_sample_weights=weights is not None) 367 --> 368 return self._fit_main(sub_train, sub_valid, sub_train_weights, early_stopping, 369 model_selection, fit_optimizer, fit_scheduler) 370

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _fit_main(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler) 389 390 with tempfile.TemporaryDirectory(dir=get_tmp_dir_base()) as tmp_dir: --> 391 self._train(sub_train, sub_valid, weights, early_stopping, model_selection, 392 optimizer, scheduler, tmp_dir) 393 self._perform_model_selection(optimizer, model_selection)

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train(self, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir) 435 start_time = datetime.datetime.now() 436 --> 437 train_acc, train_loss, valid_acc, valid_loss, stop = self._train_loop_epoch(epoch, 438 sub_train, 439 sub_valid,

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train_loop_epoch(self, num_epoch, sub_train, sub_valid, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir) 471 validate_every = None 472 --> 473 train_loss, train_acc, valid_loss, valid_acc, stop = self._train_loop_process_batches( 474 num_epoch, 475 sub_train,

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in train_loop_process_batches(self, num_epoch, sub_train, sub_valid_, weights, early_stopping, model_selection, optimizer, scheduler, tmp_dir, validate_every) 505 for i, (x, masks, cls, weight, *_) in enumerate(train_iter): 506 if not stop: --> 507 loss, acc = self._train_single_batch(x, masks, cls, weight, optimizer) 508 scheduler.step() 509

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _train_single_batch(self, x, masks, cls, weight, optimizer) 561 outputs = self.model(x, attention_mask=masks) 562 --> 563 logits, loss = self._compute_loss(cls, outputs) 564 loss = loss * weight 565 loss = loss.mean()

    ~\Anaconda3\lib\site-packages\small_text\integrations\transformers\classifiers\classification.py in _compute_loss(self, cls, outputs) 585 logits = outputs.logits.view(-1, self.num_classes) 586 target = cls --> 587 loss = self.criterion(logits, target) 588 589 return logits, loss

    ~\Anaconda3\lib\site-packages\torch\nn\modules\module.py in _call_impl(self, *input, **kwargs) 1188 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks 1189 or _global_forward_hooks or _global_forward_pre_hooks): -> 1190 return forward_call(*input, **kwargs) 1191 # Do not call functions when jit is used 1192 full_backward_hooks, non_full_backward_hooks = [], []

    ~\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target) 1172 1173 def forward(self, input: Tensor, target: Tensor) -> Tensor: -> 1174 return F.cross_entropy(input, target, weight=self.weight, 1175 ignore_index=self.ignore_index, reduction=self.reduction, 1176 label_smoothing=self.label_smoothing)

    ~\Anaconda3\lib\site-packages\torch\nn\functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing) 3024 if size_average is not None or reduce is not None: 3025 reduction = _Reduction.legacy_get_string(size_average, reduce) -> 3026 return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing) 3027 3028

    RuntimeError: expected scalar type Long but found Int

    bug 
    opened by Nim248 5
  • SEALS: Similarity Search for Efficient Active Learning and Search of Rare Concepts

    SEALS: Similarity Search for Efficient Active Learning and Search of Rare Concepts

    Hello, thank you for open-sourcing this project. I would like to suggest adding the following method to the library: "Similarity Search for Efficient Active Learning and Search of Rare Concepts" Link: https://arxiv.org/abs/2007.00077 It seems that it can it well in this library, it is also possible to combine that with other methods. Sincerely, Kamer

    feature request 
    opened by kayuksel 4
  • Incremental Training Documentation

    Incremental Training Documentation

    In active_learner.py, the incremental training parameter is described as:

    incremental_training : bool
            If False, creates and trains a new classifier only before the first query,
            otherwise re-trains the existing classifier. Incremental training must be supported
            by the classifier provided by `clf_factory`."
    

    Is there a way to retrain the model from scratch after each queried batch? This documentation suggests we are updating the existing classifier in both cases as even when False, it "creates and trains a new classifier only before the first query."

    Thank you!

    documentation 
    opened by HannahKirk 4
  • Adding special tokens to tokenizer (transformers-integration)

    Adding special tokens to tokenizer (transformers-integration)

    I need to add some special tokens to the BERT tokenizer. However, I am not sure how to resize the model tokenizer to incorporate the added special tokens with the small-text transformers integration.

    With transformers, you can add special tokens using:

    tokenizer.add_tokens(['newWord', 'newWord2'])
    model.resize_token_embeddings(len(tokenizer)
    

    How does this change with a clf_factory and initialising the transformers model as a pool based active learner? E.g. with the code from the 01-active-learning-for-text-classification-with-small-text-intro.ipynb notebook:

    from small_text.integrations.transformers.datasets import TransformersDataset
    
    
    def get_transformers_dataset(tokenizer, data, labels, max_length=60):
    
        data_out = []
    
        for i, doc in enumerate(data):
            encoded_dict = tokenizer.encode_plus(
                doc,
                add_special_tokens=True,
                padding='max_length',
                max_length=max_length,
                return_attention_mask=True,
                return_tensors='pt',
                truncation='longest_first'
            )
    
            data_out.append((encoded_dict['input_ids'], encoded_dict['attention_mask'], labels[i]))
    
        return TransformersDataset(data_out)
    
    
    train = get_transformers_dataset(tokenizer, raw_dataset['train']['text'], raw_dataset['train']['label'])
    test = get_transformers_dataset(tokenizer, raw_dataset['test']['text'], raw_dataset['test']['label'])
    
    transformer_model = TransformerModelArguments(transformer_model_name)
    clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                        num_classes, 
                                                        kwargs=dict({'device': 'cuda', 
                                                                     'mini_batch_size': 32,
                                                                     'early_stopping_no_improvement': -1
                                                                    }))
    active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, train)
        
    
    question 
    opened by HannahKirk 4
  • Embeddings in EmbeddingKMeans and ContrastiveActiveLearning

    Embeddings in EmbeddingKMeans and ContrastiveActiveLearning

    Hi! Do they support embeddings from a language-agnostic model like LabSE or XLM-RoBERTa? (as this is not the case in their papers). Would it be possible to use any embeddings that we previously extract with those methods? If so, how we can do that? I believe that this could be very crucial for this library for not limiting its use to only English-language or any specific encoder.

    question 
    opened by kayuksel 3
  • Specifying multiple query strategies

    Specifying multiple query strategies

    When initialising a PoolBasedActiveLearner as active_learner then using active_learner.query(num_samples=20), it is possible to specify more than one query strategy i.e. select 5 examples by PredictionEntropy(), 5 by EmbeddingKMeans(), 5 by RandomSampling() etc.?

    I can initialise a new active learner object with a different query strategy for each sub-query but it would be great if you could specify multiple query strategies for the active learner.

    question 
    opened by HannahKirk 3
  • What are the best query strategies to use as a baseline approach?

    What are the best query strategies to use as a baseline approach?

    I'm not sure where to start to get a good baseline result with active learning for text classification. What query strategies should be attempted first? Is there something like this survey https://arxiv.org/abs/2203.13450 implemented for text classification?

    question 
    opened by renebidart 2
  • Quickstart Colab notebooks not working

    Quickstart Colab notebooks not working


    AttributeError Traceback (most recent call last) in 2 3 ----> 4 train = TransformersDataset.from_arrays(raw_dataset['train']['text'], 5 raw_dataset['train']['label'], 6 tokenizer,

    AttributeError: type object 'TransformersDataset' has no attribute 'from_arrays'

    bug 
    opened by kbschliep 2
  • fit() got an unexpected keyword argument 'validation_set'

    fit() got an unexpected keyword argument 'validation_set'

    Hi,

    I'm initializing an active learner for an Sklearn model with specific validation indices. Minimal code example is:

    def initialize_learner(learner, train, test_sets, init_n): 
      print('\n----Initalising----\n')
      iter_results_dict = {}
      iter_preds_dict = {}
      #Initialize the model - This is required for model-based query strategies.
      indices_neg_label = np.where(train.y == 0)[0]
      indices_pos_label = np.where(train.y == 1)[0]
    if init_n ==4:
         x_indices_initial = np.concatenate([np.random.choice(indices_pos_label, int(init_n/2), replace=False),
      np.random.choice(indices_neg_label, int(init_n/2), replace=False)])
          x_indices_initial = x_indices_initial.astype(int)
          y_initial = np.array([train.y[i] for i in x_indices_initial])
          val_indices = x_indices_initial[1:3]
          learner.initialize_data(x_indices_initial, y_initial, x_indices_validation=val_indices) # use half indices for validation
     iter_results_dict[0], iter_preds_dict[0] = evaluate(learner, train[x_indices_initial], test_sets, x_indices_initial)
     return learner, x_indices_initial, iter_results_dict, iter_preds_dict 
    

    The error I am getting is fit() got an unexpected keyword argument 'validation_set'. Digging into the code, it seems like if you pass x_indices_validation as not None this shouldn't happen.

    Do you have any suggestions?

    opened by HannahKirk 2
  • arrays doesn't match.

    arrays doesn't match.

    I tried multi classification, but the following error occurs when training. any solution?.

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-97-34924934fd19> in <module>
          1 logging.getLogger('small_text').setLevel(logging.INFO)
    ----> 2 main()
    
    <ipython-input-96-e3cc4fd7354b> in main()
         30         for i in range(20):
         31             # ...where each iteration consists of labelling 20 samples
    ---> 32             q_indices = active_learner.query(num_samples=20, x=train)
         33 
         34             # Simulate user interaction here. Replace this for real-world usage.
    
    /opt/anaconda3/envs/small_text/lib/python3.7/site-packages/small_text-1.0.0a4-py3.7.egg/small_text/active_learner.py in query(self, num_samples, x, query_strategy_kwargs)
        175 
        176         self.mask = np.ones(size, bool)
    --> 177         self.mask[np.concatenate([self.x_indices_labeled, self.x_indices_ignored])] = False
        178         indices = np.arange(size)
        179 
    
    <__array_function__ internals> in concatenate(*args, **kwargs)
    
    ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 0 dimension(s)
    
    opened by aditya624 2
  • Query strategy that includes selecting high/medium certainty examples

    Query strategy that includes selecting high/medium certainty examples

    Feature description

    The existing query strategies mostly seem to select data the model is particularly uncertain about (high entropy, ties, least confident ...). Are there other query strategies that also mix some data points into the training pool where the model is more certain?

    Motivation

    Many use-cases I work on deal with noisy data. So after a model has obtained a certain quality, query strategies that only select uncertain examples can actually select data that is of low quality. Instead, it would be good to have a way of also adding some high or medium certainty examples to the training pool. The idea is that this helps the model get some good, not-so-difficult examples to help it learn the task - instead of always feeding it very difficult and potentially noisy/wrong data points that can hurt performance.

    This is also an important use-case for zero-shot or few-shot models (like the Hugging Face zero-pipeline), which are getting more and more popular. They already have decent accuracy for the task and selecting highly uncertain examples can actually hurt the training process by selecting noise / examples that are inherently uncertain.

    Addition comments

    I really like your library and planning on using it for my research in the coming months :)

    feature request 
    opened by MoritzLaurer 6
  • LightweightCoreset should be batched

    LightweightCoreset should be batched

    Feature description

    The lightweight_coreset function should compute the distances in batches similar to greedy_coreset. Therefore a batch_size kwarg needs to be added and integrated into the function in the same manner. This keyword must also be added to LightweightCoreset (query strategy) and passed in the function call (similar to GreedyCoreset).

    Motivation

    This will reduce max memory used and, moreover, will align the lightweight and greedy coreset implementations.

    Addition comments

    Everything that needs to be adapted is currently located under small_text.query_strategies.coresets.

    feature request good first issue 
    opened by chschroeder 0
  • Pass local_files_only kwarg in TransformerBasedClassification

    Pass local_files_only kwarg in TransformerBasedClassification

    Feature description

    Provide a way to set local_files_only in TransformerBasedClassification. https://github.com/huggingface/transformers/issues/2867

    Motivation

    The integration tests are too slow and a majority of the time can be avoided with this setting. Moreover, in environments without an internet connection the current state will fail.

    Addition comments

    feature request 
    opened by chschroeder 2
  • Mulitlabel: Clf.predict(return_proba=True) only returns probabilities for labels over the threshold

    Mulitlabel: Clf.predict(return_proba=True) only returns probabilities for labels over the threshold

    Some query strategies require the probabilities for all labels of a sample, currently only probabilities for successfully predicted labels are returned.

    feature request good first issue 
    opened by KimBue 0
  • Setting up a PoolBaseActiveLearner without initialization.

    Setting up a PoolBaseActiveLearner without initialization.

    Hi, I am training a transformers model in a separate script over a pre-defined training set. I want to then use this classifier to query examples from the unlabelled pool. I can load the trained model from pre-trained pytorch model files or from PoolBasedActiveLearner.load('test-model/active_leaner.pkl').

    However, I then don't want to initialise this model as it has already been trained on a portion of the labelled data. Is it possible to still query over data i.e. learner.query() without running the initialization step learner.initialize_data(x_indices_train, y_train, x_indices_validation=val_indices)?

    Alternatively is it possible to still run this initialisation step but without running any training, i.e. just ignoring all indices for initialisation or setting the number of initialisation examples to zero in x_indices_initial = random_initialization(y_train, n_samples=0).

    Really appreciate your help on this one!

    Thanks :)

    documentation 
    opened by HannahKirk 10
  • active_learner.save('active_leaner.pkl'), can't pickle _abc_data objects

    active_learner.save('active_leaner.pkl'), can't pickle _abc_data objects

    Hi,

    I've trained an active_learner object, now trying to save it to file.

    According to the doc: https://small-text.readthedocs.io/en/latest/patterns/serialization.html active_learner.save('active_leaner.pkl') should work but I get the following error:

    TypeError                                 Traceback (most recent call last)
    <ipython-input-79-3c088eb07e76> in <module>()
          1 
    ----> 2 active_learner.save(f"{DIR}/results/active_leaner.pkl")
    
    22 frames
    /usr/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        522             reduce = getattr(obj, "__reduce_ex__", None)
        523             if reduce is not None:
    --> 524                 rv = reduce(self.proto)
        525             else:
        526                 reduce = getattr(obj, "__reduce__", None)
    
    TypeError: can't pickle _abc_data objects
    

    I can extract the transformer model and save that instead using active_learner.classifier.model.save_pretrained(f"{directory}") but not using active_learner.save()

    bug 
    opened by HannahKirk 6
Releases(v1.1.1)
  • v1.1.1(Oct 14, 2022)

  • v1.1.0(Oct 1, 2022)

    This release adds a conda package, more convenient imports, and improves many aspects of the classifcation functionality. Moreover, one new query strategy and three stopping criteria have been added.

    Added

    General

    • Small-Text package is now available via conda-forge.
    • Imports have been reorganized. You can import all public classes and methods from the top-level package (small_text):
      from small_text import PoolBasedActiveLearner
      

    Classification

    • All classifiers now support weighting of training samples.
    • Early stopping has been reworked, improved, and documented (#18).
    • Model selection has been reworked and documented.
    • [!] KimCNNClassifier.__init()__: The default value of the (now deprecated) keyword argument early_stopping_acc has been changed from 0.98 to -1 in order to match TransformerBasedClassification.
    • [!] Removed weight renormalization after gradient clipping.

    Datasets

    • The target_labels keyword argument in __init()__ will now raise a warning if not passed.
    • Added from_arrays() to SklearnDataset, PytorchTextClassificationDataset, and TransformersDataset to construct datasets more conveniently.

    Query Strategies

    Stopping Criteria

    Deprecated

    • small_text.integrations.pytorch.utils.misc.default_tensor_type() is deprecated without replacement (#2).
    • TransformerBasedClassification and KimCNNClassifier: The keyword arguments for early stopping (early_stopping / early_stopping_no_improvement, early_stopping_acc) that are passed to __init__() are now deprecated. Use the early_stopping keyword argument in the fit() method instead (#18).

    Fixed

    Classification

    • KimCNNClassifier.fit() and TransformerBasedClassification.fit() now correctly process the scheduler keyword argument (#16).

    Removed

    • Removed the strict check that every target label has to occur in the training data. (This is intended for multi-label settings with many labels; apart from that it is still recommended to make sure that all labels occur.)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Sep 12, 2022)

    Minor bug fix release.

    Fixed

    Links to notebooks and code examples will now always point to the latest release instead of the latest main branch.

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Jun 14, 2022)

    This is the first stable release 🎉! The release mainly consists of code cleanup, documentation, and repository organization.

    • Datasets:
      • SklearnDataset now checks if the dimensions of features and labels match.
    • Query Strategies:
    • Documentation:
      • The html documentation uses the full screen width.
    • Repository:
      • This repository can now be referenced using the respective Zenodo DOI.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0b4(May 4, 2022)

    This release adds two no query strategies, improves the Dataset interface, and introduces optional dependencies.

    Added

    • General:
      • We now have a concept for optional dependencies which allows components to rely on soft dependencies, i.e. python dependencies which can be installed on demand (and only when certain functionality is needed).
    • Datasets:
      • The Dataset interface now has a clone() method that creates an identical copy of the respective dataset.
    • Query Strategies:

    Changed

    • Datasets:
      • Separated the previous DatasetView implementation into interface (DatasetView) and implementation (SklearnDatasetView).
      • Added clone() method which creates an identical copy of the dataset.
    • Query Strategies:
      • EmbeddingBasedQueryStrategy now only embeds instances that are either in the label or in the unlabeled pool (and no longer the entire dataset).
    • Code examples:
      • Code structure was unified.
      • Number of iterations can now be passed via an cli argument.
    • small_text.integrations.pytorch.utils.data:
      • Method get_class_weights() now scales the resulting multi-class weights so that the smallest class weight is equal to 1.0.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0b3(Mar 6, 2022)

    This release adds a new query strategy, improves the docs, and cleans up the interfaces in preparation of v1.0.0.

    Added

    Changed

    • Cleaned up and unified argument naming: The naming of variables related to datasets and indices has been improved and unified. The naming of datasets had been inconsistent, and the previous x_ notation for indices was a relict of earlier versions of this library and did not reflect the underlying object anymore.

      • PoolBasedActiveLearner:

        • attribute x_indices_labeled was renamed to indices_labeled
        • attribute x_indices_ignored was unified to indices_ignored
        • attribute queried_indices was unified to indices_queried
        • attribute _x_index_to_position was named to _index_to_position
        • arguments x_indices_initial, x_indices_ignored, and x_indices_validation were renamed to indices_initial, indices_ignored, and indices_validation. This affects most methods of the PoolBasedActiveLearner.
      • QueryStrategy

        • old: query(self, clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10)
        • new: query(self, clf, dataset, indices_unlabeled, indices_labeled, y, n=10)
      • StoppingCriterion

        • old: stop(self, active_learner=None, predictions=None, proba=None, x_indices_stopping=None)
        • new: stop(self, active_learner=None, predictions=None, proba=None, indices_stopping=None)
    • Renamed environment variable which sets the small-text temp folder from ALL_TMP to SMALL_TEXT_TEMP

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0b2(Feb 22, 2022)

    This release fixes some broken links which were caused due to the recent change in naming the git tags (1.0.0a8 -> v1.0.0b1).

    Fixed

    • Fix links to the documentation in README.md and notebooks.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0b1(Feb 22, 2022)

    First beta release with multi-label functionality and stopping criteria. Added/revised large parts of the documentation.

    Added

    • Added a changelog.
    • All provided classifiers are now capable of multi-label classification.

    Changed

    • Documentation has been overhauled considerably.
    • PoolBasedActiveLearner: Renamed incremental_training kwarg to reuse_model.
    • SklearnClassifier: Changed __init__(clf) to __init__(model, num_classes, multi_Label=False)
    • SklearnClassifierFactory: __init__(clf_template, kwargs={}) to __init__(base_estimator, num_classes, kwargs={}).
    • Refactored KimCNNClassifier and TransformerBasedClassification.

    Removed

    • Removed device kwarg from PytorchDataset.__init__(), PytorchTextClassificationDataset.__init__() and TransformersDataset.__init__().
    Source code(tar.gz)
    Source code(zip)
Owner
Webis
Web Technology & Information Systems Group (Webis Group)
Webis
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
ETM - R package for Topic Modelling in Embedding Spaces

ETM - R package for Topic Modelling in Embedding Spaces This repository contains an R package called topicmodels.etm which is an implementation of ETM

bnosac 37 Nov 06, 2022
自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器

ja-timex 自然言語で書かれた時間情報表現を抽出/規格化するルールベースの解析器 概要 ja-timex は、現代日本語で書かれた自然文に含まれる時間情報表現を抽出しTIMEX3と呼ばれるアノテーション仕様に変換することで、プログラムが利用できるような形に規格化するルールベースの解析器です。

Yuki Okuda 116 Nov 09, 2022
File-based TF-IDF: Calculates keywords in a document, using a word corpus.

File-based TF-IDF Calculates keywords in a document, using a word corpus. Why? Because I found myself with hundreds of plain text files, with no way t

Jakob Lindskog 1 Feb 11, 2022
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
test

Lidar-data-decode In this project, you can decode your lidar data frame(pcap file) and make your own datasets(test dataset) in Windows without any hug

46 Dec 05, 2022
Code examples for my Write Better Python Code series on YouTube.

Write Better Python Code This repository contains the code examples used in my Write Better Python Code series published on YouTube: https:/

858 Dec 29, 2022
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single 16 GB VRAM V100 Google Cloud instance with Huggingfa

289 Jan 06, 2023
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2

Google Research Datasets 52 Jun 21, 2022
Problem: Given a nepali news find the category of the news

Classification of category of nepali news catorgory using different algorithms Problem: Multiclass Classification Approaches: TFIDF for vectorization

pudasainishushant 2 Jan 09, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
Lattice methods in TensorFlow

TensorFlow Lattice TensorFlow Lattice is a library that implements constrained and interpretable lattice based models. It is an implementation of Mono

504 Dec 20, 2022