skweak: A software toolkit for weak supervision applied to NLP tasks

Last update: Dec 28, 2022

Overview

skweak: Weak supervision for NLP

Labelled data remains a scarce resource in many practical NLP scenarios. This is especially the case when working with resource-poor languages (or text domains), or when using task-specific labels without pre-existing datasets. The only available option is often to collect and annotate texts by hand, which is expensive and time-consuming.

skweak (pronounced /skwi:k/) is a Python-based software toolkit that provides a concrete solution to this problem using weak supervision. skweak is built around a very simple idea: Instead of annotating texts by hand, we define a set of labelling functions to automatically label our documents, and then aggregate their results to obtain a labelled version of our corpus.

The labelling functions may take various forms, such as domain-specific heuristics (like pattern-matching rules), gazetteers (based on large dictionaries), machine learning models, or even annotations from crowd-workers. The aggregation is done using a statistical model that automatically estimates the relative accuracy (and confusions) of each labelling function by comparing their predictions with one another.

skweak can be applied to both sequence labelling and text classification, and comes with a complete API that makes it possible to create, apply and aggregate labelling functions with just a few lines of code. The toolkit is also tightly integrated with SpaCy, which makes it easy to incorporate into existing NLP pipelines. Give it a try!

Full Paper:
Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683.

Documentation & API: See the Wiki for details on how to use skweak.


Dependencies

spacy >= 3.0.0
hmmlearn >= 0.2.4
pandas >= 0.23
numpy >= 1.18

You also need Python >= 3.6.

Install

The easiest way to install skweak is through pip:

pip install skweak

or if you want to install from the repo:

pip install --user git+https://github.com/NorskRegnesentral/skweak

The above installation only includes the core library (not the additional examples in the examples directory).

Basic Overview

Weak supervision with skweak goes through the following steps:

Start: First, you need raw (unlabelled) data from your text domain. skweak is built on top of SpaCy, and operates with SpaCy Doc objects, so you first need to convert your documents to Doc objects using SpaCy.
Step 1: Then, we need to define a range of labelling functions that will take those documents and annotate spans with labels. Those labelling functions can come from heuristics, gazetteers, machine learning models, etc. See the Wiki for more details.
Step 2: Once the labelling functions have been applied to your corpus, you need to aggregate their results in order to obtain a single annotation layer (instead of the multiple, possibly conflicting annotations from the labelling functions). This is done in skweak using a generative model that automatically estimates the relative accuracy and possible confusions of each labelling function.
Step 3: Finally, based on those aggregated labels, we can train our final model. Step 2 gives us a labelled corpus that (probabilistically) aggregates the outputs of all labelling functions, and you can use this labelled data to estimate any kind of machine learning model. You are free to use whichever model/framework you prefer.
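
To make Step 2 more concrete, here is a minimal sketch of the simplest possible aggregation, a per-token majority vote, written in plain Python. It is not the generative model that skweak uses (which additionally estimates the accuracy and confusions of each labelling function). The function name majority_vote and the input layout are assumptions made for illustration only.

```python
from collections import Counter

def majority_vote(token_votes, default="O"):
    """Aggregate per-token votes from several labelling functions.

    token_votes: one list per token, holding the labels proposed by the
    labelling functions that fired on that token (hypothetical layout).
    Tokens without any vote receive the default label.
    """
    aggregated = []
    for votes in token_votes:
        if not votes:
            aggregated.append(default)
        else:
            # most_common(1) returns [(label, count)]; ties go to the
            # label that was seen first
            aggregated.append(Counter(votes).most_common(1)[0][0])
    return aggregated

# Votes for the tokens "Donald", "Trump", "paid", "$", "750"
votes = [["PERSON", "PERSON"], ["PERSON", "ORG", "PERSON"], [], ["MONEY"], ["MONEY"]]
labels = majority_vote(votes)
```

A majority vote treats every labelling function as equally reliable, which is exactly the assumption that the generative model in Step 2 relaxes.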

Quickstart

Here is a minimal example with three labelling functions (LFs) applied on a single document:

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils

# LF 1: heuristic to detect occurrences of MONEY entities
def money_detector(doc):
    for tok in doc[1:]:
        if tok.text[0].isdigit() and tok.nbor(-1).is_currency:
            yield tok.i-1, tok.i+1, "MONEY"
lf1 = heuristics.FunctionAnnotator("money", money_detector)

# LF 2: detection of years with a regex
lf2 = heuristics.TokenConstraintAnnotator("years", lambda tok: re.match(r"(19|20)\d{2}$", tok.text), "DATE")

# LF 3: a gazetteer with a few names
NAMES = [("Barack", "Obama"), ("Donald", "Trump"), ("Joe", "Biden")]
trie = gazetteers.Trie(NAMES)
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":trie})

# We create a corpus (here with a single text)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Donald Trump paid $750 in federal income taxes in 2016")

# apply the labelling functions
doc = lf3(lf2(lf1(doc)))

# and aggregate them
hmm = aggregation.HMM("hmm", ["PERSON", "DATE", "MONEY"])
hmm.fit_and_aggregate([doc])

# we can then visualise the final result (in Jupyter)
utils.display_entities(doc, "hmm")

Obviously, to get the most out of skweak, you will need more than three labelling functions. And, most importantly, you will need a larger corpus including as many documents as possible from your domain, so that the model can derive good estimates of the relative accuracy of each labelling function.

Documentation

See the Wiki.

License

skweak is released under an MIT License.

The MIT License is a short and simple permissive license allowing both commercial and non-commercial use of the software. The only requirement is to preserve the copyright and license notices (see file License). Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Citation

See our paper describing the framework:

Pierre Lison, Jeremy Barnes and Aliaksandr Hubin (2021), "skweak: Weak Supervision Made Easy for NLP", arXiv:2104.09683

@misc{lison2021skweak,
      title={skweak: Weak Supervision Made Easy for NLP}, 
      author={Pierre Lison and Jeremy Barnes and Aliaksandr Hubin},
      year={2021},
      eprint={2104.09683},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

Label Function Analysis

First of all, thanks for open sourcing such an awesome project!

Our team has been playing around with skweak for a sequential labeling task, and we were wondering if there were any plans in the roadmap to include tooling that helps practitioners understand the "impact" of their label functions statistically.

Snorkel for example, provides a LF Analysis tool to understand how one's label functions apply to a dataset statistically (e.g., coverage, overlap, conflicts). Similar functionality would be tremendously helpful in gauging the efficacy of one's label functions for each class in a sequential labeling problem.

Are there any plans to add such functionality down the line as a feature enhancement?
enhancement

opened by schopra8 20
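
On the label function analysis question above: the statistics mentioned (coverage, overlap, conflicts) can be approximated by hand at token level. The following is a minimal sketch in plain Python. The function name lf_summary and the input layout (one dictionary from token index to label per labelling function) are assumptions made for illustration and are not part of the skweak API.

```python
def lf_summary(lf_outputs, n_tokens):
    """Token-level coverage, overlap and conflict for each labelling function.

    lf_outputs: {lf_name: {token_index: label}} (hypothetical layout).
    With SpaCy, such a dictionary could be filled by iterating over
    doc.spans[lf_name] and over the tokens of each span.
    """
    summary = {}
    for name, labels in lf_outputs.items():
        overlaps = conflicts = 0
        for idx, label in labels.items():
            # Labels that the other functions assigned to the same token
            others = [out[idx] for other, out in lf_outputs.items()
                      if other != name and idx in out]
            if others:
                overlaps += 1
            if any(o != label for o in others):
                conflicts += 1
        summary[name] = {
            "coverage": len(labels) / n_tokens,
            "overlap": overlaps / n_tokens,
            "conflict": conflicts / n_tokens,
        }
    return summary

outputs = {
    "gazetteer": {0: "PERSON", 1: "PERSON"},
    "regex": {1: "ORG", 9: "DATE"},
}
stats = lf_summary(outputs, n_tokens=10)
```

Here every fraction is taken over all tokens; dividing by the number of tokens each function labelled instead would give per-function conflict rates.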
Tokens with no possible state

I very often get the error of this line that there is a "problem with token X", causing HMM training to be aborted after only a couple of documents in the very first iteration.

I found out that this is due to framelogprob having all -np.inf for the token in question. So I checked what happens in self._compute_log_likelihood for the respective document and found that this document had only one labeling function firing and X[source] in this line was all False for the first token (or state?).

This means that this token/state is also all masked with -np.inf in logsum in this line.

Now, I am unsure how to fix that. This clearly does not look like the desired behavior but I suppose "testing for tokens with no possible states" is there for a reason. Can I simply replace -np.inf in self._compute_log_likelihood with -100000 ? Then, of course, the test will not fail and not abort training but there will be a token with only very improbable states. Is that ok?

Or is that the wrong approach? Should tokens without observed labels from the labeling functions rather get a default label (e.g., O)? So why is that not done here? Is it a bug? I am not sure where I should look for a bug, if there is one. Can someone with a better knowledge of the code base give some advice on this?

opened by mnschmit 10
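
To illustrate the situation described in the issue above, here is a small diagnostic sketch in plain Python. It finds tokens whose log-probabilities are -inf for every state and shows one of the options raised in the question, namely falling back to a default state. This is not skweak's own fix: the helper names, the list-of-lists layout, and the choice of state 0 as the default (for instance O) are all assumptions.

```python
import math

NEG_INF = -math.inf

def tokens_without_state(framelogprob):
    """Indices of tokens whose log-probability is -inf for every state."""
    return [i for i, row in enumerate(framelogprob)
            if all(v == NEG_INF for v in row)]

def with_default_state(framelogprob, default_index=0):
    """Copy of the matrix where stuck tokens get log-probability 0.0
    (probability 1) on an assumed default state, e.g. "O"."""
    fixed = [list(row) for row in framelogprob]
    for i in tokens_without_state(framelogprob):
        fixed[i][default_index] = 0.0
    return fixed

# Two tokens, three states; the second token has no possible state
frame = [[-0.3, NEG_INF, -1.6], [NEG_INF, NEG_INF, NEG_INF]]
stuck = tokens_without_state(frame)
repaired = with_default_state(frame)
```

Whether forcing a default state is statistically appropriate depends on the labelling functions, so this is best read as a debugging aid rather than a recommendation.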
_do_forward_pass, _do_backward_pass, _compute_posteriors not defined in skweak.aggregation

skweak/aggregation.py", line 405, in fit
    logprob, fwdlattice = self._do_forward_pass(framelogprob)
AttributeError: 'HMM' object has no attribute '_do_forward_pass'

opened by ManuBohra 10
TypeError: unhashable type: 'list'

When applying a config file in order to train a textcat model with the following command:

!spacy init config - --lang en --pipeline ner --optimize accuracy | \
spacy train - --paths.train ./train.spacy --paths.dev ./train.spacy \
--initialize.vectors en_core_web_md --output train

I receive following error message:

[i] Saving to output directory: train
[i] Using CPU

=========================== Initializing pipeline ===========================
2022-03-27 15:49:59.778883: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.778913: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-03-27 15:49:59.798942: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-03-27 15:49:59.798976: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[2022-03-27 15:50:05,376] [INFO] Set up nlp object from config
[2022-03-27 15:50:05,395] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-03-27 15:50:05,395] [INFO] Created vocabulary
[2022-03-27 15:50:07,968] [INFO] Added vectors: en_core_web_md
[2022-03-27 15:50:08,292] [INFO] Finished initializing nlp object
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\spacy.exe_main.py", line 7, in
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli_util.py", line 71, in setup_cli
    command(prog_name=COMMAND)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 829, in call
    return self.main(*args, **kwargs)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1259, in invoke
    return process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\typer\main.py", line 497, in wrapper
    return callback(**use_params) # type: ignore
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 45, in train_cli
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\cli\train.py", line 72, in train
    nlp = init_nlp(config, use_gpu=use_gpu)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\initialize.py", line 84, in init_nlp
    nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\language.py", line 1308, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\pipeline\tok2vec.py", line 215, in initialize
    validate_get_examples(get_examples, "Tok2Vec.initialize")
  File "spacy\training\example.pyx", line 65, in spacy.training.example.validate_get_examples
  File "spacy\training\example.pyx", line 44, in spacy.training.example.validate_examples
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 142, in call
    for real_eg in examples:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 164, in make_examples
    for reference in reference_docs:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\training\corpus.py", line 199, in read_docbin
    for doc in docs:
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_serialize.py", line 150, in get_docs
    doc.spans.from_bytes(self.span_groups[i])
  File "C:\Users\49176\AppData\Roaming\Python\Python39\site-packages\spacy\tokens_dict_proxies.py", line 54, in from_bytes
    group = SpanGroup(doc).from_bytes(value_bytes)
  File "spacy\tokens\span_group.pyx", line 170, in spacy.tokens.span_group.SpanGroup.from_bytes
  File "C:\ProgramData\Anaconda3\lib\site-packages\srsly_msgpack_api.py", line 27, in msgpack_loads
    msg = msgpack.loads(data, raw=False, use_list=use_list)
  File "C:\ProgramData\Anaconda3\lib\site-packages\srsly\msgpack_init.py", line 79, in unpackb
    return _unpackb(packed, **kwargs)
  File "srsly\msgpack_unpacker.pyx", line 191, in srsly.msgpack._unpacker.unpackb
TypeError: unhashable type: 'list'

Seems like a dependency issue. What is the reason for it? And is there a way to fix it?

Also: is the following error message a problem? "[E1010] Unable to set entity information for token 10 which is included in more than one span in entities, blocked, missing or outside." Or can it be avoided by simply applying the following?

for document in train_data:
    try:
        document.ents = document.spans["hmm"]
        skweak.utils.docbin_writer(train_data, "train.spacy")
    except Exception as e:
        print(e)

opened by AlineBornschein 6
TypeError when nothing is found in a document

Hi! I'm getting an exception from fit_and_aggregate. TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'. The exception is from line 227 in aggregation.py, np.apply_along_axis(...)

This seems to happen when all of my labeling functions return empty on one of the docs so the DataFrame is empty.

opened by oholter 6
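
One possible workaround for the empty-document error described above, assuming the failure is indeed triggered by documents on which no labelling function fired, is to filter those documents out before calling fit_and_aggregate. The sketch below uses stand-in objects so that it runs without SpaCy; the helper name has_annotations is hypothetical, and the only assumption about a document is that doc.spans maps source names to sequences of spans, as it does for SpaCy Doc objects.

```python
from types import SimpleNamespace

def has_annotations(doc, sources=None):
    """True if at least one labelling function produced a span for doc.

    sources: optional list of source names to consider; by default all
    span groups present in doc.spans are checked.
    """
    names = list(doc.spans) if sources is None else sources
    return any(name in doc.spans and len(doc.spans[name]) > 0 for name in names)

# Stand-ins for SpaCy Doc objects, used here only to exercise the helper
docs = [
    SimpleNamespace(spans={"money": ["$750"], "years": []}),
    SimpleNamespace(spans={"money": [], "years": []}),
    SimpleNamespace(spans={}),
]
kept = [doc for doc in docs if has_annotations(doc)]
```

Only the documents in kept would then be passed to the aggregation model; the skipped ones can be logged to check which labelling functions are missing coverage.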
Error in MultilabelNaiveBayes

I am using Skweak Multilabel for classification and I am getting the following error message - RuntimeError: No valid state found at position 0

I aggregated LFs using CombinedAnnotator, then initialized MultilabelNaiveBayes - MultilabelNaiveBayes("skweak_preds",final_label_list) and then trained the model - skweak_model.fit(d2s)

Any help in fixing this is appreciated. Thanks!

opened by sujeethrv 5
Converting .spacy files to conll format to train other models on it.

Once I fit the aggregation model on the data, I used skweak's function to write it as a DocBin file, which gets saved as a .spacy file. How do I convert this into a normal CoNLL format file? Are there any libraries or tools that can do that?

opened by Akshay0799 5
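
On converting aggregated annotations to a CoNLL-style file: the core of the conversion is turning spans into one BIO tag per token. The following is a minimal sketch in plain Python that produces a simplified two-column layout (token, tab, tag) rather than the full four-column CoNLL-2003 format. The function name to_conll is hypothetical, and spans are assumed to be non-overlapping with exclusive end offsets.

```python
def to_conll(tokens, spans):
    """Render one document as 'token<TAB>BIO-tag' lines.

    tokens: list of token strings.
    spans: list of (start, end, label) token offsets, end exclusive.
    With SpaCy, these could be obtained from a Doc as
      tokens = [t.text for t in doc]
      spans = [(s.start, s.end, s.label_) for s in doc.spans["hmm"]]
    after loading the docs with DocBin().from_disk(path).get_docs(nlp.vocab).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))

conll = to_conll(["Donald", "Trump", "paid", "$", "750"],
                 [(0, 2, "PERSON"), (3, 5, "MONEY")])
```

Documents are conventionally separated by an empty line when several of them are written to the same file.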

Gazetteer is not working with single tokens

Hello.

Can't get why gazetteer doesn't match single name 'Barack'?

import spacy, re
from skweak import heuristics, gazetteers, aggregation, utils, base
nlp = spacy.load("en_core_web_sm", disable=["ner"])
doc = nlp('Barack Obama and Donald Trump')
NAMES = [("Barack"), ("Donald", "Trump")]
lf3 = gazetteers.GazetteerAnnotator("presidents", {"PERSON":gazetteers.Trie(NAMES)})
doc = lf3(doc)
print(doc.spans)

{'presidents': [Donald Trump]}

Any ideas?

Thanks for a remarkable lib!

opened by slavaGanzin 5
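
A likely cause of the behaviour reported above lies in Python syntax rather than in the gazetteer itself: ("Barack") is a plain string, because parentheses alone do not create a tuple, whereas ("Barack",) is a one-element tuple. The sketch below demonstrates this and normalises the entries. It assumes that the Trie expects every entry to be a sequence of tokens, which has not been verified against the skweak source; the helper name as_token_tuples is hypothetical.

```python
NAMES = [("Barack"), ("Donald", "Trump")]

# ("Barack") is just the string "Barack"; only a trailing comma makes a tuple
first_is_str = isinstance(NAMES[0], str)

def as_token_tuples(entries):
    """Wrap bare strings into one-element tuples so that every entry
    is a sequence of tokens."""
    return [(e,) if isinstance(e, str) else tuple(e) for e in entries]

FIXED = as_token_tuples(NAMES)
```

Under that assumption, passing FIXED (or writing ("Barack",) directly) to gazetteers.Trie should allow the single-token name to match.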

[Question] Underspecified Labels w/ out Fine-Grained Label
Context

I'm training an NER model using the HMM aggregator.

I have 2 label classes [A, B] and an under-specified label [C] which is a super-class of A and B within my ontology.

I have 3-sets of gazetteer label functions - one set for A, one set for B, and one set for C.

Issue

When training the HMM, I have tokens which are annotated by label functions for C (superclass) but are not annotated by label functions for A and B (e.g., the term "Apple" is being labeled as an ENT but is not being captured by the LFs for PER or PROD).

Currently I'm calling the HMM function as follows:

hmm = aggregation.HMM("hmm", [A, B], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)

This triggers an error from the below aggregation code, since all probability mass is being placed on a label that was not included in the HMM (i.e., the under-specified label C). https://github.com/NorskRegnesentral/skweak/blob/0613f20b9c8be3f22553e303ec22c72dea1f206a/skweak/aggregation.py#L397-L401

Question(s)

Should I be including the under-specified label as a possible label option in the HMM?

hmm = aggregation.HMM("hmm", [A, B, C], sequence_labelling=True)
hmm.add_underspecified_label(C, [A, B])
_ = hmm.fit_and_aggregate(annotated_docs)

How are underspecified labels "learned" or trained differently vs. the "specified labels" (e.g., A, B in the example)?

Thanks in advance!
opened by schopra8 5
use Flair with skweak

Hello, has anyone here tried to use a model/framework other than SpaCy (NER) as a labelling function? I tried to work with Flair but it didn't work. Can anyone help me? Thanks in advance.

opened by Ihebzayen 4
Runtime error in display_entities

I am using the latest version of skweak: 0.2.17. I tried running the example (quick-start.ipynb) in the repo. When I try to execute

skweak.utils.display_entities(docs[28], "other_org_detector")

I get this error.

opened by latchukarthick98 3
Step by step NER alternative 2

Hello,

First of all, thank you for the library.

I'm kind of new to NER, and I'd like to know how the 2nd alternative of the NER process would be done, where a more sophisticated model is created, since I didn't find it in Step by Step NER.

opened by boskis222 0
minimal example not working

When I try to run the minimal example on the home page, an error appears: AttributeError: 'BaseHMM' object has no attribute '_do_forward_log_pass'

Am I missing something from the install or is it just pip install skweak?

opened by davidbetancur8 2
Support options in displacy.render

This is an enhancement request: display_entities could be a bit more flexible if it included options={} among its parameters. Ex: def display_entities(doc: Doc, layer=None, add_tooltip=False, options={}):

and then change the line below: html = spacy.displacy.render(doc2, jupyter=False, style="ent", manual=True, options=options)

That would extend the functionality of render when creating new entities.

Thanks for the great work with SKWEAK.

opened by lidiexy-palinode 0
Support for relation extraction

Right now, skweak supports two main types of NLP tasks: (token-level) sequence labelling and text classification. Both rest on the idea that labelling functions associate labels with text spans, and the role of the aggregation model is then to merge the outputs of those labelling functions so as to get unified predictions.

However, some NLP tasks cannot be easily associated with text spans. For instance, relation extraction requires a prediction on pairs of spans.

The question is then how to provide support for such types of tasks, for instance by implementing a RelationAnnotator that could be used to associate pairs of spans with a label.

Technically speaking, we could still encode the annotations internally as SpanGroup objects. One solution would be to only add one span of the pair in the SpanGroup, but then specify that this span is connected to a second span (SpanGroup objects allow the inclusion of JSON-serialised attributes). The method get_observation_df in the BaseAggregator class could then be extended to detect whether a span is a normal one, or is connected to a second span. If that is the case, the aggregation would then be done on pairs of spans instead of single spans.

Do get in touch if this functionality is something you need, so that we know whether we should prioritise this in our next release :-)
enhancement

opened by plison 4
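
As a purely hypothetical illustration of the proposal above, aggregation over pairs of spans could look like the sketch below. Nothing here exists in skweak: the RelationAnnotation class, its fields, and the majority vote (used as a stand-in for a proper aggregation model) are all assumptions made to show what aggregating on pairs rather than single spans would mean.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationAnnotation:
    source: str   # name of the labelling function (hypothetical field)
    head: tuple   # (start, end) token offsets of the first span
    tail: tuple   # (start, end) token offsets of the second span
    label: str    # relation label proposed for the pair

def aggregate_relations(annotations):
    """Majority vote over the labels proposed for each (head, tail) pair."""
    votes = defaultdict(Counter)
    for ann in annotations:
        votes[(ann.head, ann.tail)][ann.label] += 1
    return {pair: counter.most_common(1)[0][0]
            for pair, counter in votes.items()}

annotations = [
    RelationAnnotation("lf_a", (0, 2), (5, 6), "WORKS_FOR"),
    RelationAnnotation("lf_b", (0, 2), (5, 6), "WORKS_FOR"),
    RelationAnnotation("lf_c", (0, 2), (5, 6), "FOUNDED"),
]
result = aggregate_relations(annotations)
```

The key design point is that the unit of aggregation becomes the pair of spans, so two labelling functions only vote against each other when they annotate the same pair.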
Regression-based outcome

Hello, thank you for sharing this repo. Do you have plans for providing capability for a regression-based outcome? Something along the lines of fine-grained sentiment on a scale from 1-5?
enhancement

opened by dmracek 1

Releases (0.3.1)

0.3.1 (Mar 25, 2022)
Brand new version of skweak, including both a number of bug fixes and some new functionalities:

skweak is now using the latest version of hmmlearn, thereby fixing a number of errors due to a mismatch between method names

We now have a clearer split between aggregation models for sequence labelling and for text classification. Possible aggregators for sequence labelling are SequentialMajorityVoter and HMM (preferred), while the aggregators for non-sequential text classification are MajorityVoter and NaiveBayes.

We also introduce a brand new functionality: multi-label classification! Instead of assuming that all labels are mutually exclusive, you can now aggregate the results of labelling functions without assuming that only one label is correct. This multi-label scheme is available for both sequence labelling (see MultilabelSequentialMajorityVoter and MultilabelHMM) and text classification (see MultilabelMajorityVoter and MultilabelNaiveBayes).

By default, all labels can be simultaneously true for a given data point, but you can enforce exclusivity relations between labels through the method set_exclusive_labels. If all labels are set to be mutually exclusive, the aggregation is equivalent to a standard multi-class setup. Internally, this functionality is implemented by constructing and fitting separate aggregation models for each label.

The code for the aggregation models has also been heavily refactored, making it hopefully easier to create new aggregation models.
0.2.8 (Apr 19, 2021)

First official release of skweak, with support for both sequence labelling and text classification! See the documentation for details.
btc.tar.gz(63.68 MB)
conll2003.spacy(4.56 MB)
conll2003.tar.gz(63.63 MB)
crunchbase.json.gz(8.55 MB)
muc6.spacy(3.37 MB)
norec.conllu.tar.gz(161.68 MB)
reuters_small.spacy(1.54 MB)
reuters_small.tar.gz(190.15 KB)
wikidata_small_tokenised.json.gz(11.36 MB)
wikidata_tokenised.json.gz(21.06 MB)

Owner

Norsk Regnesentral (Norwegian Computing Center)

Norwegian Computing Center is a private foundation performing research in statistical modeling, machine learning and information/communication technology
