Overview

BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports guided, (semi-) supervised, and dynamic topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found here and here.

Installation

Installation, with sentence-transformers, can be done using PyPI:

pip install bertopic

You may want to install additional dependencies depending on the transformers and language backends you will be using. The possible installations are:

pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:

Name Link
Topic Modeling with BERTopic Open In Colab
(Custom) Embedding Models in BERTopic Open In Colab
Advanced Customization in BERTopic Open In Colab
(semi-)Supervised Topic Modeling with BERTopic Open In Colab
Dynamic Topic Modeling with Trump's Tweets Open In Colab
Topic Modeling arXiv Abstracts Kaggle

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
0	693	49_windows_drive_dos_file
1	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
3	381	22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
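
Once fitted, the same model can assign topics to documents it has not seen before through .transform (see the methods overview further below). A minimal sketch; the example document here is made up:

new_doc = "The new graphics drivers keep crashing my operating system"
new_topics, new_probs = topic_model.transform([new_doc])
# new_topics contains the predicted topic id for each passed document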

Visualize Topics

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

We can create an overview of the most frequent topics in a way that they are easily interpretable. Horizontal barcharts typically convey information rather well and allow for an intuitive representation of the topics:

topic_model.visualize_barchart()

Find all possible visualizations with interactive examples in the documentation here.
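
Since these visualization methods return Plotly figures (see the v0.5.0 release notes), the result can also be saved for sharing. A small sketch assuming Plotly's write_html, with an arbitrary file name:

fig = topic_model.visualize_topics()
fig.write_html("topics.html")  # save the interactive visualization to an HTML file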

Embedding Models

BERTopic supports many embedding models that can be used to embed the documents and words:

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE

Sentence-Transformers is typically used as it has shown great performance in embedding documents for semantic similarity. Simply select any model from their documentation here and pass it to BERTopic:

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

Flair allows you to choose almost any 🤗 transformers model. Simply select any from here and pass it to BERTopic:

from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

Click here for a full overview of all supported embedding models.

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented over time. Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and passing in his tweets, the corresponding timestamps, and the related topics:

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)

Overview

For quick access to common functions, here is an overview of BERTopic's main methods:

Method Code
Fit the model .fit(docs)
Fit the model and predict documents .fit_transform(docs)
Predict new documents .transform([new_doc])
Access single topic .get_topic(topic=12)
Access all topics .get_topics()
Get topic freq .get_topic_freq()
Get all topic information .get_topic_info()
Get representative docs per topic .get_representative_docs()
Get topics per class .topics_per_class(docs, topics, classes)
Dynamic Topic Modeling .topics_over_time(docs, topics, timestamps)
Update topic representation .update_topics(docs, topics, n_gram_range=(1, 3))
Reduce nr of topics .reduce_topics(docs, topics, nr_topics=30)
Find topics .find_topics("vehicle")
Save model .save("my_model")
Load model BERTopic.load("my_model")
Get parameters .get_params()
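
As a rough sketch of how several of the methods above chain together, reusing the docs from the Quick Start (the search term and file name are arbitrary):

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

topic_model.get_topic_info()                          # frequency and name of each topic
similar_topics, similarity = topic_model.find_topics("vehicle")  # topics most similar to a search term
topic_model.reduce_topics(docs, topics, nr_topics=30) # merge topics down to 30

topic_model.save("my_model")
loaded_model = BERTopic.load("my_model")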

For an overview of BERTopic's visualization methods:

Method Code
Visualize Topics .visualize_topics()
Visualize Topic Hierarchy .visualize_hierarchy()
Visualize Topic Terms .visualize_barchart()
Visualize Topic Similarity .visualize_heatmap()
Visualize Term Score Decline .visualize_term_rank()
Visualize Topic Probability Distribution .visualize_distribution(probs[0])
Visualize Topics over Time .visualize_topics_over_time(topics_over_time)
Visualize Topics per Class .visualize_topics_per_class(topics_per_class)
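
The distribution plot needs per-document topic probabilities, which are only computed when calculate_probabilities is enabled. A minimal sketch reusing the Quick Start docs:

topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
topic_model.visualize_distribution(probs[0])  # probability distribution of the first document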

Citation

To cite BERTopic in your work, please use the following BibTeX reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.9.4},
  doi          = {10.5281/zenodo.4381785},
  url          = {https://doi.org/10.5281/zenodo.4381785}
}
Comments
  • Github actions: ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

    The github actions workflow is suddenly giving me the following error:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

    It seems that it has most likely to do with numpy-based binary compatibility issues (some more info here). However, I cannot seem to fix it thus far with the suggested method (setting oldest-supported-numpy in pyproject.toml).

    If you have any idea, please follow along with the full discussions here. Any help is greatly appreciated!

    opened by MaartenGr 26
  • Train and Predict BERTopic

    Hi @MaartenGr ,

    As I understand BERTopic, fit_transform() is used to train the model while transform() is for prediction. Am I right? What is the best method to train the model on data from different sources, e.g. Twitter, Reddit, Facebook comments, etc.? I want to train the model once and use it for various datasets. Should I split the data into sentences, since some sources have very long comments (paragraphs), e.g. Reddit or news articles?

    Thanks

    opened by mjavedgohar 26
  • Memory inefficient algorithm and getting error while saving the model

    I was trying to train on 20 lakh (2 million) data points and tried many GPU instances on AWS with 16 GB, 32 GB, 64 GB, and 256 GB of RAM. All of them failed to train except the 256 GB instance, where training succeeded but I was unable to save the model.

    Below is the error I was getting while saving the model.

    topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)

    KeyError                                  Traceback (most recent call last)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
        482             # If key already exists, we will overwrite the file
    --> 483             data_name = overloads[key]
        484         except KeyError:
    KeyError: ((array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C), array(float32, 2d, C), type(CPUDispatcher(<function alternative_cosine at 0x7f3c3ca174d0>)), array(int64, 1d, C), float64), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('308c49885ad3c35a475c360e21af1359caa88c78eb495fa0f5e8c6676ae5019e', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))
    During handling of the above exception, another exception occurred:
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-32c887ac8b59> in <module>
          1 # Saving model
    ----> 2 topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)
          3 print("model saved")
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/bertopic/_bertopic.py in save(self, path, save_embedding_model)
       1201                 embedding_model = self.embedding_model
       1202                 self.embedding_model = None
    -> 1203                 joblib.dump(self, file)
       1204                 self.embedding_model = embedding_model
       1205             else:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
        480             NumpyPickler(f, protocol=protocol).dump(value)
        481     else:
    --> 482         NumpyPickler(filename, protocol=protocol).dump(value)
        483 
        484     # If the target container is a file object, nothing is returned.
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in dump(self, obj)
        435         if self.proto >= 4:
        436             self.framer.start_framing()
    --> 437         self.save(obj)
        438         self.write(STOP)
        439         self.framer.end_framing()
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        547 
        548         # Save the reduce() output and finally memoize the object
    --> 549         self.save_reduce(obj=obj, *rv)
        550 
        551     def persistent_id(self, obj):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
        660 
        661         if state is not None:
    --> 662             save(state)
        663             write(BUILD)
        664 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        502         f = self.dispatch.get(t)
        503         if f is not None:
    --> 504             f(self, obj) # Call unbound method with explicit self
        505             return
        506 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
        857 
        858         self.memoize(obj)
    --> 859         self._batch_setitems(obj.items())
        860 
        861     dispatch[dict] = save_dict
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
        883                 for k, v in tmp:
        884                     save(k)
    --> 885                     save(v)
        886                 write(SETITEMS)
        887             elif n:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        547 
        548         # Save the reduce() output and finally memoize the object
    --> 549         self.save_reduce(obj=obj, *rv)
        550 
        551     def persistent_id(self, obj):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
        660 
        661         if state is not None:
    --> 662             save(state)
        663             write(BUILD)
        664 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        502         f = self.dispatch.get(t)
        503         if f is not None:
    --> 504             f(self, obj) # Call unbound method with explicit self
        505             return
        506 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
        857 
        858         self.memoize(obj)
    --> 859         self._batch_setitems(obj.items())
        860 
        861     dispatch[dict] = save_dict
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
        883                 for k, v in tmp:
        884                     save(k)
    --> 885                     save(v)
        886                 write(SETITEMS)
        887             elif n:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        522             reduce = getattr(obj, "__reduce_ex__", None)
        523             if reduce is not None:
    --> 524                 rv = reduce(self.proto)
        525             else:
        526                 reduce = getattr(obj, "__reduce__", None)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in __getstate__(self)
        900     def __getstate__(self):
        901         if not hasattr(self, "_search_graph"):
    --> 902             self._init_search_graph()
        903         if not hasattr(self, "_search_function"):
        904             if self._is_sparse:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
       1061                 self._distance_func,
       1062                 self.rng_state,
    -> 1063                 self.diversify_prob,
       1064             )
       1065         reverse_graph.eliminate_zeros()
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
        431                     e.patch_message('\n'.join((str(e).rstrip(), help_msg)))
        432             # ignore the FULL_TRACEBACKS config, this needs reporting!
    --> 433             raise e
        434 
        435     def inspect_llvm(self, signature=None):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
        364                 argtypes.append(self.typeof_pyval(a))
        365         try:
    --> 366             return self.compile(tuple(argtypes))
        367         except errors.ForceLiteralArg as e:
        368             # Received request for compiler re-entry with the list of arguments
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
         30         def _acquire_compile_lock(*args, **kwargs):
         31             with self:
    ---> 32                 return func(*args, **kwargs)
         33         return _acquire_compile_lock
         34 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in compile(self, sig)
        861                 raise e.bind_fold_arguments(folded)
        862             self.add_overload(cres)
    --> 863             self._cache.save_overload(sig, cres)
        864             return cres.entry_point
        865 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save_overload(self, sig, data)
        665         """
        666         with self._guard_against_spurious_io_errors():
    --> 667             self._save_overload(sig, data)
        668 
        669     def _save_overload(self, sig, data):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_overload(self, sig, data)
        675         key = self._index_key(sig, _get_codegen(data))
        676         data = self._impl.reduce(data)
    --> 677         self._cache_file.save(key, data)
        678 
        679     @contextlib.contextmanager
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
        490                     break
        491             overloads[key] = data_name
    --> 492             self._save_index(overloads)
        493         self._save_data(data_name, data)
        494 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_index(self, overloads)
        536     def _save_index(self, overloads):
        537         data = self._source_stamp, overloads
    --> 538         data = self._dump(data)
        539         with self._open_for_write(self._index_path) as f:
        540             pickle.dump(self._version, f, protocol=-1)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _dump(self, obj)
        564 
        565     def _dump(self, obj):
    --> 566         return pickle.dumps(obj, protocol=-1)
        567 
        568     @contextlib.contextmanager
    TypeError: can't pickle weakref objects
    
    
    opened by makkarss929 24
  • No loop matching the specified signature and casting was found for ufunc add

    Hi @MaartenGr, Thanks for releasing the new version of BERTopic with Guided Topic Modeling. However, I got an error message for my code

    seed_topic_list = [["flight", "air", "norwegian", "aircanada", "air canada", "sas", "stopover", "air france", "airline", "airport"],
                       ["car rental", "car", "rental center", "drover", "ecars", "cars", "car hire", "rent a car", "taxi", "cab", "ground", "chauffeur", "uber"],
                       ["room", "hotel night", "reception", "hotels", "hotel", "rooms","property", "properties", "accommodation"],
                       ["sncf", "sj", "railcard", "railway", "rail", "train", "trains"]]
    
    topic_model = BERTopic(seed_topic_list=seed_topic_list, calculate_probabilities=False)
    topics, probs= topic_model.fit_transform(data_de)
    

    The error is

    if self.seed_topic_list is not None and self.embedding_model is not None:
    --> 287             y, embeddings = self._guided_topic_modeling(embeddings)
    TypeError: No loop matching the specified signature and casting was found for ufunc add
    

    I don't think the error is caused by my "data_de", since it works well if I don't specify seed_topic_list. Any suggestions on fixing this error?

    opened by YuanyuanLi96 22
  • reduce_topics assigns many documents to -1

    From what I can see, from both experience and the code, reduce_topics() reassigns documents to -1 frequently. Is this the expected behavior? If I'm understanding the overall picture, topic clusters are selected based on the HDBSCAN results and documents are assigned to -1 based on a low likelihood of belonging to an identified cluster. Then these clusters are aggregated and a c-TF-IDF score is calculated for the entire topic. When doing the reduction, the cosine similarity of the topic being reduced is compared with all of the other topics and the topic is assigned to the most similar one. It seems counter-intuitive that a particular document sorted into a valid cluster by HDBSCAN can then be discounted per the similarity score during the reduction. It feels like there is a mismatch between doing the initial cluster assignment in a way that captures non-symmetric groupings but then using a Euclidean calculation to determine similarity and therefore topic assignment. While not perfect, wouldn't it be reasonable to omit -1 as a potential assignment?

    opened by drob-xx 21
  • topic extraction from 'Quick Start' taking forever

    Hi Maarten, I've been following your GitHub. I installed BERTopic using conda. Then I tried to replicate your Quick Start to see if it's working as expected:

    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    Then, at first I was getting the following (which goes on forever):

    runfile('/Users/nayza/Desktop/YTproject/AAHSA/addictionStudy_2.py', wdir='/Users/nayza/Desktop/YTproject/AAHSA')
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Traceback (most recent call last):

    But then I suspected that I needed to update my tokenizer, so I updated it from version 0.10.3 to 0.11.0. Now it no longer shows the 'Ignored unknown...' output, but it's taking forever to run. Plus, my Mac started to get really loud as well. Do you have an idea what the issue might be here?

    opened by nzaw96 21
  • Support of clustering plot (2D UMAP)

    Hi there, just wondering if the current version of BERTopic supports a 2D UMAP plot with clustering, like the first plot in the original post https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

    I didn't find such a plot in the documentation, but it could be rather useful for analyzing a document collection.

    opened by karelin 19
  • GPU error


    TypeError                                 Traceback (most recent call last)
    in
    ----> 1 from bertopic import BERTopic
          2 from cuml.cluster import HDBSCAN
          3 from cuml.manifold import UMAP
          4 # Create instances of GPU-accelerated UMAP and HDBSCAN
          5 umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)

    3 frames
    /usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py in
        507             leaf_size=40,
        508             algorithm="best",
    --> 509             memory=Memory(cachedir=None, verbose=0),
        510             approx_min_span_tree=True,
        511             gen_min_span_tree=False,

    TypeError: __init__() got an unexpected keyword argument 'cachedir'

    opened by research2023 17
  • topic -1

    I use BERTopic to analyze text from social media, dividing the docs into topics and looking for some common and important words in each division. In some cases, my model input is 150,000 docs and after the transform, the frequency of topic -1 is very high (35%-40%).

    So I want to know what exactly topic -1 is and what causes this...

    Thank you all

    opened by roikremer 17
  • don't save the fitted vectorizer model in the model!

    A fix to https://github.com/MaartenGr/BERTopic/issues/383

    Storing a fitted vectorizer_model makes the topic model extremely memory hungry. This way, self._c_tf_idf is slower, but it was only ever sped up on the second run anyway, in the case of doing topics_over_time or topics_per_class after already fitting the model once.

    CountVectorizer is pretty fast already (it takes about 2 minutes on my 48,000-document dataset), so I don't think it's worth storing the entire fitted model just to make .get_feature_names() a little faster in special use cases, compared to the cost in memory. In most use cases it is only fitted once, and the fitting stage isn't time-sensitive either. Plus, the .transform(documents) call that has to be done anyway still fits first, so the speedup isn't big in the first place.

    Also updated .get_feature_names() to .get_feature_names_out(), as the former is deprecated and will stop working with sklearn 1.2, which will be out soon.

    opened by simonfelding 17
  • Issue installing BERTTopic

    I am trying to install BERTopic in a Colab Pro+ setting (high-RAM machine).

    !pip install bertopic -q
    from bertopic import BERTopic
    

    I get the following error:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.4 which is incompatible.
    datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
    albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _dep_map(self)
       3015         try:
    -> 3016             return self.__dep_map
       3017         except AttributeError:
    
    18 frames
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in __getattr__(self, attr)
       2812         if attr.startswith('_'):
    -> 2813             raise AttributeError(attr)
       2814         return getattr(self._provider, attr)
    
    AttributeError: _DistInfoDistribution__dep_map
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _parsed_pkg_info(self)
       3006         try:
    -> 3007             return self._pkg_info
       3008         except AttributeError:
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in __getattr__(self, attr)
       2812         if attr.startswith('_'):
    -> 2813             raise AttributeError(attr)
       2814         return getattr(self._provider, attr)
    
    AttributeError: _pkg_info
    
    During handling of the above exception, another exception occurred:
    
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-14-0ca5028c9978> in <module>()
          1 get_ipython().system('pip install bertopic -q')
          2 
    ----> 3 from bertopic import BERTopic
          4 force_training = True
    
    /usr/local/lib/python3.7/dist-packages/bertopic/__init__.py in <module>()
    ----> 1 from bertopic._bertopic import BERTopic
          2 
          3 __version__ = "0.9.3"
          4 
          5 __all__ = [
    
    /usr/local/lib/python3.7/dist-packages/bertopic/_bertopic.py in <module>()
         20 # Models
         21 import hdbscan
    ---> 22 from umap import UMAP
         23 from sklearn.feature_extraction.text import CountVectorizer
         24 from sklearn.metrics.pairwise import cosine_similarity
    
    /usr/local/lib/python3.7/dist-packages/umap/__init__.py in <module>()
          1 from warnings import warn, catch_warnings, simplefilter
    ----> 2 from .umap_ import UMAP
          3 
          4 try:
          5     with catch_warnings():
    
    /usr/local/lib/python3.7/dist-packages/umap/umap_.py in <module>()
         45 )
         46 
    ---> 47 from pynndescent import NNDescent
         48 from pynndescent.distances import named_distances as pynn_named_distances
         49 from pynndescent.sparse import sparse_named_distances as pynn_sparse_named_distances
    
    /usr/local/lib/python3.7/dist-packages/pynndescent/__init__.py in <module>()
         13         numba.config.THREADING_LAYER = "workqueue"
         14 
    ---> 15 __version__ = pkg_resources.get_distribution("pynndescent").version
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_distribution(dist)
        464         dist = Requirement.parse(dist)
        465     if isinstance(dist, Requirement):
    --> 466         dist = get_provider(dist)
        467     if not isinstance(dist, Distribution):
        468         raise TypeError("Expected string, Requirement, or Distribution", dist)
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_provider(moduleOrReq)
        340     """Return an IResourceProvider for the named module or requirement"""
        341     if isinstance(moduleOrReq, Requirement):
    --> 342         return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
        343     try:
        344         module = sys.modules[moduleOrReq]
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in require(self, *requirements)
        884         included, even if they were already activated in this working set.
        885         """
    --> 886         needed = self.resolve(parse_requirements(requirements))
        887 
        888         for dist in needed:
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
        778 
        779             # push the new requirements onto the stack
    --> 780             new_requirements = dist.requires(req.extras)[::-1]
        781             requirements.extend(new_requirements)
        782 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in requires(self, extras)
       2732     def requires(self, extras=()):
       2733         """List of Requirements needed for this distro if `extras` are used"""
    -> 2734         dm = self._dep_map
       2735         deps = []
       2736         deps.extend(dm.get(None, ()))
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _dep_map(self)
       3016             return self.__dep_map
       3017         except AttributeError:
    -> 3018             self.__dep_map = self._compute_dependencies()
       3019             return self.__dep_map
       3020 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _compute_dependencies(self)
       3025         reqs = []
       3026         # Including any condition expressions
    -> 3027         for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
       3028             reqs.extend(parse_requirements(req))
       3029 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _parsed_pkg_info(self)
       3007             return self._pkg_info
       3008         except AttributeError:
    -> 3009             metadata = self.get_metadata(self.PKG_INFO)
       3010             self._pkg_info = email.parser.Parser().parsestr(metadata)
       3011             return self._pkg_info
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_metadata(self, name)
       1405             return ""
       1406         path = self._get_metadata_path(name)
    -> 1407         value = self._get(path)
       1408         try:
       1409             return value.decode('utf-8')
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _get(self, path)
       1609 
       1610     def _get(self, path):
    -> 1611         with open(path, 'rb') as stream:
       1612             return stream.read()
       1613 
    
    FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/numpy-1.19.5.dist-info/METADATA'
    

    The issue is usually resolved after re-running the command.

    opened by asafcloud-romarketinggroup 17
  • topics_over_time gets stuck

    Hello!

    I am trying to run the topics_over_time() function, in order to later run visualize_topics_over_time(). However, when I run topics_over_time(), it runs for 1 iteration and then gets stuck.

    • I am using GPU
    • The size of my text corpus is 80,000 documents

    Please help!

    Kind regards

    opened by oscarm3l1n 0
  • ValueError: empty vocabulary; perhaps the documents only contain stop word

    Hi, I tried to reduce the outliers in my BERTopic model. I used the first and third ways here, but I got this error both ways: ValueError: empty vocabulary; perhaps the documents only contain stop words.

    Function to delete stopwords:

    # Remove stopwords (assumes a stop_words list has been defined elsewhere)
    def remove_stopwords(txt):
        txt_clean = [word for word in txt.split(' ') if word not in stop_words] #txt.split(' ')
        new_txt_clean = ' '.join(txt_clean)
        return new_txt_clean
    

    training the model:

    from sklearn.cluster import KMeans
    from bertopic import BERTopic

    cluster_model = KMeans(n_clusters=50)
    topic_model = BERTopic(hdbscan_model=cluster_model)
    topics, probs = topic_model.fit_transform(documents)
    

    What is the reason for this error, and how can I solve it?

    opened by As2066 1
  • Ways to increase representative documents for a topic?

    Hello, and apologies if this is not the right place to ask for guidance with BERTopic. I am performing topic modeling and want to get the representative docs for a topic, but the calls seem to only return the default 3 documents, whereas I would like to return, say, the top n most relevant documents. My intuition says to grab the topic_embeddings_ list, compare the embedding for each doc with it, and rank based on cosine similarity, but I saw in another thread from a few months ago that topic_embeddings_ is only the average and not recommended; the docs seem to indicate it is the weighted average now? For my document embeddings, from sentence-BERT for example, would I need to recalculate them to take the c-TF-IDF weights into account, similar to how you generate them in the code, to get a better similarity ranking?

    Thanks and love this framework. Will def be contributing back to it :)

    opened by GeorgeDittmar 2
  • Update _bertopic.py

    Adding functionality in topics_over_time to allow users to specify how many terms under each topic they want to see at each timestep t. Right now, the default value is 5, but there are use cases where the user may want to see more than 5.

    opened by nbalepur 0
  • topic_model.get_topic()

    Hi, I used this line to display 20 words for topic number 0, topic_model.get_topic(0)[:20], but only 10 words appeared for me. Is there a way to display more words?

    opened by As2066 1
  • Flexibility of Cluster (-1) - Outliers Cluster

    Hello everybody!

    I've been experimenting with BERTopic recently, and the thing is that once the model is trained and I visualize the number of docs that each cluster contains, the group with the most docs by far is indeed -1 (under the outlier umbrella). Therefore, if this model goes into production, many of the docs will be considered outliers.

    1. Is there any way I can remove the clustering inside the outlier (-1) category? Maybe assigning docs to the most similar cluster even though there is not enough confidence.
    2. If not, how can I reduce the -1 cluster as much as possible? Maybe with parameters such as min_cluster_size (HDBSCAN) or n_neighbors (UMAP).
    3. In the following repo, is it counting the cluster -1 in the evaluation with OCTIS?

    Many thanks in advance!! :+1:

    Here is my model architecture:

    from sentence_transformers import SentenceTransformer
    from hdbscan import HDBSCAN
    from bertopic import BERTopic

    # Embedding model: See [1] for more details
    embedding_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
    
    # Clustering model: See [2] for more details
    cluster_model = HDBSCAN(min_cluster_size = 15, 
                            metric = 'euclidean', 
                            cluster_selection_method = 'eom', 
                            prediction_data = True)
    
    # BERTopic model
    topic_model = BERTopic(embedding_model = embedding_model,
                           hdbscan_model = cluster_model,
                           language = "multilingual")
    
    
    # Fit the model on a corpus
    topics, probs = topic_model.fit_transform(text)
    
    # topic reduction
    topic_model.reduce_topics(text, nr_topics=30)
    
    opened by miguelfrutos 2
Releases(v0.12.0)
  • v0.12.0(Sep 11, 2022)

    Highlights

    • Perform online/incremental topic modeling with .partial_fit
    • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
      • The parameters bm25_weighting and reduce_frequent_words were added to potentially improve representations:
    • Expose attributes for easier access to internal data
    • Added many tests with the intention of making development a bit more stable

    Documentation

    Fixes

    • Fixed iteratively merging topics (#632 and #648)
    • Fixed 0th topic not showing up in visualizations (#667)
    • Fixed lowercasing not being optional (#682)
    • Fixed spelling (#664 and #673)
    • Fixed 0th topic not shown in .get_topic_info by @oxymor0n in #660
    • Fixed spelling by @domenicrosati in #674
    • Add custom labels and title options to barchart @leloykun in #694

    Online/incremental topic modeling

    Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a .partial_fit function, which is also used in BERTopic.

    At a minimum, the cluster model needs to support a .partial_fit function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    from bertopic.vectorizers import OnlineCountVectorizer
    from bertopic import BERTopic
    
    # Prepare documents
    all_docs = fetch_20newsgroups(subset="all",  remove=('headers', 'footers', 'quotes'))["data"]
    doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]
    
    # Prepare sub-models that support online learning
    umap_model = IncrementalPCA(n_components=5)
    cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
    vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
    
    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=cluster_model,
                           vectorizer_model=vectorizer_model)
    
    # Incrementally fit the topic model by training on 1000 documents at a time
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
    

    Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, it is advised to also update the .topics_ attribute, as variations such as hierarchical topic modeling will not work afterward:

    # Incrementally fit the topic model by training on 1000 documents at a time and tracking the topics in each iteration
    topics = []
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
        topics.extend(topic_model.topics_)
    
    topic_model.topics_ = topics
    

    c-TF-IDF

    Explicitly define, use, and adjust the ClassTfidfTransformer with new parameters, bm25_weighting and reduce_frequent_words, to potentially improve the topic representation:

    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    
    ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
    topic_model = BERTopic(ctfidf_model=ctfidf_model)
    

    Attributes

    After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in topic_model.topics_.

    | Attribute | Type | Description |
    |-----------|------|-------------|
    | topics_ | List[int] | The topics that are generated for each document after training or updating the topic model. The most recent topics are tracked. |
    | probabilities_ | List[float] | The probability of the assigned topic per document. These are only calculated if an HDBSCAN model is used for the clustering step. When calculate_probabilities=True, then it is the probabilities of all topics per document. |
    | topic_sizes_ | Mapping[int, int] | The size of each topic. |
    | topic_mapper_ | TopicMapper | A class for tracking topics and their mappings anytime they are merged, reduced, added, or removed. |
    | topic_representations_ | Mapping[int, Tuple[int, float]] | The top n terms per topic and their respective c-TF-IDF values. |
    | c_tf_idf_ | csr_matrix | The topic-term matrix as calculated through c-TF-IDF. To access its respective words, run .vectorizer_model.get_feature_names() or .vectorizer_model.get_feature_names_out() |
    | topic_labels_ | Mapping[int, str] | The default labels for each topic. |
    | custom_labels_ | List[str] | Custom labels for each topic as generated through .set_topic_labels. |
    | topic_embeddings_ | np.ndarray | The embeddings for each topic. It is calculated by taking the weighted average of word embeddings in a topic based on their c-TF-IDF values. |
    | representative_docs_ | Mapping[int, str] | The representative documents for each topic if HDBSCAN is used. |
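
    For example, after fitting a model these attributes can be read directly. A small sketch, reusing the docs from the Quick Start:

    from bertopic import BERTopic

    topic_model = BERTopic()
    topic_model.fit(docs)

    topic_model.topics_[:10]       # topic assignment of the first ten documents
    topic_model.topic_sizes_       # number of documents per topic
    topic_model.topic_labels_      # default label per topic, as also shown by .get_topic_info()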

  • v0.11.0(Jul 11, 2022)

    Highlights

    Documentation

    • Added example for finding similar topics between two models in the tips & tricks page
    • Add multi-modal example in the tips & tricks page

    Fixes

    • Fix support for k-Means in .visualize_heatmap (#532)
    • Fix missing topic 0 in .visualize_topics (#533)
    • Fix inconsistencies in .get_topic_info (#572) and (#581)
    • Add optimal_ordering parameter to .visualize_hierarchy by @rafaelvalero in #390
    • Fix RuntimeError when used as sklearn estimator by @simonfelding in #448
    • Fix typo in visualization documentation by @dwhdai in #475
    • Fix typo in docstrings by @xwwwwww in #549
    • Support higher Flair versions

    Visualization examples

    Visualize hierarchical topic representations with .visualize_hierarchy:

    [interactive hierarchy visualization]

    Extract a text-based hierarchical topic representation with .get_topic_tree:

    .
    └─atheists_atheism_god_moral_atheist
         ├─atheists_atheism_god_atheist_argument
         │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
         │    └─■──br_god_exist_genetic_existence ── Topic: 124
         └─■──moral_morality_objective_immoral_morals ── Topic: 29
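
    A rough sketch of how both representations above might be produced; the .hierarchical_topics(docs) helper used here is an assumption about the API in this release, and a fitted topic_model and docs are assumed:

    # Extract the hierarchical structure of the topics
    hierarchical_topics = topic_model.hierarchical_topics(docs)

    # Interactive dendrogram and text-based tree of the same hierarchy
    topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
    print(topic_model.get_topic_tree(hierarchical_topics))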
    

    Visualize 2D documents with .visualize_documents():

    visualize_documents

    Visualize 2D hierarchical documents with .visualize_hierarchical_documents():

    visualize_hierarchical_documents

  • v0.10.0(Apr 30, 2022)

    Highlights

    • Use any dimensionality reduction technique instead of UMAP:
    from bertopic import BERTopic
    from sklearn.decomposition import PCA
    
    dim_model = PCA(n_components=5)
    topic_model = BERTopic(umap_model=dim_model)
    
    • Use any clustering technique instead of HDBSCAN:
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    
    cluster_model = KMeans(n_clusters=50)
    topic_model = BERTopic(hdbscan_model=cluster_model)
    

    Documentation

    • Add a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
    • Added pages on how to use other dimensionality reduction and clustering algorithms
    • Additional instructions on how to reduce outliers in the FAQ:
    import numpy as np
    probability_threshold = 0.01
    new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs] 
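
    Related to the CountVectorizer page mentioned above, a minimal sketch of customizing the topic representation through the vectorizer_model parameter (the specific CountVectorizer settings are illustrative, not recommendations):

    from sklearn.feature_extraction.text import CountVectorizer
    from bertopic import BERTopic

    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=5)
    topic_model = BERTopic(vectorizer_model=vectorizer_model)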
    

    Fixes

    • Fixed None being returned for probabilities when transforming unseen documents
    • Replaced all instances of arg: with Arguments: for consistency
    • Before saving a fitted BERTopic instance, we remove the stopwords in the fitted CountVectorizer model as it can get quite large due to the number of words that end in stopwords if min_df is set to a value larger than 1
    • Set "hdbscan>=0.8.28" to prevent numpy issues
      • Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 with BERTopic which leads to these numpy issues
    • Update gensim dependency to >=4.0.0 (#371)
    • Fix topic 0 not appearing in visualizations (#472)
    • Fix #506
    • Fix #429
  • v0.9.4(Dec 14, 2021)

    A number of fixes, documentation updates, and small features:

    Highlights:

    • Expose diversity parameter
      • Use BERTopic(diversity=0.1) to change how diverse the words in a topic representation are (ranges from 0 to 1)
    • Improve stability of topic reduction by only computing the cosine similarity within c-TF-IDF and not the topic embeddings
    • Added property to c-TF-IDF that all IDF values should be positive (#351)
    • Major documentation overhaul (mkdocs, tutorials, FAQ, images, etc. ) (#330)
    • Additional logging for .transform (#356)

    Fixes:

    • Drop python 3.6 (#333)
    • Relax plotly dependency (#88)
    • Improve stability of .visualize_barchart() and .visualize_hierarchy()
  • v0.9.3(Oct 17, 2021)

    Fix #282, #285, and #288.

    Fixes

    • #282
      • As it turns out the old implementation of topic mapping was still found in the transform function
    • #285
      • Fix getting all representative docs
    • Fix #288
      • A recent issue with the package pyyaml that can be found in Google Colab
    • Remove the YAMLLoadWarning each time BERTopic is imported
    import yaml
    yaml._warnings_enabled["YAMLLoadWarning"] = False
    
  • v0.9.2(Oct 12, 2021)

    A release focused on algorithmic optimization and fixing several issues:

    Highlights:

    • Update the non-multilingual paraphrase-* models to the all-* models due to improved performance
    • Reduce necessary RAM in c-TF-IDF top 30 word extraction

    Fixes:

    • Fix topic mapping
      • When reducing the number of topics, these need to be mapped to the correct input/output which had some issues in the previous version
      • A new class was created as a way to track these mappings regardless of how many times they were executed
      • In other words, you can iteratively reduce the number of topics after training the model without the need to continuously train the model
    • Fix typo in embeddings page (#200)
    • Fix link in README (#233)
    • Fix documentation .visualize_term_rank() (#253)
    • Fix getting correct representative docs (#258)
    • Update memory FAQ with HDBSCAN pr
  • v0.9.1(Sep 1, 2021)

    Fixes:

    • Fix TypeError when auto-reducing topics (#210)
    • Fix mapping representative docs when reducing topics (#208)
    • Fix visualization issues with probabilities (#205)
    • Fix missing normalize_frequency param in plots (#213)
  • v0.9.0(Aug 7, 2021)

    Highlights

    • Implemented a Guided BERTopic -> Use seeds to steer the Topic Modeling
    • Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
      • This allows users to see which documents are good representations of a topic and better understand the topics that were created
    • Added normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time in order to better compare the relative topic frequencies between topics
    • Return flat probabilities as default, only calculate the probabilities of all topics per document if calculate_probabilities is True
    • Added several FAQs

    Fixes

    • Fix loading pre-trained BERTopic model
    • Fix mapping of probabilities
    • Fix #190

    Guided BERTopic

    Guided BERTopic works in two ways:

    First, we create embeddings for each seeded topic by joining them and passing them through the document embedder. These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label. If the document is most similar to a seeded topic, then it will get that topic's label. If it is most similar to the average document embedding, it will get the -1 label. These labels are then passed through UMAP to create a semi-supervised approach that should nudge the topic creation to the seeded topics.

    Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant, like taking the distribution of IDF values and its position into account when defining the multiplier.

    seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                       ["acquisition", "procurement", "merge"],
                       ["exchange", "currency", "trading", "rate", "euro"],
                       ["grain", "wheat", "corn"],
                       ["coffee", "cocoa"],
                       ["natural", "gas", "oil", "fuel", "products", "petrol"]]
    
    topic_model = BERTopic(seed_topic_list=seed_topic_list)
    topics, probs = topic_model.fit_transform(docs)
    
  • v0.8.1(Jun 8, 2021)

    Highlights:

    • Improved models:
      • For English documents the default is now: "paraphrase-MiniLM-L6-v2"
      • For Non-English or multi-lingual documents the default is now: "paraphrase-multilingual-MiniLM-L12-v2"
      • Both models not only show great performance but are also much faster!
    • Add interactive visualizations to the plotting API documentation

    For even better performance, please use the following models:

    • English: "paraphrase-mpnet-base-v2"
    • Non-English or multi-lingual: "paraphrase-multilingual-mpnet-base-v2"

    Fixes:

    • Improved unit testing for more stability
    • Set transformers version for Flair
  • v0.8.0(May 31, 2021)

    Mainly a visualization update to improve understanding of the topic model.

    Features

    • Additional visualizations:
      • Topic Hierarchy: topic_model.visualize_hierarchy()
      • Topic Similarity Heatmap: topic_model.visualize_heatmap()
      • Topic Representation Barchart: topic_model.visualize_barchart()
      • Term Score Decline: topic_model.visualize_term_rank()

    Improvements

    • Created bertopic.plotting library to easily extend visualizations
    • Improved automatic topic reduction by using HDBSCAN to detect similar topics
    • Sort topic ids by their frequency. -1 is the outlier class and typically contains the most documents. After that, 0 is the largest topic, 1 the second largest, etc.
    • Update MKDOCS with new visualizations

    Fixes

    • Fix typo #113, #117
    • Fix #121 by removing the following two lines:
      • https://github.com/MaartenGr/BERTopic/blob/5c6cf22776fafaaff728370781a5d33727d3dc8f/bertopic/_bertopic.py#L359-L360
    • Fix mapping of topics after reduction (it now excludes 0) (#103)
  • v0.7.0(Apr 26, 2021)

    The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!

    Highlights:

    • (semi-)supervised topic modeling by leveraging supervised options in UMAP
      • model.fit(docs, y=target_classes)
    • Backends:
      • Added Spacy, Gensim, USE (TFHub)
      • Use a different backend for document embeddings and word embeddings
      • Create your own backends with bertopic.backend.BaseEmbedder
      • Click here for an overview of all new backends
    • Calculate and visualize topics per class
      • Calculate: topics_per_class = topic_model.topics_per_class(docs, topics, classes)
      • Visualize: topic_model.visualize_topics_per_class(topics_per_class)
    • Several tutorials were updated and added:

    | Name | Link |
    |---|---|
    | Topic Modeling with BERTopic | Open In Colab |
    | (Custom) Embedding Models in BERTopic | Open In Colab |
    | Advanced Customization in BERTopic | Open In Colab |
    | (semi-)Supervised Topic Modeling with BERTopic | Open In Colab |
    | Dynamic Topic Modeling with Trump's Tweets | Open In Colab |

    Fixes:

    • Fixed issues with Torch req
    • Prevent saving term frequency matrix in CTFIDF class
    • Fixed DTM not working when reducing topics (#96)
    • Moved visualization dependencies to base BERTopic
      • pip install bertopic[visualization] becomes pip install bertopic
    • Allow precomputed embeddings in bertopic.find_topics() (#79):
    model = BERTopic(embedding_model=my_embedding_model)
    model.fit(docs, my_precomputed_embeddings)
    model.find_topics(search_term)
    
  • v0.6.0(Mar 9, 2021)

    Highlights:

    • DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation (see the sketch after this list)
      • model.topics_over_time(docs, timestamps, global_tuning=True)
    • DTM: Option to evolve topics based on t-1 c-TF-IDF representation which results in evolving topics over time
      • Only uses topics at t-1 and skips evolution if there is a gap
      • model.topics_over_time(docs, timestamps, evolution_tuning=True)
    • DTM: Function to visualize topics over time
      • model.visualize_topics_over_time(topics_over_time)
    • DTM: Add binning of timestamps
      • model.topics_over_time(docs, timestamps, nr_bins=10)
    • Added a function to get general information about topics (id, frequency, name, etc.)
      • get_topic_info()
    • Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
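
    A minimal dynamic topic modeling sketch following the calls above. The timestamps variable is an assumption for this sketch: a list with one timestamp (e.g., a date) per document:

    from bertopic import BERTopic

    # docs: list of documents, timestamps: one timestamp per document (both assumed)
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    # Topics over time with global tuning and binned timestamps
    topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=10, global_tuning=True)
    topic_model.visualize_topics_over_time(topics_over_time)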

    Fixes:

    • Fixed _map_probabilities() not taking into account that the outlier class has no probability, and probabilities being mutated instead of copied (#63, #64)
  • v0.5.0(Feb 8, 2021)

    Features

    • Add Flair to allow for more (custom) token/document embeddings
    • Option to use custom UMAP, HDBSCAN, and CountVectorizer
    • Added low_memory parameter to reduce memory during computation
    • Improved verbosity (shows progress bar)
    • Improved testing
    • Use the newest version of sentence-transformers as it speeds up encoding significantly
    • Return the figure of visualize_topics()
    • Expose all parameters with a single function: get_params()
    • Option to disable saving the embedding_model, which should significantly reduce the size of a saved BERTopic model
    • Add FAQ page

    Fixes

    • To simplify the API, the parameters stop_words and n_neighbors were removed. These can still be set through a custom CountVectorizer or UMAP model, as shown in the sketch below.
    • Set calculate_probabilities to False by default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better to only compute them when this option is explicitly turned on.
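
    A hedged sketch of the customization described above, assuming umap-learn, hdbscan, and scikit-learn are installed; the specific parameter values are only illustrative:

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP
    from hdbscan import HDBSCAN

    # stop_words and n_neighbors now live on the sub-models instead of on BERTopic itself
    vectorizer_model = CountVectorizer(stop_words="english")
    umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
    hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)

    topic_model = BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        calculate_probabilities=True,  # off by default since this release
        low_memory=True,
    )
    topics, probs = topic_model.fit_transform(docs)  # docs: list of documents (assumed)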
  • v0.4.2(Jan 10, 2021)

    Fixed the parameter embedding_model not working properly when language had been set. If you are using an older version of BERTopic, please set language to False when you want to set embedding_model.
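
    As a one-line sketch of this workaround for older versions (the embedding model name is only an example):

    from bertopic import BERTopic

    # Older versions only: disable the language default so that embedding_model is used
    topic_model = BERTopic(language=False, embedding_model="distiluse-base-multilingual-cased")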

  • v0.4.1(Jan 7, 2021)

  • v0.4.0(Dec 21, 2020)

    Highlights:

    • Visualize Topics similar to LDAvis
    • Added option to reduce topics after training
    • Added option to update topic representation after training
    • Added option to search topics using a search term (see the sketch after this list)
    • Significantly improved the stability of generating clusters
    • Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
    • More extensive tutorials in the documentation
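
    A hedged sketch of the reduce/update/search options above; exact signatures have changed across versions, so the keyword arguments shown here are indicative rather than definitive:

    from bertopic import BERTopic

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)  # docs: list of documents (assumed)

    # Reduce the number of topics after training
    topic_model.reduce_topics(docs, nr_topics=30)

    # Update the topic representation after training, e.g. with a different n-gram range
    topic_model.update_topics(docs, n_gram_range=(1, 2))

    # Search topics using a search term
    similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)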

    Notable Changes:

    • Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
    • Improved logging (remove duplicates)
    • Check if BERTopic is fitted
    • Added TF-IDF as an embedder instead of transformer models (see tutorial)
    • Numpy for Python 3.6 will be dropped and was therefore removed from the workflow.
    • Preprocess text before passing it through c-TF-IDF
    • Merged get_topics_freq() with get_topic_freq()

    Fixes:

    • Fixed an error when handling topic probabilities
  • v0.3.2(Nov 16, 2020)

    Fixed a bug with the topic reduction method that reduced the number of topics, but not to the nr_topics defined in the class. Since this was, to a certain extent, breaking the topic reduction method, a new release was necessary.

  • v0.3.1(Nov 4, 2020)

    Added the option to use custom embeddings, or embeddings that you generated beforehand with whatever package you'd like. This allows users to further customize BERTopic to their liking.

    NOTE: I cannot guarantee that using your own embeddings will result in better performance; it can swing both ways depending on the embeddings you are using. For example, poorly trained word2vec embeddings are likely to result in poor topic generation. Thus, it is up to the user to experiment with the embeddings that best serve their purposes.
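
    A minimal sketch of passing pre-computed embeddings. sentence-transformers is used here purely as an example; any package that produces one vector per document works:

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # docs: list of documents to embed and model (assumed)
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = embedding_model.encode(docs, show_progress_bar=False)

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs, embeddings)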

  • v0.3.0(Oct 29, 2020)

    • transform() and fit_transform() now also return the topic probability distributions
    • Added visualize_distribution() which visualizes the topic probability distribution for a single document
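
    A short sketch: since fit_transform() now also returns the probability distributions, the probabilities of a single document can be passed straight to visualize_distribution() (later versions only return full distributions when calculate_probabilities=True):

    topics, probs = topic_model.fit_transform(docs)  # docs: list of documents (assumed)

    # Visualize the topic probability distribution of the first document
    topic_model.visualize_distribution(probs[0])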
  • v0.2.3(Oct 17, 2020)

  • v0.2.1(Oct 11, 2020)

    Improved the calculation of the class-based TF-IDF procedure by limiting the calculation to sparse matrices. This prevents out-of-memory problems when faced with large datasets.
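
    For intuition only, a simplified sparse sketch of the class-based TF-IDF idea; this is an illustration of the approach, not BERTopic's exact implementation:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy example: all documents belonging to one class joined into a single string per class
    docs_per_class = [
        "space launch orbit lunar rocket launch",
        "windows drive dos file disk drive",
    ]

    # Sparse term-frequency matrix of shape (classes x terms)
    tf = CountVectorizer().fit_transform(docs_per_class)

    # Class-based TF-IDF: normalize term counts per class and weight each term by
    # log(1 + average number of words per class / total frequency of the term)
    words_per_class = np.asarray(tf.sum(axis=1))    # shape (classes, 1)
    avg_nr_words = words_per_class.mean()
    term_freq = np.asarray(tf.sum(axis=0)).ravel()  # shape (terms,)
    idf = np.log(1 + avg_nr_words / term_freq)

    ctfidf = tf.multiply(1 / words_per_class).multiply(idf.reshape(1, -1))  # stays sparse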

  • v0.1.2(Oct 1, 2020)

  • v0.1.1(Sep 24, 2020)

  • v0.1.0(Sep 24, 2020)

    • Added parameters for UMAP and HDBSCAN
    • Option to choose sentence-transformer model
    • Method for transforming unseen documents
    • Save and load trained models (UMAP and HDBSCAN)
    • Extract topics and their sizes
    • Optimized c-TF-IDF
    • Improved documentation
    • Improved topic reduction
Owner
Maarten Grootendorst (Data Scientist | Psychologist)