Fast topic modeling platform

Last update: Dec 21, 2022

Overview

The state-of-the-art platform for topic modeling.

What is BigARTM?

BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models (ARTM). This technique builds multi-objective models by adding a weighted sum of regularizers to the optimization criterion. BigARTM is known to combine very different objectives well, including sparsing, smoothing, topic decorrelation and many others. Such a combination of regularizers significantly improves several quality measures at once with almost no loss in perplexity.
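As a rough illustration of the idea (a minimal sketch, not BigARTM's implementation), the regularized criterion can be written as the collection log-likelihood plus a weighted sum of regularizer values. The function names, the toy matrices and the `sparse_phi` regularizer below are assumptions made for this example only.

```python
import math

def log_likelihood(n_dw, phi, theta):
    """Sum over documents d and words w of n_dw * ln(sum_t phi[w][t] * theta[t][d])."""
    total = 0.0
    num_topics = len(theta)
    for d, counts in enumerate(n_dw):
        for w, n in enumerate(counts):
            if n > 0:
                p_wd = sum(phi[w][t] * theta[t][d] for t in range(num_topics))
                total += n * math.log(p_wd)
    return total

def regularized_criterion(n_dw, phi, theta, regularizers):
    """Log-likelihood plus the weighted sum of regularizers.

    regularizers is a list of (tau, R) pairs, where R(phi, theta) returns a float.
    """
    return log_likelihood(n_dw, phi, theta) + sum(
        tau * reg(phi, theta) for tau, reg in regularizers)

def sparse_phi(phi, theta):
    """Toy sparsing regularizer (illustrative): minus the sum of ln(phi_wt)."""
    return -sum(math.log(p) for row in phi for p in row if p > 0)

# Toy collection: two documents, two words, two topics.
n_dw = [[2, 0], [0, 3]]            # n_dw[d][w] = count of word w in document d
phi = [[0.9, 0.1], [0.1, 0.9]]     # phi[w][t] = p(w | t)
theta = [[0.8, 0.2], [0.2, 0.8]]   # theta[t][d] = p(t | d)

plain = regularized_criterion(n_dw, phi, theta, [])
regularized = regularized_criterion(n_dw, phi, theta, [(0.05, sparse_phi)])
```

Each `(tau, R)` pair plays the role of one weighted regularizer; changing a weight shifts the balance between fitting the data and the extra objective.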

References

Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections // Analysis of Images, Social Networks and Texts. 2015.
Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M., Yanina A. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, October 19, 2015 - pp. 29-37.
Vorontsov K., Potapenko A., Plavin A. Additive Regularization of Topic Models for Topic Selection and Sparse Factorization. // Statistical Learning and Data Sciences. 2015 — pp. 193-202.
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning Journal, Special Issue “Data Analysis and Intelligent Optimization”, Springer, 2014.
More publications can be found on our wiki page.

Related Software Packages

TopicNet is a high-level interface for BigARTM which is helpful for rapid solution prototyping and for exploring the topics of finished ARTM models.
David Blei's List of Open Source topic modeling software
MALLET: Java-based toolkit for language processing with topic modeling package
Gensim: Python topic modeling library
Vowpal Wabbit has an implementation of Online-LDA algorithm

Installation

Installing with pip (Linux only)

We publish two PyPI packages for Linux: bigartm (the 0.9.x series) and bigartm10 (the 0.10.x series). Install one of them:

$ pip install bigartm

$ pip install bigartm10

Installing on Windows

We suggest using pre-built binaries.

It is also possible to compile the C++ code on Windows if you want the latest development version.

Installing on Linux / MacOS

Download a binary release or build from source using cmake:

$ mkdir build && cd build
$ cmake ..
$ make install

See here for detailed instructions.

How to Use

Command-line interface

Check out documentation for bigartm.

Examples:

Basic model (20 topics, output to a CSV file, inferred in 10 passes)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20

Basic model with fewer tokens (tokens with extreme frequencies are filtered out)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt

Simple regularized model (increase sparsity up to 60-70%)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt 
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"

More advanced regularized model, with 10 sparse objective topics and 2 smooth background topics

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"
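The commands above are shown wrapped across lines, and values such as obj:10;background:2 contain characters that some shells treat specially. One way around both problems is to build the argument list in Python and pass it to subprocess without a shell. This is a minimal sketch: the helper name `build_bigartm_args` and the executable name `bigartm` are assumptions (the examples above use bigartm.exe on Windows), and only flags that appear in the examples above are used.

```python
import subprocess

def build_bigartm_args(docword, vocab, topics, passes=10, batch_size=50,
                       model_out="model.txt", regularizers=(),
                       max_df=None, min_df=None, executable="bigartm"):
    """Assemble a bigartm command line as a list of arguments (hypothetical helper)."""
    args = [executable, "-d", docword, "-v", vocab]
    if max_df is not None:
        args += ["--dictionary-max-df", str(max_df)]
    if min_df is not None:
        args += ["--dictionary-min-df", str(min_df)]
    args += ["--passes", str(passes),
             "--batch-size", str(batch_size),
             "--topics", str(topics),
             "--write-model-readable", model_out]
    # One --regularizer flag per regularizer, as in the last example above.
    for reg in regularizers:
        args += ["--regularizer", reg]
    return args

args = build_bigartm_args(
    "docword.kos.txt", "vocab.kos.txt",
    topics="obj:10;background:2",
    max_df="50%", min_df=2,
    regularizers=["0.05 SparsePhi #obj",
                  "0.05 SparseTheta #obj",
                  "0.25 SmoothPhi #background",
                  "0.25 SmoothTheta #background"])

# Uncomment once the bigartm executable is on PATH:
# subprocess.run(args, check=True)
```

Because the arguments are passed as a list, no shell quoting is needed for the semicolon, the percent sign or the spaces inside each regularizer string.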

Interactive Python interface

BigARTM provides a full-featured, clear Python API (see Installation to configure the Python API for your OS).

Example:

import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
# get_feature_names() is deprecated since scikit-learn 1.0 and removed in 1.2
vocabulary = cv.get_feature_names_out().tolist()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

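# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
# Expects docword.kos.txt and vocab.kos.txt in data_path (the current directory here)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          data_path='.',
                          collection_name='kos',
                          target_folder='kos_batches')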

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()

Refer to the tutorials for details on how to start using BigARTM from Python. The user's guide covers more advanced features and use cases.
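Case 2 in the example needs a docword and a vocab file on disk. For a quick experiment without downloading a dataset, a tiny collection in the UCI bag-of-words layout can be written by hand. The sketch below is an illustration under stated assumptions: `write_uci_collection` is a hypothetical helper, and the toy documents are made up. The layout itself (three header lines with the number of documents, vocabulary size and number of non-zero entries, followed by 1-based "docID wordID count" triples) follows the UCI format.

```python
import os
import tempfile
from collections import Counter

def write_uci_collection(docs, collection_name, data_path="."):
    """Write docword.<name>.txt and vocab.<name>.txt for tokenized documents."""
    vocab = sorted({token for doc in docs for token in doc})
    word_id = {token: i + 1 for i, token in enumerate(vocab)}  # word IDs are 1-based
    rows = []
    for doc_id, doc in enumerate(docs, start=1):               # doc IDs are 1-based
        for token, count in sorted(Counter(doc).items()):
            rows.append((doc_id, word_id[token], count))

    os.makedirs(data_path, exist_ok=True)
    vocab_file = os.path.join(data_path, "vocab.%s.txt" % collection_name)
    docword_file = os.path.join(data_path, "docword.%s.txt" % collection_name)
    with open(vocab_file, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab))                              # one token per line, no trailing newline
    with open(docword_file, "w", encoding="utf-8") as f:
        f.write("%d\n%d\n%d\n" % (len(docs), len(vocab), len(rows)))
        for row in rows:
            f.write("%d %d %d\n" % row)
    return docword_file, vocab_file

# Toy usage: two short documents written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
docword_file, vocab_file = write_uci_collection(
    [["cat", "dog", "cat"], ["dog"]], "toy", tmp_dir)
```

The resulting files can then be passed to artm.BatchVectorizer with data_format='bow_uci', collection_name='toy' and data_path set to the directory that holds them.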

Low-level API

Contributing

Refer to the Developer's Guide and follow the Code Style.

To report a bug, use the issue tracker. To ask a question, use our mailing list. Feel free to make a pull request.

License

BigARTM is released under the New BSD License, which allows unlimited redistribution for any purpose (even for commercial use) as long as its copyright notices and the license’s disclaimers of warranty are maintained.

Comments

Sphinx python docs

This pull request introduces basic functionality of generating Python API documentation automatically, using docstrings. Feel free to comment and suggest new ideas before merge :)

Unfortunately I didn't manage to create pull request against stable branch, because I created my branch from master and couldn't merge commits on master_component.py :-(

opened by JeanPaulShapo 20

Build fails on MacOS 10.12

Jeroens-MacBook-Pro:build jeroen$ brew install boost
Updating Homebrew...
==> Auto-updated Homebrew!
Updated 1 tap (homebrew/core).
==> Updated Formulae
aws-sdk-cpp                  conan                        knot                         mercurial                    servus                       vim
awscli                       docker-machine               knot-resolver                phoronix-test-suite          svgcleaner                   wpcli-completion
bazel                        docker-machine-completion    libgphoto2                   sdl_mixer                    termius                      yara
citus                        gphoto2                      makepkg                      sdl_sound                    vapoursynth

==> Downloading https://homebrew.bintray.com/bottles/boost-1.64.0_1.sierra.bottle.tar.gz
######################################################################## 100.0%
==> Pouring boost-1.64.0_1.sierra.bottle.tar.gz
==> Using the sandbox
🍺  /usr/local/Cellar/boost/1.64.0_1: 12,628 files, 395.7MB
Jeroens-MacBook-Pro:build jeroen$ unset  BOOST_INCLUDEDIR
Jeroens-MacBook-Pro:build jeroen$ cmake ..
-- Build type: Release
-- Boost version: 1.64.0
-- Boost version: 1.64.0
-- Found the following Boost libraries:
--   thread
--   program_options
--   date_time
--   filesystem
--   iostreams
--   system
--   chrono
--   timer
--   atomic
--   regex
-- Looking for C++ include stdint.h
-- Looking for C++ include stdint.h - found
-- Looking for C++ include inttypes.h
-- Looking for C++ include inttypes.h - found
-- Looking for C++ include sys/types.h
-- Looking for C++ include sys/types.h - found
-- Looking for C++ include sys/stat.h
-- Looking for C++ include sys/stat.h - found
-- Looking for C++ include fnmatch.h
-- Looking for C++ include fnmatch.h - found
-- Looking for strtoll
-- Looking for strtoll - found
-- Looking for C++ include stddef.h
-- Looking for C++ include stddef.h - found
-- Check size of pthread_rwlock_t
-- Check size of pthread_rwlock_t - done
-- running mz compiler detection tools
-- compiler is clang
-- GCC compatible compiler found
-- compiler version Apple LLVM version 8.1.0 (clang-802.0.42)
Target: x86_64-apple-darwin16.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
-- C++11 support detected
-- 64bit platform
-- Today is: Tue, 06 Jun 2017 21:41:10 +0200

-- forcing C++11 support on this platform
-- configuring for build type: Release
-- Looking for include file dlfcn.h
-- Looking for include file dlfcn.h - found
-- Looking for include file execinfo.h
-- Looking for include file execinfo.h - found
-- Looking for include file glob.h
-- Looking for include file glob.h - found
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for include file libunwind.h
-- Looking for include file libunwind.h - found
-- Looking for include file libunwind.h
-- Looking for include file libunwind.h - found
-- Looking for include file gflags/gflags.h
-- Looking for include file gflags/gflags.h - not found
-- Looking for include file memory.h
-- Looking for include file memory.h - found
-- Looking for include file pthread.h
-- Looking for include file pthread.h - found
-- Looking for include file pwd.h
-- Looking for include file pwd.h - found
-- Looking for include file stdlib.h
-- Looking for include file stdlib.h - found
-- Looking for include file strings.h
-- Looking for include file strings.h - found
-- Looking for include file syscall.h
-- Looking for include file syscall.h - not found
-- Looking for include file syslog.h
-- Looking for include file syslog.h - found
-- Looking for include file sys/call.h
-- Looking for include file sys/call.h - not found
-- Looking for include file sys/time.h
-- Looking for include file sys/time.h - found
-- Looking for include file sys/syscall.h
-- Looking for include file sys/syscall.h - found
-- Looking for include file sys/ucontext.h
-- Looking for include file sys/ucontext.h - found
-- Looking for include file sys/utsname.h
-- Looking for include file sys/utsname.h - found
-- Looking for include file ucontext.h
-- Looking for include file ucontext.h - not found
-- Looking for include file unwind.h
-- Looking for include file unwind.h - found
-- Looking for fcntl
-- Looking for fcntl - found
-- Looking for sigaltstack
-- Looking for sigaltstack - found
-- Looking for __builtin_expect
-- Looking for __builtin_expect - not found
-- Looking for __sync_val_compare_and_swap
-- Looking for __sync_val_compare_and_swap - not found
CMake Warning (dev) at 3rdparty/protobuf-3.0.0/cmake/install.cmake:41 (message):
  The file
  "/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/repeated_field_reflection.h"
  is listed in
  "/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/cmake/cmake/extract_includes.bat.in"
  but there not exists.  The file will not be installed.
Call Stack (most recent call first):
  3rdparty/protobuf-3.0.0/cmake/CMakeLists.txt:159 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Performing Test COMPILER_SUPPORTS_CXX11
-- Performing Test COMPILER_SUPPORTS_CXX11 - Success
-- Performing Test COMPILER_SUPPORTS_CXX0X
-- Performing Test COMPILER_SUPPORTS_CXX0X - Success
-- Found GLOG: /Users/jeroen/Downloads/bigartm/3rdparty/glog/src
-- Boost version: 1.64.0
-- Found the following Boost libraries:
--   thread
--   program_options
--   date_time
--   filesystem
--   iostreams
--   system
--   chrono
--   timer
--   atomic
--   regex
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/jeroen/Downloads/bigartm/build
Jeroens-MacBook-Pro:build jeroen$ make
Scanning dependencies of target gflags-static
[  0%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags.cc.o
[  0%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags_reporting.cc.o
[  1%] Building CXX object 3rdparty/gflags/CMakeFiles/gflags-static.dir/src/gflags_completions.cc.o
[  1%] Linking CXX static library ../../lib/libgflags.a
[  1%] Built target gflags-static
Scanning dependencies of target google-glog
[  1%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/logging.cc.o
[  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/raw_logging.cc.o
[  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/vlog_is_on.cc.o
[  2%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/utilities.cc.o
[  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/demangle.cc.o
[  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/symbolize.cc.o
[  3%] Building CXX object 3rdparty/glog/CMakeFiles/google-glog.dir/src/signalhandler.cc.o
[  4%] Linking CXX static library ../../lib/libgoogle-glog.a
[  4%] Built target google-glog
Scanning dependencies of target libprotobuf
[  4%] Building CXX object 3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arena.cc.o
[  5%] Building CXX object 3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arenastring.cc.o
/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/arenastring.cc:41:22: error: redefinition of 'AssignWithDefault'
void ArenaStringPtr::AssignWithDefault(const ::std::string* default_value,
                     ^
/usr/local/include/google/protobuf/arenastring.h:316:29: note: previous definition is here
inline void ArenaStringPtr::AssignWithDefault(const ::std::string* default_value,
                            ^
/Users/jeroen/Downloads/bigartm/3rdparty/protobuf-3.0.0/src/google/protobuf/arenastring.cc:47:48: error: too many arguments to function call, expected 0, have 1
    SetNoArena(default_value, value.GetNoArena(default_value));
                              ~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~
/usr/local/include/google/protobuf/arenastring.h:225:3: note: 'GetNoArena' declared here
  inline const ::std::string& GetNoArena() const { return *ptr_; }
  ^
2 errors generated.
make[2]: *** [3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/__/src/google/protobuf/arenastring.cc.o] Error 1
make[1]: *** [3rdparty/protobuf-3.0.0/cmake/CMakeFiles/libprotobuf.dir/all] Error 2
make: *** [all] Error 2

opened by jeroen 19

CentOS linker issue

On CentOS 7 I got the following:

[ 99%] Building CXX object src/bigartm/CMakeFiles/bigartm.dir/__/artm/cpp_interface.cc.o
[100%] Building CXX object src/bigartm/CMakeFiles/bigartm.dir/__/artm/messages.pb.cc.o
Linking CXX executable bigartm
/usr/bin/ld: cannot find -lboost_thread-mt
/usr/bin/ld: cannot find -lboost_program_options-mt
/usr/bin/ld: cannot find -lboost_date_time-mt
/usr/bin/ld: cannot find -lboost_filesystem-mt
/usr/bin/ld: cannot find -lboost_iostreams-mt
/usr/bin/ld: cannot find -lboost_system-mt
/usr/bin/ld: cannot find -lboost_chrono-mt
/usr/bin/ld: cannot find -lboost_timer-mt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lstdc++
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
make[2]: *** [src/bigartm/bigartm] Error 1
make[1]: *** [src/bigartm/CMakeFiles/bigartm.dir/all] Error 2
make: *** [all] Error 2

Actually I have all these files, but their names start with lib instead of l. I'm not a make/cmake expert, so I can't figure out how to fix this.
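For context on the naming: with GNU ld, a flag of the form -lNAME makes the linker search its library path for libNAME.so and then libNAME.a, so file names that start with lib are expected. The sketch below only illustrates that mapping; `linker_candidates` is a hypothetical helper, and it does not diagnose the failure above. Since even -lc and -lpthread are reported missing, the cause may well be broader than a Boost naming mismatch.

```python
def linker_candidates(flag, static_ok=True):
    """Map a -lNAME linker flag to the file names GNU ld looks for."""
    if not flag.startswith("-l"):
        raise ValueError("expected a flag of the form -lNAME, got %r" % flag)
    name = flag[2:]
    candidates = ["lib%s.so" % name]        # shared library is tried first
    if static_ok:
        candidates.append("lib%s.a" % name) # then the static archive
    return candidates

# A few of the flags from the log above.
missing = ["-lboost_thread-mt", "-lboost_system-mt", "-lpthread"]
expected_files = {flag: linker_candidates(flag) for flag in missing}
```

Note that -lboost_thread-mt maps to libboost_thread-mt.so, including the -mt suffix, so a file named libboost_thread.so would not satisfy it.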
documentation build

opened by dselivanov 19
Build standalone Python wheels
I call "a standalone wheel" a package which does not require the libartm installation - aka "tensorflow style". Those wheels can be safely pushed on PyPi and will just work.

I had to change:

python/CMakeLists.txt: add another custom target to execute setup.py bdist_wheel

python/setup.py: hack setuptools to include the needed shared library

python/artm/wrapper/api.py: refactor _load_cdll() to load the library from several places - the common practice. The env variable is the most important, then the default relative path and finally the absolute path in the package root.

docs - as far as I could
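The lookup order described for _load_cdll() (environment variable first, then the default relative path, then the absolute path in the package root) can be sketched as follows. This is an illustration, not the code from the pull request: the function names, the default file name `libartm.so` and the use of the `ARTM_SHARED_LIBRARY` variable name are assumptions of this example.

```python
import ctypes
import os

def candidate_library_paths(package_dir, lib_name="libartm.so", env=None):
    """List shared-library locations in the order they should be tried."""
    env = os.environ if env is None else env
    candidates = []
    env_path = env.get("ARTM_SHARED_LIBRARY")   # assumed variable name
    if env_path:
        candidates.append(env_path)             # 1. explicit override
    candidates.append(lib_name)                 # 2. default relative name (system loader search)
    candidates.append(os.path.join(os.path.abspath(package_dir), lib_name))  # 3. package root
    return candidates

def load_library(candidates):
    """Return the first library that loads; raise OSError if none does."""
    for path in candidates:
        try:
            return ctypes.CDLL(path)
        except OSError:
            continue
    raise OSError("could not load the shared library; tried: %s" % ", ".join(candidates))

paths = candidate_library_paths(
    "/opt/site-packages/artm",
    env={"ARTM_SHARED_LIBRARY": "/custom/libartm.so"})
```

Trying the environment variable first lets a user point at a custom build without touching the installed package.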

Besides, there seems to be a bug in the TravisCI configuration regarding python command detection. There is no explicit -DPYTHON in the build script, and by default cmake chooses python2, which prevents testing on py3.x:

if (MSVC OR APPLE)
  set(PYTHON python CACHE INTERNAL "Python command")
else (MSVC OR APPLE)
  set(PYTHON python2 CACHE INTERNAL "Python command")
endif (MSVC OR APPLE)

The bdist_wheel command helped to find this bug: there is no wheel package installed in the py2 env in Travis. I dared to remove the python2 branch completely.
opened by vmarkovtsev 18
Auto-generate C++-headers and sources from Proto-files

An overview of the changes is contained in the commit names. The changes include use of ${PROTOBUF_PROTOC_EXECUTABLE} in CMakeLists, so I strongly recommend that someone with Windows check them (maybe @sashafrey or @MelLain?). On my laptop with Fedora 22 and no system protobuf distribution it works perfectly.

opened by JeanPaulShapo 14
Get ASCII encoding problem if run .fit_offline
I tried to run "Демострация BigARTM (версия 0.8.0).ipynb" from https://www.coursera.org/learn/unsupervised-learning/supplement/suSWG/noutbuk-iz-diemonstratsii-ispol-zovaniia-bigartm

artm.version() returns 0.8.1.

I ran model_artm.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=40) and got:

C:\Coursera\Anaconda2\lib\site-packages\protobuf-2.5.1rc0-py2.7.egg\google\protobuf\internal\type_checkers.pyc in CheckValue(self, proposed_value)
    126 'encoding. Non-ASCII strings must be converted to '
    127 'unicode objects before being added.' %
--> 128 (proposed_value))

ValueError: 'c:\Coursera\week4\school_batches\aaaaaa.batch' has type str, but isn't in 7-bit ASCII encoding. Non-ASCII strings must be converted to unicode objects before being added.

What could be the problem?
opened by ValeraSarapas 13

master build fails outside of bigartm/build

While trying to build BigARTM outside of bigartm/build (i.e. somewhere like bigartm/../build) I got the following error:

Generating ./artm/wrapper/messages_pb2.py...
Traceback (most recent call last):
  File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 83, in <module>
    cmdclass = {'build': build},
  File "/usr/lib/python2.7/distutils/core.py", line 151, in setup
    dist.run_commands()
  File "/usr/lib/python2.7/distutils/dist.py", line 953, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
    cmd_obj.run()
  File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 69, in run
    "./artm/wrapper/messages_pb2.py")
  File "/home/omtcyf0/Documents/dev/src/bigartm/python/setup.py", line 51, in generate_proto_files
    if subprocess.call(protoc_command):
  File "/usr/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
AttributeError: 'NoneType' object has no attribute 'rfind'

Ubuntu 15.10, Python 2.7.10

opened by kirillbobyrev 12

make install fails on ubuntu

error: Setup script exited with error: Command "x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -Inumpy/core/include -Ibuild/src.linux-x86_64-2.7/numpy/core/include/numpy -I/usr/include/python2.7 -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -Ibuild/src.linux-x86_64-2.7/numpy/core/src/private -c build/src.linux-x86_64-2.7/numpy/core/src/npymath/ieee754.c -o build/temp.linux-x86_64-2.7/build/src.linux-x86_64-2.7/numpy/core/src/npymath/ieee754.o" failed with exit status 1
numpy/core/src/npymath/ieee754.c.src:7:29: fatal error: npy_math_common.h: No such file or directory

And I don't even need numpy, I already installed it by pip install AND by anaconda :(
build

opened by vadamoto 11
Enable running bigartm-cli as standalone executable.

It works on Linux (Fedora 21) without Intel MKL. Before merging, it's strongly recommended to test on Windows/Linux with/without Intel MKL (I don't have Windows or Intel MKL binaries :( )
[PR: review OK]

opened by JeanPaulShapo 11
Error during MAKE on Ubuntu 16

I ran make and got:

[ 93%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/master_model_test.cc.o
In file included from /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/master_model_test.cc:7:0:
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h: In instantiation of ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int]’:
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18898:30: required from ‘static testing::AssertionResult testing::internal::EqHelper<lhs_is_null_literal>::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int; bool lhs_is_null_literal = false]’
/home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/master_model_test.cc:110:5: required from here
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18861:16: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 if (expected == actual) {
 ^
[ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/multiple_classes_test.cc.o
[ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/regularizers_test.cc.o
[ 94%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/scores_test.cc.o
/home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc: In member function ‘virtual void Scores_ScoreTrackerExportImport_Test::TestBody()’:
/home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc:165:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 for (size_t i = 0; i < nPasses; ++i) {
 ^
/home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/scores_test.cc:172:24: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 for (size_t i = 0; i < nPasses; ++i) {
 ^
[ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/repeatable_result_test.cc.o
In file included from /home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/repeatable_result_test.cc:4:0:
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h: In instantiation of ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int]’:
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18898:30: required from ‘static testing::AssertionResult testing::internal::EqHelper<lhs_is_null_literal>::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int; T2 = int; bool lhs_is_null_literal = false]’
/home/vshebuniayeu/bigARTM/bigartm/src/artm_tests/repeatable_result_test.cc:72:3: required from here
/home/vshebuniayeu/bigARTM/bigartm/3rdparty/gtest/fused-src/gtest/gtest.h:18861:16: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 if (expected == actual) {
 ^
[ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/supcry_test.cc.o
[ 95%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/template_manager_test.cc.o
[ 96%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/test_mother.cc.o
[ 96%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/thread_safe_holder_test.cc.o
[ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/topic_seg_test.cc.o
[ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/__/__/3rdparty/gtest/fused-src/gtest/gtest_main.cc.o
[ 97%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/__/__/3rdparty/gtest/fused-src/gtest/gtest-all.cc.o
[ 98%] Building CXX object src/artm_tests/CMakeFiles/artm_tests.dir/__/artm/cpp_interface.cc.o
[ 98%] Linking CXX executable ../../bin/artm_tests
CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function `CollectionParser_MatrixMarket_Test::TestBody()':
collection_parser_test.cc:(.text+0x760): undefined reference to `boost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)'
CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function `CollectionParser_VowpalWabbit_Test::TestBody()':
collection_parser_test.cc:(.text+0xdea): undefined reference to `boost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)'
CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function `CollectionParser_UciBagOfWords_Test::TestBody()':
collection_parser_test.cc:(.text+0x29c0): undefined reference to `boost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)'
CMakeFiles/artm_tests.dir/collection_parser_test.cc.o: In function `CollectionParser_Multiclass_Test::TestBody()':
collection_parser_test.cc:(.text+0x33d2): undefined reference to `boost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)'
../../lib/libartm-static.a(helpers.cc.o): In function `artm::core::Helpers::ListAllBatches(boost::filesystem::path const&)':
helpers.cc:(.text+0x1124): undefined reference to `boost::filesystem::path_traits::dispatch(boost::filesystem::directory_entry const&, std::string&)'
collect2: error: ld returned 1 exit status
src/artm_tests/CMakeFiles/artm_tests.dir/build.make:606: recipe for target 'bin/artm_tests' failed
make[2]: *** [bin/artm_tests] Error 1
CMakeFiles/Makefile2:661: recipe for target 'src/artm_tests/CMakeFiles/artm_tests.dir/all' failed
make[1]: *** [src/artm_tests/CMakeFiles/artm_tests.dir/all] Error 2
Makefile:138: recipe for target 'all' failed
make: *** [all] Error 2

question

opened by vladimircape 10
Fix problem with newline in vocab files
@ofrei

This PR intends to fix #519.

The new vocab file must not end with a newline.

Is it a good idea to have identical code in src/artm/core/dictionary.cc and src/artm/core/collection_parser.cc?

[PR: review OK]
opened by JeanPaulShapo 10
Bump protobuf from 3.0.0 to 3.18.3 in /docs
Bumps protobuf from 3.0.0 to 3.18.3.

Release notes

Sourced from protobuf's releases.

Protocol Buffers v3.18.3

C++

Reduce memory consumption of MessageSet parsing

This release addresses a Security Advisory for C++ and Python users

Protocol Buffers v3.16.1

Java

Improve performance characteristics of UnknownFieldSet parsing (#9371)

Protocol Buffers v3.18.2

Java

Improve performance characteristics of UnknownFieldSet parsing (#9371)

Protocol Buffers v3.18.1

Python

Update setup.py to reflect that we now require at least Python 3.5 (#8989)

Performance fix for DynamicMessage: force GetRaw() to be inlined (#9023)

Ruby

Update ruby_generator.cc to allow proto2 imports in proto3 (#9003)

Protocol Buffers v3.18.0

C++

Fix warnings raised by clang 11 (#8664)

Make StringPiece constructible from std::string_view (#8707)

Add missing capability attributes for LLVM 12 (#8714)

Stop using std::iterator (deprecated in C++17). (#8741)

Move field_access_listener from libprotobuf-lite to libprotobuf (#8775)

Fix #7047 Safely handle setlocale (#8735)

Remove deprecated version of SetTotalBytesLimit() (#8794)

Support arena allocation of google::protobuf::AnyMetadata (#8758)

Fix undefined symbol error around SharedCtor() (#8827)

Fix default value of enum(int) in json_util with proto2 (#8835)

Better Smaller ByteSizeLong

Introduce event filters for inject_field_listener_events

Reduce memory usage of DescriptorPool

For lazy fields copy serialized form when allowed.

Re-introduce the InlinedStringField class

v2 access listener

Reduce padding in the proto's ExtensionRegistry map.

GetExtension performance optimizations

Make tracker a static variable rather than call static functions

Support extensions in field access listener

Annotate MergeFrom for field access listener

Fix incomplete types for field access listener

Add map_entry/new_map_entry to SpecificField in MessageDifferencer. They record the map items which are different in MessageDifferencer's reporter.

Reduce binary size due to fieldless proto messages

TextFormat: ParseInfoTree supports getting field end location in addition to start.

... (truncated)

Commits

a902b39 No-op whitespace change

ae62acd Updating version.json and repo version numbers to: 18.3

f43ac49 Merge pull request #10542 from deannagarcia/3.18.x

9efdf55 Add missing includes

d1635e1 Apply patch

5b37c91 Update version.json with "lts": true (#10534)

c39d622 Merge pull request #10529 from protocolbuffers/deannagarcia-patch-5

f77d3b6 Update version.json

8178b06 Merge pull request #10503 from deannagarcia/3.18.x

24ca839 Add version file

Additional commits viewable in compare view


dependencies
opened by dependabot[bot] 0
Fix README:

Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
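For code that has to run against both older and newer scikit-learn releases, one option is a small compatibility shim. This is a hedged sketch rather than the change made in this pull request: `feature_names` is a hypothetical helper, and the two stand-in classes exist only to exercise it without importing scikit-learn.

```python
def feature_names(vectorizer):
    """Return the vocabulary as a list of str, whichever method the vectorizer offers."""
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        # Older scikit-learn: only get_feature_names() is available.
        getter = vectorizer.get_feature_names
    return [str(token) for token in getter()]

class NewStyleVectorizer:
    """Stand-in for a vectorizer that has get_feature_names_out()."""
    def get_feature_names_out(self):
        return ("alpha", "beta")

class OldStyleVectorizer:
    """Stand-in for a vectorizer that only has get_feature_names()."""
    def get_feature_names(self):
        return ["alpha", "beta"]

new_names = feature_names(NewStyleVectorizer())
old_names = feature_names(OldStyleVectorizer())
```

With a real CountVectorizer the same call would be feature_names(cv), and the returned list can be passed as the vocabulary argument.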

opened by fkrasnov 0
Fix the warning of scipy

/usr/local/lib/python3.8/dist-packages/artm/batches_utils.py:227: DeprecationWarning: Please use spmatrix from the scipy.sparse namespace, the scipy.sparse.base namespace is deprecated.
  from scipy.sparse.base import spmatrix

opened by fkrasnov 0
nan in Perplexity for big file with train data

Hi!

I use bigartm==0.9.2 to train my topic model, and I ran into some unexpected behaviour of the Perplexity score. I have a very big file with training data, and I got nan in the Perplexity score after training my model. I found that if the training file contains more than 1000 documents I get nan, and when it contains 1000 or fewer I get a numerical value. I tried bigartm==0.10.2, but it doesn't help.

I also tried batch_size=100, and that doesn't help either. But when I set batch_size=100000 or batch_size=10000, I get a numerical value of the Perplexity score.

I'm happy that I solved my problem, but this behaviour is unexpected. It now turns out that I should use a very big batch_size. My training data is getting bigger over time, so I want to know in advance what batch_size I should use to avoid nan in the Perplexity score in the future. Which batch_size do you recommend? Should I use a batch_size equal to the number of documents in my training collection?

opened by Guince 1
vocab.kos.txt does not exist ('interactive python interface' Example from ReadMe)

  File "git_script.py", line 18, in <module>
    bv = artm.BatchVectorizer(data_format='bow_uci',
  File "/home/roman/.local/lib/python3.8/site-packages/artm/batches_utils.py", line 113, in __init__
    self._parse_uci_or_vw(data_weight=data_weight,
  File "/home/roman/.local/lib/python3.8/site-packages/artm/batches_utils.py", line 191, in _parse_uci_or_vw
    lib.ArtmParseCollection(parser_config)
  File "/home/roman/.local/lib/python3.8/site-packages/artm/wrapper/api.py", line 161, in artm_api_call
    self._check_error(result)
  File "/home/roman/.local/lib/python3.8/site-packages/artm/wrapper/api.py", line 97, in _check_error
    raise exception_class(error_message)
artm.wrapper.exceptions.DiskReadException: File vocab.kos.txt does not exist.

opened by RomanAvdeev 0
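The exception in this report says that the vocabulary file could not be found when the collection was parsed. A minimal sketch of a pre-flight check follows, assuming the usual UCI bag-of-words naming of docword.<name>.txt and vocab.<name>.txt. The helper missing_uci_files and its parameters are hypothetical and are not part of the artm package.

```python
import os


def missing_uci_files(data_path, collection_name):
    """List the UCI bag-of-words files that are absent from data_path.

    Assumes the conventional file names docword.<name>.txt and
    vocab.<name>.txt for a collection called <name>.
    """
    expected = [
        "docword.{}.txt".format(collection_name),
        "vocab.{}.txt".format(collection_name),
    ]
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(data_path, name))
    ]
```

Calling missing_uci_files('.', 'kos') before constructing the BatchVectorizer, and stopping when the returned list is not empty, reports the problem before the parser is invoked.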

Releases(v0.10.1)

v0.10.1(Dec 30, 2019)
Several bugfixes related to transactions, loading and disposing models

Now it is possible to install artm with pip on Linux systems: pip install bigartm10

Source code(tar.gz)
Source code(zip)
v0.9.2(Dec 30, 2019)
Several bugfixes related to loading and disposing models

Now supports Python 3.7 and IPython consoles

Now it is possible to install artm with pip on Linux systems: pip install bigartm

We will be maintaining two versions (0.9.x and 0.10.x) of BigARTM in parallel, at least for the foreseeable future.
Source code(tar.gz)
Source code(zip)
v0.10.0(Feb 25, 2019)
Bug fixes in co-occurrence gathering tool

Fixing bugs with AppVeyor-mingw

Support Python 3.7

For more details refer to Release Notes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
v0.9.1(Feb 24, 2019)
Minor bug fixes and small new features

Add support for transactions

For more details refer to Release Notes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
v0.9.0(Nov 4, 2017)
There are some changes besides numerous bug fixes and small improvements:

Add support for saving and loading an entire model via dump_artm_model and load_artm_model

Basic support for building and installing platform-dependent Python wheels

VisARTM documentation has moved to its own website; only a link to it remains here

IMPORTANT: A few Python dependencies were added: scipy (required) and wheel (optional, for building the Python package). Please install them with pip install.

For more details refer to Release Notes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
BigARTM_v0.9.0_win64.7z(4.84 MB)
v0.8.3(Feb 13, 2017)
Minor bug fixes and small new features

Full compatibility with v0.8.2 API

Our project has received its own DOI, so it is now possible to cite it :)

IMPORTANT: There are significant changes in the layout and structure of documentation (e.g. section about web-based visualization called VisARTM was added).

For more details refer to Release Notes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
BigARTM_v0.8.3_win64.7z(4.66 MB)
v0.8.2(Dec 10, 2016)
Add support for Python 3 (while continuing to support Python 2)

Numerous bug fixes and small new features

Full compatibility with v0.8.1 API

IMPORTANT: BigARTM v0.8.2 adds a few Python dependencies. Please install them with pip install.

tqdm library; to install use pip install tqdm

protobuf-3.0.0 library; to install use pip install protobuf==3.0.0

For more details refer to Release Notes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
BigARTM_v0.8.2_win64.7z(4.70 MB)
v0.8.1(Jun 21, 2016)

The recommended package for most users is vs12_win64_RelWithDebInfo.7z. Please refer to Release Notes to see the list of the changes. If you have any questions, please ask [email protected].
Source code(tar.gz)
Source code(zip)
BigARTM_v0.8.1_vs11_win32_debug.7z(4.34 MB)
BigARTM_v0.8.1_vs11_win32_RelWithDebInfo.7z(3.45 MB)
BigARTM_v0.8.1_vs11_win64_debug.7z(5.40 MB)
BigARTM_v0.8.1_vs11_win64_RelWithDebInfo.7z(3.96 MB)
BigARTM_v0.8.1_vs12_win32_debug.7z(4.24 MB)
BigARTM_v0.8.1_vs12_win32_RelWithDebInfo.7z(3.41 MB)
BigARTM_v0.8.1_vs12_win64_debug.7z(5.28 MB)
BigARTM_v0.8.1_vs12_win64_RelWithDebInfo.7z(3.92 MB)
v0.8.0(May 6, 2016)
From BigARTM v0.8.0 and onward the Release Notes are moved into the documentation:

Changes in Python API

Changes in Protobuf Messages

Changes in BigARTM CLI

Changes in c_interface

The main focus of the v0.8.0 release was a minor cleanup of the APIs: renaming methods for better consistency and moving parameters to where they belong. We also fixed messages.proto to compile with C#. The biggest conceptual change is in the Python API, where we introduced a new class, Dictionary. This change, however, is still very straightforward.

We strongly recommend reviewing the Release Notes for the APIs that you are using, as they describe all renames and other minor changes that we made in the APIs. If you have any questions, please ask [email protected].

The recommended package for most users is vs12_win64_RelWithDebInfo.7z.
Source code(tar.gz)
Source code(zip)
BigARTM_v0.8.0_vs11_win32_debug.7z(4.33 MB)
BigARTM_v0.8.0_vs11_win32_RelWithDebInfo.7z(3.46 MB)
BigARTM_v0.8.0_vs11_win64_debug.7z(5.40 MB)
BigARTM_v0.8.0_vs11_win64_RelWithDebInfo.7z(3.96 MB)
BigARTM_v0.8.0_vs12_win32_debug.7z(4.24 MB)
BigARTM_v0.8.0_vs12_win32_RelWithDebInfo.7z(3.41 MB)
BigARTM_v0.8.0_vs12_win64_debug.7z(5.27 MB)
BigARTM_v0.8.0_vs12_win64_RelWithDebInfo.7z(3.92 MB)
v0.7.6(May 6, 2016)

This is a minor release mostly containing bug fixes. For all changes see the diff https://github.com/bigartm/bigartm/compare/v0.7.5...v0.7.6

The recommended package for most users is vs12_win64_RelWithDebInfo.7z.

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.6_vs11_win32_debug.7z(4.45 MB)
BigARTM_v0.7.6_vs11_win32_RelWithDebInfo.7z(3.55 MB)
BigARTM_v0.7.6_vs11_win64_debug.7z(5.55 MB)
BigARTM_v0.7.6_vs11_win64_RelWithDebInfo.7z(4.09 MB)
BigARTM_v0.7.6_vs12_win32_debug.7z(4.35 MB)
BigARTM_v0.7.6_vs12_win32_RelWithDebInfo.7z(3.52 MB)
BigARTM_v0.7.6_vs12_win64_debug.7z(5.44 MB)
BigARTM_v0.7.6_vs12_win64_RelWithDebInfo.7z(4.04 MB)
v0.7.5(Mar 27, 2016)

This is a minor release mostly containing bug fixes. For all changes see the diff https://github.com/bigartm/bigartm/compare/v0.7.4...v0.7.5

The recommended package for most users is vs12_win64_RelWithDebInfo.7z.

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.5_vs11_win32_debug.7z(4.46 MB)
BigARTM_v0.7.5_vs11_win32_RelWithDebInfo.7z(3.55 MB)
BigARTM_v0.7.5_vs11_win64_debug.7z(5.55 MB)
BigARTM_v0.7.5_vs11_win64_RelWithDebInfo.7z(4.08 MB)
BigARTM_v0.7.5_vs12_win32_debug.7z(4.36 MB)
BigARTM_v0.7.5_vs12_win32_RelWithDebInfo.7z(3.51 MB)
BigARTM_v0.7.5_vs12_win64_debug.7z(5.43 MB)
BigARTM_v0.7.5_vs12_win64_RelWithDebInfo.7z(4.06 MB)
v0.7.4(Feb 6, 2016)
Please refer to BigARTM v0.7.4 Release notes.

We are introducing bigartm/stable branch to split main development (master) and hot fixes (stable). Every release we will bring stable forward to master.

New low-level API ArtmMasterModel (c_interface.h) gives simpler APIs to infer topic models and apply them to new data

We've done a major rework around dictionaries. Beware of API breaking changes and format changes!

Option for static linkage of bigartm CLI on Linux (on by default)

Option to install BigARTM python interface via python setup.py install

Other fixes and improvements in python API, C++ API and BigARTM CLI.

For all changes see the diff https://github.com/bigartm/bigartm/compare/v0.7.3...v0.7.4

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.4_vs11_win32_debug.7z(4.44 MB)
BigARTM_v0.7.4_vs11_win32_RelWithDebInfo.7z(3.54 MB)
BigARTM_v0.7.4_vs11_win64_debug.7z(5.53 MB)
BigARTM_v0.7.4_vs11_win64_RelWithDebInfo.7z(4.07 MB)
BigARTM_v0.7.4_vs12_win32_debug.7z(4.34 MB)
BigARTM_v0.7.4_vs12_win32_RelWithDebInfo.7z(3.50 MB)
BigARTM_v0.7.4_vs12_win64_debug.7z(5.41 MB)
BigARTM_v0.7.4_vs12_win64_RelWithDebInfo.7z(4.03 MB)
v0.7.3(Oct 28, 2015)
Please refer to BigARTM v0.7.3 Release notes.

New CLI for BigARTM (bigartm on Linux, bigartm.exe on Windows)

Support for classification in BigARTM CLI

Support for asynchronous processing of batches

Improvements in coherence regularizer and coherence score

New TopicMass score for phi matrix

Support for documents markup (aka ptdw matrices)

New API for importing batches through memory

Other changes in https://github.com/bigartm/bigartm/compare/v0.7.2...v0.7.3

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.3_win32.7z(3.46 MB)
BigARTM_v0.7.3_x64.7z(4.00 MB)
v0.7.2(Sep 9, 2015)
Please refer to BigARTM v0.7.2 Release notes.

Enhancements in high-level python API (ArtmModel -> ARTM)

Enhancements in low-level python API (library.py -> master_component.py)

Enhancements in CLI interface (cpp_client)

New feature: status retrieval from BigARTM

New feature: float token counts (token_count -> token_weight)

Allow custom weights for each batch (ProcessBatchesArgs.batch_weight)

Bug fixes and cleanup in the online documentation

Other changes in https://github.com/bigartm/bigartm/compare/v0.7.1...v0.7.2

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.2_win32.7z(6.49 MB)
BigARTM_v0.7.2_x64.7z(6.95 MB)
v0.7.1(Jul 13, 2015)
Please refer to BigARTM v0.7.1 Release notes.

New features

BigARTM notebooks in English and in Russian --- new source of information about BigARTM

ArtmModel --- a brand new Python API, documented in English and in Russian

Much faster retrieval of Phi and Theta matrices from Python

Much faster dictionary imports from Python

Auto-detect and use all CPU cores by default

Fixed Import/Export of topic models (was broken in v0.7.0)

New capability to implement Phi-regularizers in Python code

Improvements in Coherence score

Other changes in https://github.com/bigartm/bigartm/compare/v0.7.0...v0.7.1

New examples

example20_attach_model.py

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.1_win32.7z(7.21 MB)
BigARTM_v0.7.1_x64.7z(6.78 MB)
v0.7.0(Jun 9, 2015)
Please refer to BigARTM v0.7.0 Release notes.

New features

New-style models (ProcessBatches / MergeModel / RegularizeModel / NormalizeModel APIs)

Network modus operandi has been removed. Let us know if you need it back!

Coherence regularizer and scores (experimental)

Other changes in https://github.com/bigartm/bigartm/compare/v0.6.4...v0.7.0

New examples

example17_process_batches.py

example18_merge_model.py

example19_regularize_model.py

Questions? Ask [email protected]!
Source code(tar.gz)
Source code(zip)
BigARTM_v0.7.0_win32.7z(6.88 MB)
BigARTM_v0.7.0_x64.7z(8.66 MB)
v0.6.4(May 4, 2015)
Changes in https://github.com/bigartm/bigartm/compare/v0.6.3...v0.6.4

New option to export and import a topic model in binary format. Example is available in example15_import_export_topic_model

New option of relative regularization for Phi matrix

Remove MasterProxy feature (please let us know if you need it back!)

Source code(tar.gz)
Source code(zip)
BigARTM_v0.6.4_win32.7z(8.36 MB)
BigARTM_v0.6.4_x64.7z(9.66 MB)
v0.6.3(Apr 12, 2015)
Changes in https://github.com/bigartm/bigartm/compare/v0.6.2...v0.6.3

New option to initialize topic model based on folders with batches. Example is available in example14_initialize_topic_model

New options to retrieve and restore topic model into BigARTM library. Example is available in example13_overwrite_topic_model

New regularizer LabelRegularizationPhi is added

Better error and info logging, and better validation of input data

Check documentation for new fields:

ModelConfig.use_new_tokens

TopicModel.operation_type

GetTopicModelArgs.request_type

ModelConfig.score_name (deprecated option)

Source code(tar.gz)
Source code(zip)
BigARTM_v0.6.3_win32.7z(7.50 MB)
BigARTM_v0.6.3_x64.7z(8.20 MB)
v0.6.2(Mar 27, 2015)
Bug fix:

#171 Ensure repeatable results even without explicit phi initialization

Source code(tar.gz)
Source code(zip)
BigARTM_v0.6.2_win32.7z(8.32 MB)
BigARTM_v0.6.2_x64.7z(10.05 MB)
v0.6.1(Mar 23, 2015)
Changes since v0.6.0

#168 New options GetTopicModelArgs.use_sparse_format and GetThetaMatrixArgs.use_sparse_format for efficient retrieval of sparse TopicModel and ThetaMatrix messages

#167 Slow performance in GetTopicModel

#166 Inconsistent TopicCount field returned by ArtmGetTopicModel and ArtmGetThetaMatrix

Source code(tar.gz)
Source code(zip)
BigARTM_v0.6.1_win32.7z(7.33 MB)
BigARTM_v0.6.1_x64.7z(7.84 MB)
v0.6.0(Mar 19, 2015)
Changes since v0.5.9

Bugfix for better error and exception handling

More flexible options to calculate perplexity in multimodal configuration

VowpalWabbit parser (undocumented)

Source code(tar.gz)
Source code(zip)
BigARTM_v0.6.0_win32.7z(7.38 MB)
BigARTM_v0.6.0_x64.7z(7.46 MB)
v0.5.9(Mar 6, 2015)
Changes since v0.5.8

Introduce AddBatchArgs.batch_file_name field

Improve example04_online_algorithm.py

Implement example09_regularizers.py

Improve Python API

Bug fixes

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.9_win32.7z(8.19 MB)
BigARTM_v0.5.9_x64.7z(8.89 MB)
v0.5.8(Feb 21, 2015)
Changes since v0.5.7

Allow manually enqueueing batches for processing from Python via the master.AddBatch() method

#131 Reduce memory footprint in TopTokens score

#135 Change ThetaSnippetScore to represent N last processed documents (N = ThetaSnippetScoreConfig.item_count)

#127 Loop across all fields in perplexity calculation

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.8_win32.7z(7.30 MB)
BigARTM_v0.5.8_x64.7z(8.51 MB)
v0.5.7(Feb 15, 2015)
Changes since v0.5.6:

Improvements in online_batch_processing mode

Simpler online option ('update_every') in cpp_client

message Field enriched with additional data types

Enhanced performance and memory usage

#111 Problem with theta_snippet_score

Build with boost 1.57.0

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.7_win32.7z(7.29 MB)
BigARTM_v0.5.7_x64.7z(9.69 MB)
v0.5.6(Jan 26, 2015)
Changes since v0.5.5:

Introduce ThetaMatrix.item_title field and populate it from Item.title

Fix perplexity score calculation for multi-class setting

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.6_win32.7z(7.16 MB)
BigARTM_v0.5.6_x64.7z(7.71 MB)
v0.5.5(Jan 21, 2015)
Changes since v0.5.4:

Simplified configurations for SmoothSparsePhi, SmoothSparseTheta and DecorrelatorPhi regularizers

Ensure repeatable results from run to run

Ability to calculate scores on a new batch (see GetScoreValueArgs.batch option in docs)

Stability fixes for online algorithm and multi-class topic models

Convenient Item.title field

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.5_win32.7z(7.16 MB)
BigARTM_v0.5.5_x64.7z(7.58 MB)
v0.5.4(Nov 23, 2014)
New features since v0.5.3:

new convenient executable (cpp_client.exe) to experiment with various BigARTM features

significantly improved performance in Processor

new wikipedia dataset available in batches format (links are in the tutorial section in online documentation)

reliability fixes for network modus operandi

new parser for input collections in MatrixMarket format (compatible with gensim)

more flexible options in GetThetaMatrix() and GetTopicModel() methods

disk caching option to improve memory usage together with cache_theta=true option

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.4_win32.7z(8.00 MB)
BigARTM_v0.5.4_x64.7z(8.55 MB)
v0.5.3(Oct 21, 2014)
New features since v0.5.2:

New GetThetaMatrixArgs.batch field allows passing a batch of items to the ArtmRequestThetaMatrix() method. This allows you to classify a new set of items with an existing topic model.

New ModelConfig.name field allows naming topic models. The ArtmReconfigureModel() method now allows changing (adding or removing) topic models.

New ArtmInitializeModel() method allows initializing a topic model based on a dictionary

New MasterComponentConfig.online_batch_processing field allows you to dynamically add data during processing. See documentation for details.

Include .pdb files (Program Debug Database) in the release to enable debugging

Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.3_win32.7z(5.99 MB)
BigARTM_v0.5.3_x64.7z(6.40 MB)
v0.5.2(Sep 28, 2014)
This is a distribution package of BigARTM for Windows.

Changes since v0.5.1

Add model.Synchronize() capability for online algorithm

Simplify python interface and add more examples in the documentation.

Compared to the previous release, you must now configure two new environment variables:

set PATH=%PATH%;C:\BigARTM\bin
set PYTHONPATH=%PYTHONPATH%;C:\BigARTM\Python

See http://docs.bigartm.org/en/latest/tutorial.html for more details.

Note that due to a large rewrite of the Python interface this release is not backward-compatible with v0.5.1, but it should be fairly simple to upgrade your Python scripts to work with v0.5.2.
Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.2_win32.7z(1.79 MB)
BigARTM_v0.5.2_x64.7z(1.99 MB)
v0.5.1(Sep 20, 2014)

This is a distribution package of BigARTM for Windows.
Source code(tar.gz)
Source code(zip)
BigARTM_v0.5.1_win32.7z(1.79 MB)
BigARTM_v0.5.1_x64.7z(1.98 MB)

