Chinese segmentation library

Last update: Jun 28, 2022

What is loso?

loso is a Chinese segmentation system written in Python. It was developed by Victor Lin for Plurk Inc.

Copyright & License

Setup loso

To install loso, clone the repository and run the following commands:

cd loso
python setup.py develop

You also need a running Redis server, which stores the lexicon database, and your own configuration file. Copy the configuration template and edit it:

cp default.yaml myconf.yaml
vim myconf.yaml

To use your configuration, set the environment variable LOSO_CONFIG_FILE to the path of your configuration file. For example:

LOSO_CONFIG_FILE=myconf.yaml python setup.py serve
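
The same environment variable can be set from a Python launcher script instead of the shell. The following is a minimal sketch, not part of loso: the helper names build_env and run_loso_command are hypothetical, and it assumes the script is run from the loso checkout so that setup.py is in the current directory.

```python
import os
import subprocess
import sys


def build_env(config_path, base_env=None):
    """Return a copy of the environment with LOSO_CONFIG_FILE set.

    base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LOSO_CONFIG_FILE"] = config_path
    return env


def run_loso_command(command, config_path):
    """Run `python setup.py <command>` with the given configuration file.

    Assumes the current directory is the loso checkout (hypothetical helper).
    Returns the exit code of the child process.
    """
    return subprocess.call(
        [sys.executable, "setup.py", command],
        env=build_env(config_path),
    )
```

For example, run_loso_command("serve", "myconf.yaml") would start the command shown in the "Use loso" section with the copied configuration file.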

Use loso

loso determines segmentation from the lexicon database, using an algorithm based on the Hidden Markov Model. The service therefore cannot be used until a lexicon database has been built.
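
To illustrate why a lexicon is needed before any text can be segmented, here is a toy sketch. It is not loso's code and it is not a Hidden Markov Model: it swaps in a much simpler unigram dynamic-programming segmenter, keeps the lexicon in an in-memory dict instead of Redis, and uses made-up counts. The names LEXICON and segment, the max_len limit, and the smoothing value for unknown characters are all assumptions chosen for the example.

```python
import math

# Hypothetical toy lexicon: term -> count. The numbers are made up.
LEXICON = {"太空梭": 5, "太空": 3, "發射": 4, "影片": 4, "梭": 1}


def segment(text, lexicon, max_len=4):
    """Split text into the most probable sequence of lexicon terms.

    Unigram dynamic programming: each term is scored independently by
    its relative frequency, and the best split of every prefix is kept.
    """
    # +1.0 keeps the denominator non-zero for an empty lexicon.
    total = sum(lexicon.values()) + 1.0
    # best[i] = (log-probability, start index of the last term) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            term = text[start:end]
            count = lexicon.get(term)
            if count is None:
                if end - start > 1:
                    continue  # unknown multi-character strings are not terms
                count = 0.5  # assumed smoothing for an unknown single character
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back from the end of the text to recover the chosen terms.
    terms = []
    i = len(text)
    while i > 0:
        start = best[i][1]
        terms.append(text[start:i])
        i = start
    return terms[::-1]
```

With the toy lexicon, segment("太空梭發射影片", LEXICON) returns ['太空梭', '發射', '影片']. With an empty lexicon the same call can only fall back to single characters, which is the situation before any text has been fed to the database.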

To feed a text file into the database, run:

python setup.py feed -f /home/victorlin/plurk_src/realtime_search/word_segment/sample_data/sample_tr_ch

To reset (empty) the database, run:

python setup.py reset

To test term splitting interactively, run:

python setup.py interact

For example:

Text: 留下鉅細靡遺的太空梭發射影片，供世人回味
....
留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味

To run the segmentation service as an XML-RPC service, run:

python setup.py serve

The following simple Python program shows how to use it:

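# Python 2 client: xmlrpclib and the print statement exist only in Python 2
import xmlrpclib

# Connect to the XML-RPC server started with "python setup.py serve"
proxy = xmlrpclib.ServerProxy("http://localhost:5566/")

# splitTerms returns the segmented terms as a list
terms = proxy.splitTerms(u'留下鉅細靡遺的太空梭發射影片，供世人回味')
print ' '.join(terms)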

The output should be:

留下 鉅細靡遺 的 太空梭 發射 影片 供 世人 回味
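
The client above relies on xmlrpclib, which exists only in Python 2. As a sketch of the same call from Python 3, the standard library's xmlrpc.client can be used instead. This covers only the client side; whether the loso server itself runs under Python 3 is not stated here. The helper name split_terms and the DEFAULT_URL constant are hypothetical, and the URL is simply the one from the example above.

```python
import xmlrpc.client

# Endpoint taken from the example above; adjust it to match your configuration.
DEFAULT_URL = "http://localhost:5566/"


def split_terms(text, url=DEFAULT_URL):
    """Ask a running loso XML-RPC server to segment text; return the terms."""
    # ServerProxy can be used as a context manager in Python 3.5 and later.
    with xmlrpc.client.ServerProxy(url) as proxy:
        return proxy.splitTerms(text)
```

With the server running, print(" ".join(split_terms("留下鉅細靡遺的太空梭發射影片，供世人回味"))) should print the terms separated by spaces, as in the output above.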

Owner

Fang-Pen Lin
