pkuseg: a toolkit for multi-domain Chinese word segmentation

Last update: Dec 29, 2022

Overview

pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports domain-specific word segmentation, and improves segmentation accuracy.

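Highlights

pkuseg has the following features:

Multi-domain segmentation. Unlike earlier general-purpose Chinese word segmentation tools, this toolkit also provides pretrained models tailored to different domains. Users can choose a model according to the domain of the text to be segmented. Pretrained models are currently available for the news, web, medicine, and tourism domains, plus a mixed-domain model. If the domain of the text is known, load the corresponding model. If it cannot be determined, the general model trained on mixed-domain data is recommended. Segmentation samples for each domain are in example.txt.
Higher segmentation accuracy. When trained and tested on the same data, pkuseg achieves higher accuracy than other segmentation toolkits.
User-trained models. Users can train models on their own newly annotated data.
Part-of-speech (POS) tagging.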

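Build and installation

Only Python 3 is currently supported.
For the best accuracy and speed, we strongly recommend updating to the latest version with pip install.

Install from PyPI (model files included):
```
pip3 install pkuseg
```
After that, use it with import pkuseg.
Updating to the latest version is recommended for a better out-of-the-box experience: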
```
pip3 install -U pkuseg
```

If downloads from the official PyPI index are slow, use a mirror, for example:
First-time installation:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg

Update:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

If you download the code from GitHub instead of installing with pip, run the following command to install:
```
python setup.py build_ext -i
```
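The GitHub code does not include pretrained models, so you need to download or train a model yourself; the pretrained models are listed under release. When using one, set "model_name" to the model file.

Note: installation methods 1 and 2 (PyPI and the mirror) currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On any other system, use installation method 3 (GitHub source) to build and install locally.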

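Performance comparison with other segmentation toolkits

We compared pkuseg with jieba and THULAC, two representative Chinese segmentation toolkits; the detailed settings are described in the experimental setup.

Domain-specific training and test results

The results on different datasets are as follows: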

MSRA	Precision	Recall	F-score
jieba	87.01	89.88	88.42
THULAC	95.60	95.91	95.71
pkuseg	96.94	96.81	96.88

WEIBO	Precision	Recall	F-score
jieba	87.79	87.54	87.66
THULAC	93.40	92.40	92.87
pkuseg	93.78	94.65	94.21

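Default models tested on different domains

Many users try a segmentation tool with the model that ships with it. To compare this "out-of-the-box" performance directly, we also tested the default model of each toolkit on different domains. Please note that this comparison only illustrates default behavior and is not necessarily fair.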

Default	MSRA	CTB8	PKU	WEIBO	All Average
jieba	81.45	79.58	81.83	83.56	81.61
THULAC	85.55	87.84	92.29	86.65	88.08
pkuseg	87.29	91.77	92.68	93.43	91.29

Here, All Average is the mean F-score over all test sets.
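As a quick check of how the All Average column relates to the per-dataset scores, the sketch below recomputes it as a plain arithmetic mean of the four F-scores in each row. The use of Decimal with half-up rounding to two places is an assumption about how the table was rounded, not something the source states.

```python
from decimal import Decimal, ROUND_HALF_UP

# F-scores copied from the default-model table (MSRA, CTB8, PKU, WEIBO).
scores = {
    "jieba": ["81.45", "79.58", "81.83", "83.56"],
    "THULAC": ["85.55", "87.84", "92.29", "86.65"],
    "pkuseg": ["87.29", "91.77", "92.68", "93.43"],
}


def average(values):
    """Arithmetic mean of decimal strings, rounded half-up to two places."""
    total = sum(Decimal(v) for v in values)
    mean = total / len(values)
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for name, values in scores.items():
    print(name, average(values))
```

With these inputs the recomputed means agree with the All Average column (81.61, 88.08, 91.29).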

More detailed comparisons are available in the comparison with existing toolkits.

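Usage

Code examples

The following examples are written for an interactive Python session.

Example 1: segmentation with the default configuration (recommended when the domain of the text is unknown)

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the sentence
print(text)

Example 2: domain-specific segmentation (recommended when the domain of the text is known)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the matching domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the sentence
print(text)

Example 3: segmentation with part-of-speech tagging; the meaning of each tag is listed in tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
text = seg.cut('我爱北京天安门')    # segment and tag the sentence
print(text)

Example 4: segmenting a file

import pkuseg

# Segment input.txt and write the result to output.txt,
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

More usage examples are available in the detailed code examples.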

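Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		Model path.
				"default": default value; uses our pretrained mixed-domain model (pip installations only).
				"news": uses the news-domain model.
				"web": uses the web-domain model.
				"medicine": uses the medicine-domain model.
				"tourism": uses the tourism-domain model.
				model_path: loads the model from a user-specified path.
	user_dict		User dictionary.
				"default": default value; uses the dictionary we provide.
				None: no dictionary is used.
				dict_path: uses a custom user dictionary in addition to the default one. Pass the path to your own dictionary file, formatted one word per line (if POS tagging is enabled and the word's tag is known, write the word and its tag on the line, separated by a tab character).
	postag			Whether to perform POS tagging.
				False: default value; segmentation only, no POS tagging.
				True: POS tagging is performed along with segmentation.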
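The dictionary format described for user_dict can be illustrated with a small helper that writes such a file. This is a minimal sketch: write_user_dict, the file name pkuseg_user_dict.txt, and the choice of UTF-8 encoding are assumptions for illustration, not part of pkuseg.

```python
import tempfile
from pathlib import Path


def write_user_dict(path, entries):
    """Write a dictionary file with one entry per line.

    Each entry is either a word (str) or a (word, tag) pair; pairs are
    written as "word<TAB>tag", matching the format described above.
    """
    lines = []
    for entry in entries:
        if isinstance(entry, str):
            lines.append(entry)
        else:
            word, tag = entry
            lines.append(f"{word}\t{tag}")
    # UTF-8 is an assumption here; the source does not state the encoding.
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Hypothetical file name, placed in the system temp directory.
dict_path = Path(tempfile.gettempdir()) / "pkuseg_user_dict.txt"
write_user_dict(dict_path, ["天安门", ("北京", "ns")])
print(dict_path.read_text(encoding="utf-8"))
```

The resulting path would then be passed as the user_dict argument, for example pkuseg.pkuseg(user_dict=str(dict_path), postag=True).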

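Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		Path of the input file.
	outputFile		Path of the output file.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		User dictionary. Same as in pkuseg.pkuseg.
	postag			Whether to enable POS tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to start for testing.

Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		Path of the training file.
	testFile		Path of the test file.
	savedir			Directory where the trained model is saved.
	train_iter		Number of training iterations.
	init_model		Model used for initialization. The default, None, uses the default initialization; to start from an existing model, pass its path, e.g. init_model='./models/'.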

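Multiprocess segmentation

When the code examples above are run from a file and multiprocessing is involved, be sure to guard the top-level statements with if __name__ == '__main__'; see multiprocess segmentation for details.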
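A minimal sketch of such a guarded script follows. The helper name segment_file and the check for input.txt are illustrative assumptions; only pkuseg.test and its nthread parameter come from the documentation above.

```python
import os


def segment_file(input_path, output_path, nthread=4):
    """Segment input_path into output_path using pkuseg's file interface."""
    # Imported here so the module can be loaded even where pkuseg is absent.
    import pkuseg

    pkuseg.test(input_path, output_path, nthread=nthread)


if __name__ == '__main__':
    # Only the process that runs this file as a script enters this branch.
    # Worker processes that re-import the module (as happens on platforms
    # that start workers by spawning a new interpreter) skip it, so they
    # do not launch further workers themselves.
    if os.path.exists('input.txt'):
        segment_file('input.txt', 'output.txt', nthread=4)
    else:
        print('input.txt not found; nothing to segment')
```

Keeping the call inside a function and invoking it only under the guard means importing the file has no side effects.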

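Pretrained models

Users who installed via pip only need to set the model_name field to the desired domain when using domain-specific segmentation; the corresponding model is downloaded automatically.

Users who downloaded the code from GitHub need to download the pretrained model themselves and set model_name to its path. Pretrained models can be downloaded from the release section. The available models are:

news: model trained on MSRA (news corpus).
web: model trained on Weibo (web text corpus).
medicine: model trained on medical-domain data.
tourism: model trained on tourism-domain data.
mixed: general model trained on a mixed dataset. This is the model bundled with the pip package.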
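For pip installations, the recommendation to use a domain model when the domain is known and the general model otherwise can be written as a small lookup. This is only a sketch; choose_model_name and DOMAIN_MODELS are hypothetical names, and the set lists just the four domain models named above.

```python
# Domain model names listed in the README for pip installations.
DOMAIN_MODELS = {"news", "web", "medicine", "tourism"}


def choose_model_name(domain=None):
    """Return a value for the model_name argument of pkuseg.pkuseg.

    Known domains map to their own model; anything else falls back to
    "default", which the README describes as the mixed-domain model.
    """
    if domain in DOMAIN_MODELS:
        return domain
    return "default"


print(choose_model_name("medicine"))
print(choose_model_name("finance"))
print(choose_model_name())
```

The returned string would be passed on as pkuseg.pkuseg(model_name=choose_model_name(domain)).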

Users are welcome to share domain-specific models they have trained.

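Version history

See the version history.

License

This code is released under the MIT License.
Comments and suggestions on the toolkit are welcome; please email [email protected].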

Citation

This toolkit is mainly based on the following paper. If you use it, please cite:

Ruixuan Luo, Jingjing Xu, Yi Zhang, Xuancheng Ren, Xu Sun. PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation. arXiv. 2019.


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
Jingjing Xu and Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL. 2016.
Jingjing Xu and Xu Sun. Transfer Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network. NLPCC. 2017.

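FAQ

Acknowledgements

Thanks to Professor Yu Shiwen (Institute of Computational Linguistics, Peking University) and Dr. Qiu Likun for providing the training datasets!

Authors

Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Bingzhen Wei (位冰镇), Xu Sun (孙栩)

Language Computing and Machine Learning Group, Peking University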

Comments

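Isn't the performance comparison with other segmentation toolkits unfair?

Were the jieba and THULAC models in the comparison trained on the corresponding training corpora (MSRA, CTB8)? If they had been, their results should not be this poor. An F-score of around 80% is almost on par with unsupervised segmentation.

Comparing pkuseg trained on in-domain data against jieba and THULAC without training on that domain is clearly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from this kind of experiment.

In fact, MSRA segmentation results reported in papers are mostly above 97.5.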

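opened by jiesutd 31

Comparing the result on a single sentence is enough to settle the contest with jieba

pkuseg:
seg = pkuseg.pkuseg()
print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

jieba:
print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

Three words are segmented wrongly in a single sentence; I don't know where the confidence to announce so loudly that it far surpasses jieba comes from ......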

opened by mendynew 8

undefined symbol: PyFPE_jbuf

ImportError: /root/anaconda3/envs/NLP/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

Ubuntu, installed with pip install pkuseg. Any ideas?

opened by LCorleone 5

What is the required encoding of the input file?

C:\Python36>python
Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pkuseg
>>>
>>> seg = pkuseg.pkuseg()           # load the model with the default configuration
>>> text = seg.cut('我爱北京天安门')  # segment the sentence
>>> print(text)
['我', '爱', '北京', '天安门']
>>> import pkuseg
>>>
>>> seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\lutao/.pkuseg\
postag.zip
100.0%
>>> text = seg.cut('我爱北京天安门')    # segment and tag the sentence
>>> print(text)
[('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
>>> import pkuseg
>>>
>>> # Segment input.txt and write the result to output.txt,
... # using 20 processes
... pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
  File "<stdin>", line 3
    pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
                                                       ^
SyntaxError: invalid syntax
>>> pkuseg.test('c:/user/lutao/downloads/0309a.txt', 'c:/user/lutao/downloads/0309a_output.txt', nthread=10)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 520, in test
    input_file, output_file, nthread, model_name, user_dict, postag, verbose
  File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 444, in _test_multi_proc
    raise Exception("input_file {} does not exist.".format(input_file))
Exception: input_file c:/user/lutao/downloads/0309a.txt does not exist.

I replaced '/' with '\'; the encoding of 0309a.txt is GBK.

>>> pkuseg.test('c:\user\lutao\downloads\0309a.txt', 'c:\user\lutao\downloads\0309a_output.txt', nthread=10)
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape

I then saved 0309a.txt as 0309b.txt in UTF-8 encoding:

>>> pkuseg.test('c:\user\lutao\downloads\0309b.txt', 'c:\user\lutao\downloads\0309b_output.txt', nthread=10)
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape

opened by l1t1 5
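The errors in this log appear to come from the path strings rather than from the file encoding. In a normal Python string literal, `\u` begins a `\uXXXX` escape, which is why 'c:\user\...' fails to parse; a raw string or forward slashes avoids that. The earlier "does not exist" message may also simply reflect the path: the download line in the same log shows the profile under C:\Users\lutao, not c:/user/lutao. The sketch below illustrates both points and adds a re-encoding helper; reencode is a hypothetical name, and the idea that UTF-8 input is what pkuseg expects is an assumption not confirmed by the text above.

```python
import tempfile
from pathlib import Path

# Source text containing 'c:\user\...' in a plain (non-raw) literal.
# Python parses "\u" as the start of a \uXXXX escape, so it cannot compile.
bad_source = "path = 'c:\\user\\lutao\\downloads\\0309a.txt'"
try:
    compile(bad_source, "<example>", "exec")
    plain_literal_compiles = True
except SyntaxError:
    plain_literal_compiles = False
print("plain literal compiles:", plain_literal_compiles)

# Two spellings that parse without trouble: a raw string or forward slashes.
raw_path = r'c:\user\lutao\downloads\0309a.txt'
slash_path = 'c:/user/lutao/downloads/0309a.txt'


def reencode(src, dst, src_encoding="gbk", dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and writing dst_encoding."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)


# Demonstration on a temporary GBK file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    gbk_file = Path(tmp) / "0309a.txt"
    utf8_file = Path(tmp) / "0309b.txt"
    gbk_file.write_text("我爱北京天安门", encoding="gbk")
    reencode(gbk_file, utf8_file)
    converted = utf8_file.read_text(encoding="utf-8")
print(converted)
```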

import fails on Python 3.6

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

import pkuseg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer as trainer
  File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

opened by tangchun 5

What is causing this problem?

length = 1 : 0
length = 2 : 2496
length = 3 : 2642
length = 4 : 2568
length = 5 : 1313
length = 6 : 633
length = 7 : 249
length = 8 : 133
length = 9 : 66
length = 10 : 16
length = 11 : 6
length = 12 : 1
length = 13 : 1

start training...

reading training & test data... done! train/test data sizes: 1/1

r: 1
iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
    trainer.train(config)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
    score_list = trainer.test(testset, i)
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
    testset, self.model, writer
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
    gold_tags, pred_tags, self.idx_to_chunk_tag
  File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
    pre = correct_chunk / res_chunk * 100
ZeroDivisionError: division by zero

opened by Fabyone 4

ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

import pkuseg

seg = pkuseg.pkuseg()

text = "我爱北京天安门"

cut = seg.cut(text)
print(cut)

Traceback (most recent call last):
  File "E:/python/work/spider/bx/piggy.py", line 1, in <module>
    import pkuseg
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in <module>
    import pkuseg.trainer
  File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in <module>
    import pkuseg.inference as _inf
  File "__init__.pxd", line 918, in init pkuseg.inference
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

opened by xhochipe 4
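The "numpy.ufunc size changed" error in this report and in the Python 3.6 import report typically indicates that a compiled extension was built against a newer NumPy than the one installed, so a common remedy is upgrading NumPy (for example with pip3 install -U numpy). Whether that resolves these particular cases is not confirmed by the thread. As a first diagnostic step, the sketch below prints the installed versions using only the standard library (Python 3.8 or later); installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions relevant to the binary-incompatibility error.
for name in ("numpy", "pkuseg"):
    print(name, installed_version(name) or "not installed")
```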

FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

I installed pkuseg and, on first use, it needed to download postag.zip. The download failed, so I downloaded the file myself and put it in the folder, but I now get the error FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

opened by hjing100 3

0.0.25 fails to install on Binder

0.0.22 installs without problems

Collecting numpy
  Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
Collecting pkuseg
  Downloading pkuseg-0.0.25.tar.gz (48.8 MB)
    ERROR: Command errored out with exit status 1:
     command: /srv/conda/envs/notebook/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-h22vfd4x
         cwd: /tmp/pip-install-5d95j8mq/pkuseg/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-5d95j8mq/pkuseg/setup.py", line 5, in <module>
        import numpy as np
    ModuleNotFoundError: No module named 'numpy'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

opened by GoooIce 3

Error when using a domain-specific model with a pip installation?

Traceback (most recent call last):
  File "py3_cook_corpus_embedding.py", line 18, in <module>
    seg = pkuseg.pkuseg(model_name='medicine')
  File "/home/work/software/anaconda3/envs/py3myhao/lib/python3.6/site-packages/pkuseg/__init__.py", line 224, in __init__
    self.feature_extractor = FeatureExtractor.load()
  File "pkuseg/feature_extractor.pyx", line 625, in pkuseg.feature_extractor.FeatureExtractor.load
FileNotFoundError: [Errno 2] No such file or directory: 'medicine/unigram_word.txt'

Also, can a custom dictionary be used together with a domain-specific model?

opened by kinghmy 3

Installation error with wsl2 + pyenv + python3.8.5.

(fastApi-env) [email protected]:/mnt/c/Users/Administrator$ pip install pkuseg
Looking in indexes: http://mirrors.aliyun.com/pypi/simple
Collecting pkuseg
  Downloading http://mirrors.aliyun.com/pypi/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
     |████████████████████████████████| 48.8 MB 79.6 MB/s
    ERROR: Command errored out with exit status 1:
     command: /home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-99zrwcbj
         cwd: /tmp/pip-install-hjb0015_/pkuseg/
    Complete output (36 lines):
    WARNING: The wheel package is not available.
    WARNING: The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host mirrors.aliyun.com'.
    ERROR: Could not find a version that satisfies the requirement cython (from versions: none)
    ERROR: No matching distribution found for cython
    Traceback (most recent call last):
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 128, in fetch_build_egg
        subprocess.check_call(cmd)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 63, in <module>
        setup_package()
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 39, in setup_package
        setuptools.setup(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 162, in setup
        _install_setup_requires(attrs)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 157, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 699, in fetch_build_eggs
        resolved_dists = pkg_resources.working_set.resolve(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 779, in resolve
        dist = best[req.key] = env.best_match(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1064, in best_match
        return self.obtain(req, installer)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1076, in obtain
        return installer(requirement)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 758, in fetch_build_egg
        return fetch_build_egg(self, req)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 130, in fetch_build_egg
        raise DistutilsError(str(e)) from e
    distutils.errors.DistutilsError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
opened by xiaxichen 2

Cannot install in a Python 3.9 environment

Dear Sirs or Madams, installation into a Python 3.9 environment failed. I checked your repository at 'https://pypi.tuna.tsinghua.edu.cn/simple/pkuseg/'. It seems that there are no Python 3.9 files there. Do you have any plan to support Python 3.9? I also saw that in other issues you pointed to a file for Python 3.9, but I cannot find it. Your reply would be highly appreciated. Tony

opened by tonydeck0506 2

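POS tagging works too well

In theory, good results are a good thing, but in actual testing it also tags nonexistent place names as place names.

import pkuseg
seg = pkuseg.pkuseg(postag=True)
text = seg.cut('广场镇是河北天津衡水冲绳东京的旧地狱和亚特兰斯地吗？')
for word, flag in text:
    if flag == 'ns':
        print(word)

The output is: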

广场镇
河北
天津
衡水
冲绳
东京
亚特兰斯

opened by axty666 1

TypeError: train() got an unexpected keyword argument 'nthread'


import pkuseg

# The training file is 'train.txt'
# The test file is 'test.txt'
# Load the model in './pretrained', save the trained model to './models', and train for 10 iterations
pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')

opened by KangChou 1

Releases(v0.0.25)

v0.0.25(Jun 6, 2022)

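Segmentation models for the science, art and culture, and entertainment and sports domains obtained with domain adaptation, plus an optimized general model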
Source code(tar.gz)
Source code(zip)
art.zip(31.71 MB)
default_v2.zip(118.10 MB)
entertainment.zip(28.64 MB)
science.zip(30.96 MB)
v0.0.16(Feb 18, 2019)

Segmentation models for the news, web, medicine, tourism, and mixed domains, plus the POS tagging model
Source code(tar.gz)
Source code(zip)
medicine.zip(45.95 MB)
mixed.zip(45.13 MB)
news.zip(41.74 MB)
postag.zip(39.50 MB)
tourism.zip(43.28 MB)
web.zip(16.66 MB)
v0.0.11(Jan 12, 2019)

Model files for MSRA, CTB8, and WEIBO
Source code(tar.gz)
Source code(zip)
ctb8.zip(41.48 MB)
msra.zip(54.34 MB)
weibo.zip(21.26 MB)

Owner

LancoPKU

Language Computing and Machine Learning Group (Xu Sun's group) at Peking University

