pkuseg: a toolkit for multi-domain Chinese word segmentation

Overview

pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports segmentation for specific domains, and improves segmentation accuracy.

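Highlights

pkuseg has the following features:

  1. Multi-domain segmentation. Unlike earlier general-purpose Chinese word segmentation tools, this toolkit also provides pre-trained models tailored to data from different domains. Users can choose a model according to the domain of the text to be segmented. Pre-trained models are currently available for the news, web, medicine, and tourism domains, plus a mixed-domain model. If the domain of the text is known, load the corresponding model. If the domain cannot be determined, the general-purpose model trained on mixed-domain data is recommended. Segmentation samples for each domain are in example.txt.
  2. Higher segmentation accuracy. Given the same training and test data, pkuseg achieves higher segmentation accuracy than other segmentation toolkits.
  3. User-trained models. Users can train models on entirely new annotated data.
  4. Part-of-speech (POS) tagging.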

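Build and installation

  • Only Python 3 is currently supported.
  • For the best accuracy and speed, we strongly recommend updating to the latest version via pip install.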
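  1. Install from PyPI (model files included):

    pip3 install pkuseg

    Then use it with import pkuseg.

    Updating to the latest version is recommended for a better out-of-the-box experience:

    pip3 install -U pkuseg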
    
  2. If downloads from the official PyPI index are slow, use a mirror, for example:
    First-time installation:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg

    Update:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
    
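  3. To install without pip, download the source from GitHub and run the following command:

    python setup.py build_ext -i

    The GitHub code does not include pre-trained models, so users need to download or train a model themselves; the pre-trained models are listed under release. When using pkuseg, set "model_name" to the model file.

Note: installation methods 1 and 2 currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On other systems, use method 3 to build and install locally.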

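Performance comparison with other segmentation toolkits

We compare pkuseg with representative Chinese segmentation toolkits such as jieba and THULAC. The detailed settings are described under Experimental environment.

Domain-specific training and test results

The results on different datasets are as follows: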

MSRA      Precision   Recall   F-score
jieba     87.01       89.88    88.42
THULAC    95.60       95.91    95.71
pkuseg    96.94       96.81    96.88

WEIBO     Precision   Recall   F-score
jieba     87.79       87.54    87.66
THULAC    93.40       92.40    92.87
pkuseg    93.78       94.65    94.21
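
For readers unfamiliar with the three columns, the following is a minimal sketch of how word-level precision, recall, and F-score are commonly computed for segmentation. It is not the evaluation script used for the tables above; the function names to_spans and prf and the sample segmentations are hypothetical, chosen only for illustration.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_words, pred_words):
    """Return (precision, recall, F-score) in percent for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries in both
    precision = 100.0 * correct / len(pred) if pred else 0.0
    recall = 100.0 * correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted segmentations of the same sentence.
gold = ['我', '爱', '北京', '天安门']
pred = ['我', '爱', '北京', '天安', '门']
p, r, f = prf(gold, pred)
print(f'P={p:.2f} R={r:.2f} F={f:.2f}')
```

A word counts as correct only when both of its boundaries match the gold segmentation, which is why splitting one gold word into two lowers precision and recall at the same time.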

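Out-of-the-box results of the default models in different domains

Many users try a segmentation tool with the model bundled in the toolkit. To compare this "initial" performance directly, we also evaluated each toolkit's default model on test sets from different domains. Note that this comparison only illustrates default behavior and is not necessarily fair.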

Default   MSRA    CTB8    PKU     WEIBO   All Average
jieba     81.45   79.58   81.83   83.56   81.61
THULAC    85.55   87.84   92.29   86.65   88.08
pkuseg    87.29   91.77   92.68   93.43   91.29

Here, All Average is the average F-score over all test sets.
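
As a quick arithmetic check, the All Average column can be reproduced from the four per-dataset scores in the table above. The sketch below uses Python's decimal module so that rounding to two places is exact; the variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the default-model table (MSRA, CTB8, PKU, WEIBO).
scores = {
    'jieba': ['81.45', '79.58', '81.83', '83.56'],
    'THULAC': ['85.55', '87.84', '92.29', '86.65'],
    'pkuseg': ['87.29', '91.77', '92.68', '93.43'],
}

averages = {}
for name, values in scores.items():
    mean = sum(Decimal(v) for v in values) / len(values)
    # Round half up to two decimal places, as in the table.
    averages[name] = mean.quantize(Decimal('0.01'), rounding=ROUND_HALF_UP)
    print(name, averages[name])
```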

More detailed comparisons are given in Comparison with existing toolkits.

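Usage

Code examples

The following code examples are intended for the Python interactive environment.

Code example 1: segment with the default configuration (recommended when the domain of the text cannot be determined)

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the text
print(text)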

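Code example 2: domain-specific segmentation (recommended when the domain of the text is known)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the text
print(text)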

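Code example 3: segmentation with part-of-speech tagging; the meaning of each tag is described in tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
text = seg.cut('我爱北京天安门')    # segment the text and tag parts of speech
print(text)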
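
With postag=True, cut returns (word, tag) pairs rather than plain strings. The following is a minimal sketch of post-processing such output. To stay runnable without downloading the POS model, it uses a literal list in that format as a stand-in for the real result (the actual tags depend on the model), and the helper name words_with_tag is hypothetical.

```python
def words_with_tag(tagged, tag):
    """Return the words whose POS tag equals `tag`, in their original order."""
    return [word for word, word_tag in tagged if word_tag == tag]


# Stand-in for the result of seg.cut('我爱北京天安门') with postag=True.
tagged = [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]

# Keep only the words tagged 'ns'; see tags.txt for what each tag means.
print(words_with_tag(tagged, 'ns'))
```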

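Code example 4: segment a file

import pkuseg

# Segment input.txt and write the result to output.txt
# using 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

Further usage examples are given in the detailed code examples.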

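Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		Model path.
				"default", the default; uses our pre-trained mixed-domain model (pip installations only).
				"news", uses the news-domain model.
				"web", uses the web-domain model.
				"medicine", uses the medicine-domain model.
				"tourism", uses the tourism-domain model.
				model_path, loads a model from a user-specified path.
	user_dict		User dictionary setting.
				"default", the default; uses the dictionary we provide.
				None, no dictionary is used.
				dict_path, uses a user-defined dictionary in addition to the default dictionary. Pass the path to your own dictionary file, which contains one word per line (if POS tagging is enabled and the word's part of speech is known, write the word and its tag on the line, separated by a tab character).
	postag			Whether to perform POS tagging.
				False, the default; segmentation only, no POS tagging.
				True, POS tagging is performed along with segmentation.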
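
The dictionary format described above (one word per line, optionally followed by a tab and a POS tag) can be produced with a few lines of Python. This is a sketch under assumptions: the file name my_dict.txt, the helper write_user_dict, and the sample entries are hypothetical, and UTF-8 is assumed as the file encoding, which the text above does not specify.

```python
import os
import tempfile


def write_user_dict(path, entries):
    """Write entries to `path`, one per line.

    Each entry is either a word or a (word, tag) pair; pairs are written
    as word<TAB>tag.
    """
    with open(path, 'w', encoding='utf-8') as f:
        for entry in entries:
            if isinstance(entry, tuple):
                f.write('\t'.join(entry) + '\n')
            else:
                f.write(entry + '\n')


# Hypothetical entries: one plain word and one word with a POS tag.
dict_path = os.path.join(tempfile.mkdtemp(), 'my_dict.txt')
write_user_dict(dict_path, ['天安门', ('北京', 'ns')])

with open(dict_path, encoding='utf-8') as f:
    print(repr(f.read()))

# The path would then be passed as: pkuseg.pkuseg(user_dict=dict_path)
```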

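File segmentation

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		Input file path.
	outputFile		Output file path.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		User dictionary setting. Same as in pkuseg.pkuseg.
	postag			Whether to enable POS tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to use during testing.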

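Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		Training file path.
	testFile		Test file path.
	savedir			Directory in which the trained model is saved.
	train_iter		Number of training iterations.
	init_model		Model used for initialization. The default, None, uses the default initialization; users can pass the path of a model to initialize from, e.g. init_model='./models/'.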

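Multi-process segmentation

When the code examples above are run from a file and use the multi-process features, the global statements must be protected with if __name__ == '__main__'. See Multi-process segmentation for details.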
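
A minimal sketch of such a guarded script is shown below. The file names input.txt and output.txt, the wrapper segment_file, and nthread=4 are placeholders chosen for illustration; only pkuseg.test and its nthread parameter come from this README.

```python
import os


def segment_file(input_path, output_path, nthread=4):
    """Segment input_path into output_path with pkuseg.test."""
    import pkuseg  # imported lazily; assumes pkuseg is installed
    pkuseg.test(input_path, output_path, nthread=nthread)


if __name__ == '__main__':
    # Only the process that runs this file as a script enters this branch.
    # Worker processes that re-import the file skip it, so they do not
    # start segmentation jobs of their own.
    if os.path.exists('input.txt'):
        segment_file('input.txt', 'output.txt')
    else:
        print('input.txt not found; nothing to segment')
```

Keeping the call inside the guard matters most on platforms where new processes are started by re-importing the main module, such as Windows.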

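Pre-trained models

Users who installed via pip only need to set the model_name field to the desired domain when using domain-specific segmentation; the corresponding model is downloaded automatically.

Users who downloaded the code from GitHub need to download the pre-trained model themselves and set the model_name field to the path of that model. Pre-trained models can be downloaded from the release section. The pre-trained models are:

  • news: model trained on MSRA (news corpus).

  • web: model trained on Weibo (web text corpus).

  • medicine: model trained on medical-domain data.

  • tourism: model trained on tourism-domain data.

  • mixed: general-purpose model trained on a mixed dataset. This is the model bundled with the pip package.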
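
The two cases above (a domain name for pip installations, a filesystem path for GitHub installations) can be told apart before a model is loaded. The helper below is a hypothetical sketch and not part of pkuseg: check_model_name and DOMAIN_NAMES are names invented here, and the set of names is taken from the model_name values listed in the parameter description.

```python
import os

# Values of model_name that this README lists as built-in choices.
DOMAIN_NAMES = {'default', 'news', 'web', 'medicine', 'tourism'}


def check_model_name(model_name):
    """Classify model_name as 'domain' or 'path', or raise if it is neither."""
    if model_name in DOMAIN_NAMES:
        return 'domain'
    if os.path.exists(model_name):
        return 'path'
    raise FileNotFoundError(
        f'{model_name!r} is neither a known domain nor an existing path'
    )


print(check_model_name('medicine'))   # a built-in domain name
print(check_model_name(os.getcwd()))  # an existing path on disk
```

Checking up front gives a clearer error than a missing-file exception raised from inside model loading.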

Users are welcome to share domain-specific models they have trained.

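Version history

See Version history.

License

  1. This code is released under the MIT License.
  2. Comments and suggestions about this toolkit are welcome; please send email to [email protected]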

Citation

This package is mainly based on the following research paper. If you use this toolkit, please cite it:


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

  • Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based Gated Recursive Neural Network for Chinese Word Segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer Learning for Low-Resource Chinese Word Segmentation with a Novel Neural Network. NLPCC. 2017.

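Frequently asked questions

  1. Why release pkuseg?
  2. What techniques does pkuseg use?
  3. Multi-process segmentation and training do not work and raise RuntimeError and BrokenPipeError.
  4. How was pkuseg compared with other toolkits on domain-specific data?
  5. How does pkuseg perform when compared on black-box test sets?
  6. What if I do not know which domain my text belongs to?
  7. How should segmentation results on particular examples be interpreted?
  8. About running speed.
  9. About multi-process speed.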

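Acknowledgements

We thank Professor Yu Shiwen (俞士汶, Institute of Computational Linguistics, Peking University) and Dr. Qiu Likun (邱立坤) for providing the training datasets!

Authors

Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Bingzhen Wei (位冰镇), Xu Sun (孙栩)

Language Computing and Machine Learning Group, Peking University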

Comments
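  • Isn't the performance comparison with the other segmentation toolkits unfair?

    Were the jieba and THULAC models in the comparison trained on the corresponding training corpora (MSRA, CTB8)? If they had been, their results should not be this poor. An F-score of around 80% is close to that of unsupervised segmentation.

    Comparing pkuseg trained on in-domain data against jieba and THULAC that were not trained on that domain's data is clearly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from such an experiment.

    In fact, MSRA segmentation results reported in papers are generally above 97.5.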

    opened by jiesutd 31
  • Declaring victory over jieba after comparing a single sentence

    pkuseg:
    seg = pkuseg.pkuseg()
    print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
    ['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

    jieba:
    print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
    ['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

    Three words are segmented wrongly in one sentence. I do not see where the confidence to announce so loudly that pkuseg far surpasses jieba comes from ......

    opened by mendynew 8
  • undefined symbol: PyFPE_jbuf

    ImportError: /root/anaconda3/envs/NLP/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

    Ubuntu, installed with pip install pkuseg. Any ideas?

    opened by LCorleone 5
  • What is the required encoding of the input file?

    C:\Python36>python
    Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg()           # load the model with the default configuration
    >>> text = seg.cut('我爱北京天安门')  # segment the text
    >>> print(text)
    ['我', '爱', '北京', '天安门']
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
    Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\lutao/.pkuseg\postag.zip
    100.0%
    >>> text = seg.cut('我爱北京天安门')    # segment the text and tag parts of speech
    >>> print(text)
    [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
    >>> import pkuseg
    >>>
    >>> # segment input.txt and write the result to output.txt
    ... # using 20 processes
    ... pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
      File "<stdin>", line 3
        pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
                                                           ^
    SyntaxError: invalid syntax
    >>> pkuseg.test('c:/user/lutao/downloads/0309a.txt', 'c:/user/lutao/downloads/0309a_output.txt', nthread=10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 520, in test
        input_file, output_file, nthread, model_name, user_dict, postag, verbose
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 444, in _test_multi_proc
        raise Exception("input_file {} does not exist.".format(input_file))
    Exception: input_file c:/user/lutao/downloads/0309a.txt does not exist.
    

    I replaced '/' with '\'; the encoding of 0309a.txt is GBK.

    >>> pkuseg.test('c:\user\lutao\downloads\0309a.txt', 'c:\user\lutao\downloads\0309a_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    

    I saved 0309a.txt as 0309b.txt with UTF-8 encoding:

    >>> pkuseg.test('c:\user\lutao\downloads\0309b.txt', 'c:\user\lutao\downloads\0309b_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    
    opened by l1t1 5
  • import fails on Python 3.6

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
    Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

    python
    Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

    import pkuseg
    Traceback (most recent call last):
      File "", line 1, in
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg_init_.py", line 14, in
        import pkuseg.trainer as trainer
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
    opened by tangchun 5
  • What causes this problem?

    length = 1 : 0
    length = 2 : 2496
    length = 3 : 2642
    length = 4 : 2568
    length = 5 : 1313
    length = 6 : 633
    length = 7 : 249
    length = 8 : 133
    length = 9 : 66
    length = 10 : 16
    length = 11 : 6
    length = 12 : 1
    length = 13 : 1

    start training...

    reading training & test data... done! train/test data sizes: 1/1

    r: 1
    iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
    iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
    Traceback (most recent call last):
      File "test.py", line 8, in
        pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/init.py", line 324, in train
        trainer.train(config)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
        score_list = trainer.test(testset, i)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
        testset, self.model, writer
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
        gold_tags, pred_tags, self.idx_to_chunk_tag
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
        pre = correct_chunk / res_chunk * 100
    ZeroDivisionError: division by zero

    opened by Fabyone 4
  • ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

    import pkuseg

    seg = pkuseg.pkuseg()

    text = "我爱北京天安门"

    cut = seg.cut(text)
    print(cut)

    Traceback (most recent call last):
      File "E:/python/work/spider/bx/piggy.py", line 1, in
        import pkuseg
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg_init_.py", line 14, in
        import pkuseg.trainer
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "init.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    opened by xhochipe 4
  • FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    I installed pkuseg. On first use it needs to download postag.zip, but the download failed, so I downloaded the file myself and put it in the folder. Now I get the error FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    opened by hjing100 3
  • Installing 0.0.25 on Binder fails

    0.0.22 installs without problems.

    Collecting numpy
      Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
    Collecting pkuseg
      Downloading pkuseg-0.0.25.tar.gz (48.8 MB)
        ERROR: Command errored out with exit status 1:
         command: /srv/conda/envs/notebook/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-h22vfd4x
             cwd: /tmp/pip-install-5d95j8mq/pkuseg/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-5d95j8mq/pkuseg/setup.py", line 5, in <module>
            import numpy as np
        ModuleNotFoundError: No module named 'numpy'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by GoooIce 3
  • Error when using a domain-specific model after installing with pip?

    Traceback (most recent call last):
      File "py3_cook_corpus_embedding.py", line 18, in <module>
        seg = pkuseg.pkuseg(model_name='medicine')
      File "/home/work/software/anaconda3/envs/py3myhao/lib/python3.6/site-packages/pkuseg/__init__.py", line 224, in __init__
        self.feature_extractor = FeatureExtractor.load()
      File "pkuseg/feature_extractor.pyx", line 625, in pkuseg.feature_extractor.FeatureExtractor.load
    FileNotFoundError: [Errno 2] No such file or directory: 'medicine/unigram_word.txt'

    Also, when using a domain-specific model, can a custom dictionary be added at the same time?

    opened by kinghmy 3
  • Installation error on WSL2 + pyenv + Python 3.8.5

    (fastApi-env) [email protected]:/mnt/c/Users/Administrator$ pip install pkuseg
    Looking in indexes: http://mirrors.aliyun.com/pypi/simple
    Collecting pkuseg
      Downloading http://mirrors.aliyun.com/pypi/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
         |████████████████████████████████| 48.8 MB 79.6 MB/s
        ERROR: Command errored out with exit status 1:
         command: /home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"'; file='"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-99zrwcbj
             cwd: /tmp/pip-install-hjb0015_/pkuseg/
        Complete output (36 lines):
        WARNING: The wheel package is not available.
        WARNING: The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host mirrors.aliyun.com'.
        ERROR: Could not find a version that satisfies the requirement cython (from versions: none)
        ERROR: No matching distribution found for cython
        Traceback (most recent call last):
          File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 128, in fetch_build_egg
            subprocess.check_call(cmd)
          File "/home/xiaxichen/.pyenv/versions/3.8.5/lib/python3.8/subprocess.py", line 364, in check_call
            raise CalledProcessError(retcode, cmd)
        subprocess.CalledProcessError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 63, in <module>
        setup_package()
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 39, in setup_package
        setuptools.setup(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 162, in setup
        _install_setup_requires(attrs)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 157, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 699, in fetch_build_eggs
        resolved_dists = pkg_resources.working_set.resolve(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 779, in resolve
        dist = best[req.key] = env.best_match(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1064, in best_match
        return self.obtain(req, installer)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1076, in obtain
        return installer(requirement)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 758, in fetch_build_egg
        return fetch_build_egg(self, req)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 130, in fetch_build_egg
        raise DistutilsError(str(e)) from e
    distutils.errors.DistutilsError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

    opened by xiaxichen 2
  • Cannot install in a Python 3.9 environment

    Dear Sirs or Madams, installation into a Python 3.9 environment failed. I checked your repository at 'https://pypi.tuna.tsinghua.edu.cn/simple/pkuseg/', and there seem to be no files for Python 3.9 there. Do you have any plan to support Python 3.9? I also saw that in other issues you pointed to a file for Python 3.9, but I cannot find it. Your reply is highly appreciated. Tony

    opened by tonydeck0506 2
  • POS tagging works too well

    In theory, good results are a good thing, but in practice the tagger also labels nonexistent place names as place names.

    import pkuseg
    seg = pkuseg.pkuseg(postag=True)
    text = seg.cut('广场镇是河北天津衡水冲绳东京的旧地狱和亚特兰斯地吗?')
    for word, flag in text:
        if flag == 'ns':
            print(word)
    

    Output:

    广场镇
    河北
    天津
    衡水
    冲绳
    东京
    亚特兰斯
    
    opened by axty666 1
  • TypeError: train() got an unexpected keyword argument 'nthread'

    import pkuseg
    
    # The training file is 'train.txt'
    # The test file is 'test.txt'
    # Load the model in './pretrained', save the trained model to './models', and train for 10 iterations
    pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')
    
    
    opened by KangChou 1
Releases(v0.0.25)