A Chinese word segmenter based on CRF

Last update: Nov 04, 2022

Related tags

Text Data & NLP genius

Overview

Genius

Genius is an open-source Chinese word segmentation component for Python, built on the CRF (Conditional Random Field) algorithm.

Feature

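Supports Python 2.x, Python 3.x, and PyPy 2.x.
Supports simple pinyin segmentation.
Supports user-defined breaks.
Supports user-defined merge dictionaries.
Supports part-of-speech tagging.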

Source Install

Install git: 1) on Ubuntu or Debian, apt-get install git; 2) on Fedora or Red Hat, yum install git
Download the code: git clone https://github.com/duanhongyi/genius.git
Install the code (from inside the cloned genius directory): python setup.py install

Pypi Install

Run: easy_install genius or pip install genius

Algorithm

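Uses a trie to look up entries in the merge dictionary.
Implements conditional random field segmentation on top of Wapiti.
The default dictionaries can be overridden via genius.loader.ResourceLoader.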
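
To illustrate the trie lookup mentioned above, here is a minimal sketch of how adjacent tokens might be merged against a dictionary. It is not Genius's actual implementation: the names `TrieNode`, `Trie`, `longest_match`, and `combine` are hypothetical, and the greedy longest-match policy is an assumption made for the example.

```python
class TrieNode:
    """One node of a character trie (hypothetical helper, not part of Genius)."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Character trie holding the merge dictionary."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.is_word = True

    def longest_match(self, tokens, start):
        """Return the exclusive end index of the longest dictionary word
        formed by joining tokens[start:end], or start if nothing matches."""
        node = self.root
        best_end = start
        for end in range(start, len(tokens)):
            for char in tokens[end]:
                node = node.children.get(char)
                if node is None:
                    return best_end
            if node.is_word:
                best_end = end + 1
        return best_end


def combine(tokens, trie):
    """Greedily merge runs of two or more tokens that spell a dictionary word."""
    merged = []
    i = 0
    while i < len(tokens):
        end = trie.longest_match(tokens, i)
        if end > i + 1:
            merged.append(''.join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == '__main__':
    dictionary = Trie([u'南京市', u'长江大桥'])
    print(combine([u'南京', u'市', u'长江', u'大桥'], dictionary))
```

Walking the trie one token at a time means each merge attempt stops as soon as the joined characters leave the dictionary, so the cost is bounded by the length of the longest dictionary entry rather than the length of the sentence.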

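Feature 1): Segmentation with the `genius.seg_text` method

The genius.seg_text function accepts 5 parameters, of which only text is required:
text, the first parameter, is the string to segment.
use_break controls whether break processing is applied to the segmentation result; default True.
use_combine controls whether the dictionary is used to merge words; default False.
use_tagging controls whether part-of-speech tagging is performed; default True.
use_pinyin_segment controls whether pinyin is segmented; default True.

Code example (full-featured segmentation)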

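# -*- coding: utf-8 -*-
import genius

# Sample news-style sentence to segment.
text = u"""昨天,我和施瓦布先生一起与部分企业家进行了交流,大家对中国经济当前、未来发展的态势、走势都十分关心。"""
# Enable every option: dictionary merging, pinyin segmentation, tagging, and breaks.
seg_list = genius.seg_text(
    text,
    use_combine=True,
    use_pinyin_segment=True,
    use_tagging=True,
    use_break=True
)
# Print one word per line, followed by its part-of-speech tag.
print('\n'.join(['%s\t%s' % (word.text, word.tagging) for word in seg_list]))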

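Feature 2): Index-oriented segmentation

The genius.seg_keywords method is designed for search-engine indexing and preserves ambiguous segmentations; only text is required.
text, the first parameter, is the string to segment.
use_break controls whether break processing is applied to the segmentation result; default True.
use_tagging controls whether part-of-speech tagging is performed; default False.
use_pinyin_segment controls whether pinyin is segmented; default False.
Because merging conflicts with the purpose of this method, it does not offer a merge option. In addition, if you build your index with this method, calling genius.seg_text with use_combine=True at query time is not recommended.

Code example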

#encoding=utf-8
import genius

seg_list = genius.seg_keywords(u'南京市长江大桥')
print('\n'.join([word.text for word in seg_list]))
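
As a rough illustration of why index-time and query-time segmentation should stay compatible, the sketch below builds a tiny inverted index with pluggable tokenizers. Everything here is hypothetical: `build_index`, `search`, and `toy_tokenize` are names invented for the example, and the whitespace tokenizer is only a stand-in so the code runs without Genius installed.

```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each token to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query, tokenize):
    """Return the ids of documents containing every token of the query."""
    result = None
    for token in tokenize(query):
        ids = set(index.get(token, ()))
        result = ids if result is None else result & ids
    return sorted(result or ())


def toy_tokenize(text):
    # Hypothetical stand-in tokenizer: plain whitespace split.
    # With Genius, one might instead pass
    #   lambda t: [w.text for w in genius.seg_keywords(t)]   # indexing
    #   lambda t: [w.text for w in genius.seg_text(t)]       # querying
    return text.split()


if __name__ == '__main__':
    docs = {1: 'nanjing yangtze bridge', 2: 'nanjing mayor'}
    index = build_index(docs, toy_tokenize)
    print(search(index, 'nanjing bridge', toy_tokenize))
```

A query token only matches if the index contains that exact token, which is why a query segmented with merged dictionary words may miss documents indexed from finer-grained pieces.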

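Feature 3): Keyword extraction

The genius.extract_tag method is designed for extracting tag keywords; only text is required.
text, the first parameter, is the string to segment.
use_break controls whether break processing is applied to the segmentation result; default True.
use_combine controls whether the dictionary is used to merge words; default False.
use_pinyin_segment controls whether pinyin is segmented; default False.

Code example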

#encoding=utf-8
import genius

tag_list = genius.extract_tag(u'南京市长江大桥')
print('\n'.join(tag_list))

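Other notes 4):

The segmentation training corpus currently comes from the People's Daily of January 1998, so segmentation is relatively accurate for news-style articles.
CRF segmentation quality depends heavily on the genre and coverage of the training corpus; with a better corpus, both segmentation and tagging still have considerable room for improvement.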
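
To make the corpus dependence more concrete: CRF segmenters are commonly trained as character taggers, where a hand-segmented corpus is converted into one label per character. The sketch below assumes the widely used B/M/E/S scheme (begin, middle, end, single). Whether Genius uses exactly this label set is not stated here, and `words_to_tags` and `tags_to_words` are hypothetical helper names.

```python
def words_to_tags(words):
    """Turn a segmented sentence into one B/M/E/S label per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            tags.append('S')
        else:
            tags.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted B/M/E/S labels."""
    words = []
    current = ''
    for char, tag in zip(chars, tags):
        current += char
        if tag in ('E', 'S'):
            words.append(current)
            current = ''
    if current:
        # Flush a trailing word left open by an unfinished B/M run.
        words.append(current)
    return words


if __name__ == '__main__':
    gold = [u'南京市', u'长江', u'大桥']
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words(u''.join(gold), tags))
```

Under this framing the model only ever sees label patterns that occur in the training corpus, so words and genres missing from that corpus are where segmentation errors are most likely.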

Owner

duanhongyi
