SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

Overview

SimpleChinese2

SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

声明

本项目是为方便个人工作所创建的,仅有部分代码原创。包括分词、词云在内的诸多功能来自于其他项目,并非本人所写,如遇问题,请至原项目链接下提问,谢谢!

安装

pip install -U simplechinese==0.2.8

如从 git 上 clone,需要从以下地址下载词向量文件:

https://drive.google.com/file/d/1ltyiTHZk8kIBYQGbZS9GoO_DwDOEWnL9/view?usp=sharing

并拷贝至"./simplechinese/data/"文件夹下

使用方法

import simplechinese as sc

1. 文字预处理

>> print(sc.only_digits(x)) # 仅保留数字 01234 >>> print(sc.only_zh(x)) # 仅保留中文 测试测试测试测试 >>> print(sc.only_en(x)) # 仅保留英文 TestING >>> print(sc.remove_space(x)) # 去除空格 测试测试,TestING;¥%&01234测试测试 >>> print(sc.remove_digits(x)) # 去除数字 测试测试,TestING ;¥%& 测试测试 >>> print(sc.remove_zh(x)) # 去除中文 ,TestING ;¥%& 01234 >>> print(sc.remove_en(x)) # 去除英文 测试测试, ;¥%& 01234测试测试 >>> print(sc.remove_punctuations(x)) # 去除标点符号 测试测试TestING 01234测试测试 >>> print(sc.toLower(x)) # 修改为全小写字母 测试测试,testing ;¥%& 01234测试测试 >>> print(sc.toUpper(x)) # 修改为全大写字母 测试测试,TESTING ;¥%& 01234测试测试 >>> x = "测试,TestING:12345@#【】+=-()。." >>> print(sc.punc_norm(x)) # 将中文标点符号转换成英文标点符号 测试,TestING:12345@#[]+=-().. >>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str ">
>>> x = "测试测试,TestING    ;¥%& 01234测试测试"

>>> print(sc.only_digits(x))         # 仅保留数字
01234

>>> print(sc.only_zh(x))             # 仅保留中文
测试测试测试测试

>>> print(sc.only_en(x))             # 仅保留英文
TestING

>>> print(sc.remove_space(x))        # 去除空格
测试测试,TestING;¥%&01234测试测试

>>> print(sc.remove_digits(x))       # 去除数字
测试测试,TestING    ;¥%& 测试测试

>>> print(sc.remove_zh(x))           # 去除中文
,TestING    ;¥%& 01234

>>> print(sc.remove_en(x))           # 去除英文
测试测试,    ;¥%& 01234测试测试

>>> print(sc.remove_punctuations(x)) # 去除标点符号
测试测试TestING     01234测试测试

>>> print(sc.toLower(x))             # 修改为全小写字母
测试测试,testing    ;¥%& 01234测试测试

>>> print(sc.toUpper(x))             # 修改为全大写字母
测试测试,TESTING    ;¥%& 01234测试测试

>>> x = "测试,TestING:12345@#【】+=-()。."
>>> print(sc.punc_norm(x))           # 将中文标点符号转换成英文标点符号
测试,TestING:12345@#[]+=-()..

>>> # y = fillna(df) # 将pandas.DataFrame中的N/A单元格填充为长度为0的str

2. 基础NLP信息提取功能

该部分中,分词功能使用 jieba 实现,源码请参考:https://github.com/fxsjy/jieba

同/近义词查找功能复用了 synonyms 中的词向量数据文件,源码请参考:https://github.com/chatopera/Synonyms 但有所改动,改动如下

  1. 由于 pip 上传文件限制,synonyms 需要用户在完成 pip 安装后再下载词向量文件,国内下载需要设置镜像地址或使用特殊手段,有所不便。因此此处将词向量用 float16 表示,并使用 pca 降维至 64 维。总体效果差别不大,如果在意,请直接安装 synonyms 处理同/近义词查找任务。

  2. 原项目通过构建 KDTree 实现快速查找,但比较相似度是使用 cosine similarity,而 KDTree (sklearn) 本身不支持通过 cosine similarity 构建。因此原项目使用欧式距离构建树,导致输出结果有部分顺序混乱。为修复该问题,本项目将词向量归一化后再构建 KDTree,使得向量间的 cosine similarity 与欧式距离(即割线距离)正相关。具体推导可参考下文:https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier

  3. 原项目中未设置缓存上限,本项目中仅保留最近10000次查找记录。

x = "今天是我参加工作的第1天,我花了23.33元买了写零食犒劳一下自己。"
print(sc.extract_nums(x))              # 提取数字信息
[1.0, 23.33]

# mode: 0: No single character words. The words may be overlapped.
#       1: Have single character words. The words may be overlapped.
#       2: No single character words. The words are not overlapped.
#       3: Have single character words. The words are not overlapped.
#       4: Only single characters.
print(sc.extract_words(x, mode=0))      # 分词
['今天', '参加', '工作', '我花', '23.33', '零食', '犒劳', '一下', '自己']

a = "做人真的好难"
b = "做人实在太难了"
print(sc.string_distance(a,b))  # 编辑距离
0.46153846153846156

x = "种族歧视"
print(sc.find_synonyms(x, n=3))  # 同/近义词
[('种族歧视', 1.0), ('种族主义', 0.84619140625), ('歧视', 0.76416015625)]

3. 繁体简体转换

该部分使用 chinese_converter 实现,源码请参考:https://github.com/zachary822/chinese-converter

>> print(sc.to_traditional(x)) # 转换为繁体 烏龜測試123 >>> x = "烏龜測試123" >>> print(sc.to_simplified(x)) # 转换为简体 乌龟测试123 ">
>>> x = "乌龟测试123"
>>> print(sc.to_traditional(x))  # 转换为繁体
烏龜測試123

>>> x = "烏龜測試123"
>>> print(sc.to_simplified(x))   # 转换为简体
乌龟测试123

4. 特征提取和向量化

5. 词云和可视化

TODO:

  1. 句子向量化及句子相似度
  2. 其他特征提取相关工具
Owner
Ming
惊了
Ming
BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

Table of contents Introduction Using BARTpho with fairseq Using BARTpho with transformers Notes BARTpho: Pre-trained Sequence-to-Sequence Models for V

VinAI Research 58 Dec 23, 2022
Repository for Project Insight: NLP as a Service

Project Insight NLP as a Service Contents Introduction Features Installation Setup and Documentation Project Details Demonstration Directory Details H

Abhishek Kumar Mishra 286 Dec 06, 2022
A CRM department in a local bank works on classify their lost customers with their past datas. So they want predict with these method that average loss balance and passive duration for future.

Rule-Based-Classification-in-a-Banking-Case. A CRM department in a local bank works on classify their lost customers with their past datas. So they wa

ÖMER YILDIZ 4 Mar 20, 2022
CDLA: A Chinese document layout analysis (CDLA) dataset

CDLA: A Chinese document layout analysis (CDLA) dataset 介绍 CDLA是一个中文文档版面分析数据集,面向中文文献类(论文)场景。包含以下10个label: 正文 标题 图片 图片标题 表格 表格标题 页眉 页脚 注释 公式 Text Title

buptlihang 84 Dec 28, 2022
"Investigating the Limitations of Transformers with Simple Arithmetic Tasks", 2021

transformers-arithmetic This repository contains the code to reproduce the experiments from the paper: Nogueira, Jiang, Lin "Investigating the Limitat

Castorini 33 Nov 16, 2022
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
[AAAI 21] Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

◥ Curriculum Labeling ◣ Revisiting Pseudo-Labeling for Semi-Supervised Learning Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez. In the

UVA Computer Vision 113 Dec 15, 2022
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

Dani El-Ayyass 57 Dec 07, 2022
PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

SITT The repo contains official PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation. Authors: Boyi Li Yin Cui T

Boyi Li 52 Jan 05, 2023
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI, torch2trt to accelerate. our model support for int8, dynamic input and profiling. (Nvidia-Alibaba-TensoRT-hackathon2021)

Ultra_Fast_Lane_Detection_TensorRT An ultra fast tiny model for lane detection, using onnx_parser, TensorRTAPI to accelerate. our model support for in

steven.yan 121 Dec 27, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Span-ASTE: Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction ***** New March 31th, 2022: Scikit-Style API for Easy Usage *****

Chia Yew Ken 111 Dec 23, 2022
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
KoBERT - Korean BERT pre-trained cased (KoBERT)

KoBERT KoBERT Korean BERT pre-trained cased (KoBERT) Why'?' Training Environment Requirements How to install How to use Using with PyTorch Using with

SK T-Brain 1k Jan 02, 2023
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

凌逆战 75 Dec 05, 2022
DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

BuyWithCrypto 3 Apr 19, 2022