多语言降噪预训练模型MBart的中文生成任务

Last update: Sep 19, 2022

Related tags

Text Data & NLP mbart-chinese

Overview

mbart-chinese

基于mbart-large-cc25 的中文生成任务

Input

source input: text + </s> + lang_code
target input: lang_code + text + </s>

Usage

token_ids_mapping.json：从全量词表中抽取出的中文字符及高频英文字符，在老新词典中的映射关系表。

Todo

mbart在中文标题生成任务的评测结果

Owner

GitHub Repository

A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any other format

RITA DSL This is a language, loosely based on language Apache UIMA RUTA, focused on writing manual language rules, which compiles into either spaCy co

60 Sep 26, 2022

Easy-to-use CPM for Chinese text generation

CPM 项目描述 CPM（Chinese Pretrained Models）模型是北京智源人工智能研究院和清华大学发布的中文大规模预训练模型。官方发布了三种规模的模型，参数量分别为109M、334M、2.6B，用户需申请与通过审核，方可下载。由于原项目需要考虑大模型的训练和使用，需要安装较为复杂

382 Jan 07, 2023

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides spaCy components and architectures to use tr

1.2k Jan 08, 2023

Global Rhythm Style Transfer Without Text Transcriptions

Global Prosody Style Transfer Without Text Transcriptions This repository provides a PyTorch implementation of AutoPST, which enables unsupervised glo

193 Dec 30, 2022

Count the frequency of letters or words in a text file and show a graph.

Word Counter By EBUS Coding Club Count the frequency of letters or words in a text file and show a graph. Requirements Python 3.9 or higher matplotlib

0 Apr 09, 2022

DataCLUE: 国内首个以数据为中心的AI测评（含模型分析报告）

DataCLUE 以数据为中心的AI测评(DataCLUE) DataCLUE: A Chinese Data-centric Language Evaluation Benchmark 内容导引章节描述简介介绍以数据为中心的AI测评(DataCLUE)的背景任务描述任务描述实验结果

135 Dec 22, 2022

Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with build-in vectorization a

191 Dec 26, 2022

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022

voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

Command-line tools for speech and intent recognition on Linux

988 Jan 04, 2023

Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

14 Dec 12, 2022

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

247 Jan 05, 2023

GCRC: A Gaokao Chinese Reading Comprehension dataset for interpretable Evaluation

GCRC GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Eva

5 Nov 04, 2022

Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

2.1k Jan 01, 2023

🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

15k Jan 02, 2023