T'rex Park is a Youzan-sponsored project that offers Chinese NLP and image models pretrained on e-commerce datasets.

Overview

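T'rex Park (霸王龙公园, "T. rex Park")

The Trexpark project is open-sourced by Youzan's Data Intelligence team. It is the first open-source NLP and image project in China trained on large-scale e-commerce data. We plan to gradually release NLP and image models pretrained on text corpora such as product titles, reviews, and customer-service conversations, as well as on images such as main product photos and brand logos.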


Why a T. rex?


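The T. rex is Youzan's mascot. Well, strictly speaking it is less a mascot than a totem that people at Youzan use to push themselves. In the early days our website crashed frequently, and the browser would show a T. rex icon to indicate that the page had crashed. So we adopted the T. rex as our mascot, to keep everyone alert to failures and defects.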


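Why open-source the models?

Unlike platform e-commerce companies, Youzan is a merchant services company, and our mission is to help every merchant who values products and service succeed. We therefore chose not to offer these capabilities as a service behind open APIs. Instead, we open up the underlying capabilities directly to the merchants and small and medium-sized e-commerce businesses that need them, helping them quickly build their own machine learning applications on top of Youzan's accumulated data.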


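Why build domain-specific pretrained models?

Most open-source large models today are trained on general-purpose corpora. A language model trained on general corpora often performs poorly on machine learning tasks in a specific domain, or requires fine-tuning part of the pretrained model. In our practice, pretrained models that have been fine-tuned on e-commerce data learn domain knowledge better. On many tasks they achieve good results with no additional training, or by training only the prediction part of the model.

Our pretrained models, trained on e-commerce corpora, are well suited to small-sample machine learning tasks and address the few-shot problem faced by small and medium-sized e-commerce businesses and merchants. Take product title classification as an example: 100 samples per category are enough to get good classification results. A concrete example can be found here.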
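The few-shot setup described above can be sketched as a frozen encoder plus a lightweight classification head. The following is a minimal illustration under stated assumptions, not the project's actual code: `NearestCentroidHead` and `toy_encode` are hypothetical names, nearest-centroid is just one possible head, and the toy character-count encoder only stands in for a real pretrained encoder so that the example runs without downloading anything.

```python
from collections import defaultdict
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NearestCentroidHead:
    """Hypothetical few-shot head: the encoder stays frozen, only class centroids are fitted."""

    def __init__(self, encode: Callable[[str], Vector]):
        self.encode = encode
        self.centroids: Dict[str, Vector] = {}

    def fit(self, titles: Sequence[str], labels: Sequence[str]) -> "NearestCentroidHead":
        grouped = defaultdict(list)
        for title, label in zip(titles, labels):
            grouped[label].append(self.encode(title))
        # Each centroid is the element-wise mean of that class's embeddings.
        self.centroids = {
            label: [sum(column) / len(vectors) for column in zip(*vectors)]
            for label, vectors in grouped.items()
        }
        return self

    def predict(self, title: str) -> str:
        vector = self.encode(title)
        return max(self.centroids, key=lambda label: cosine(vector, self.centroids[label]))


VOCAB = "皂油车玩具香"


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (assumption): counts a few characters instead of using model embeddings."""
    return [float(text.count(ch)) for ch in VOCAB]


head = NearestCentroidHead(toy_encode).fit(
    ["手工皂", "精油香皂", "玩具车", "遥控玩具车"],
    ["personal_care", "personal_care", "toys", "toys"],
)
print(head.predict("青蒿精油手工皂"))
print(head.predict("超级飞侠乐迪太空车"))
```

In a real setting, `toy_encode` would be replaced by a function that returns a title embedding from the pretrained model, and the head could equally be a small linear classifier trained on those embeddings.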

Our models are published on the HuggingFace model hub. Using one takes only a few lines of code:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("youzanai/bert-product-title-chinese")
model = AutoModel.from_pretrained("youzanai/bert-product-title-chinese")

Once the model is loaded, we can run a simple encoder task:

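# Tokenize as a padded batch of PyTorch tensors so the model can consume it
batch = tokenizer(["青蒿精油手工皂", "超级飞侠乐迪太空车"], padding=True, return_tensors="pt")
outputs = model(**batch)
# AutoModel returns encoder hidden states (it has no classification head, so no logits)
print(outputs.last_hidden_state)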
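An encoder returns one vector per token, so a common next step is to pool them into a single vector per title. The following is a minimal sketch of masked mean pooling. It uses plain Python lists with made-up values in place of tensors so that it runs without the model, and `mean_pool` is a hypothetical helper, not part of Transformers or of this project.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy values (assumption): 4 token positions, hidden size 3, last position is padding.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, must not affect the result
]
attention_mask = [1, 1, 1, 0]

title_vector = mean_pool(token_vectors, attention_mask)
print(title_vector)
```

With real model output, the same idea would apply per title to the encoder's `last_hidden_state` and the tokenizer's `attention_mask`, typically using tensor operations rather than Python lists.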

The project's src directory contains the complete code and test data, which you can run directly to see the results.


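Documentation and Help

Detailed usage documentation is still being written; in the meantime, please refer to the example code in the src directory. To make the code easier to understand, we have simplified it as much as possible. T'rex Park is built on HuggingFace's Transformers library; its documentation can be found here.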
