T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

Overview

T'rex Park(霸王龙公园)

Trexpark项目由有赞数据智能团队开源,是国内首个基于电商大数据训练的开源NLP和图像项目。我们预期将逐步开放基于商品标题,评论,客服对话等NLP语聊,以及商品主图,品牌logo等进行预训练的NLP和图像模型。


为什么是霸王龙?

霸王龙

霸王龙是有赞的吉祥物。呃,准确的说这不是个吉祥物,而是有赞人自我鞭策的精神图腾。早期我们的网站经常崩溃,导致浏览器会显示一个霸王龙的图案,提示页面崩溃了。于是我们就把霸王龙作为我们的吉祥物,让大家时刻警惕故障和缺陷。


为什么要开源模型?

和平台电商不同,有赞是一家商家服务公司,我们的使命是帮助每一位重视产品和服务的商家成功。因此我们放弃了通过开放接口提供服务的方式,直接把底层能力开放出来,提供给需要的商家和中小型电商企业,帮助他们在有赞的数据沉淀基础上,快速构建自己的机器学习应用。


为什么要做领域预训练模型?

目前各个开源大模型往往基于通用语料训练,而通用语料的语言模型用于特定领域的机器学习任务,往往效果不佳,或者需要对预训练模型部分进行finetune。我们的实践发现,基于电商数据finetune以后的预训练模型,能更好的学习到领域知识,并且在多项任务中,无须额外训练,或者仅仅对模型的预测部分进行训练就可以达到很好的效果。

我们基于电商领域语料训练的预训练模型非常适合小样本的机器学习任务,用于解决中小电商企业和商家的fewshot难题。以商品标题分类为例,每个类目只需要100个样本,就能得到很好的分类效果,具体例子可以看这里

我们的模型已经在HuggingFace的model hub上发布,想要使用我们的模型,只需要几行代码

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("youzanai/bert-product-title-chinese")
model = AutoModel.from_pretrained("youzanai/bert-product-title-chinese")

模型加载后,我们就可以执行简单的encoder任务了

batch = tokenizer(["青蒿精油手工皂", "超级飞侠乐迪太空车"])
outputs = model(**batch)
print(outputs.logits)

项目的src目录中有完整的代码和测试用的数据,可以直接运行浏览效果。


文档和帮助

详细的使用文档我们还在编写中,大家可以先参考src目录中的示例代码。为了让代码更容易理解,我们已经尽可能的对代码进行了精简。T'rex Park底层使用了HuggingFace的Transformer框架,关于Transformer的文档可以看这里

MMDA - multimodal document analysis

MMDA - multimodal document analysis

AI2 75 Jan 04, 2023
This is a simple item2vec implementation using gensim for recbole

recbole-item2vec-model This is a simple item2vec implementation using gensim for recbole( https://recbole.io ) Usage When you want to run experiment f

Yusuke Fukasawa 2 Oct 06, 2022
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

172 Dec 29, 2022
Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

Steven Loria 8.4k Dec 26, 2022
[ICCV 2021] Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification

Counterfactual Attention Learning Created by Yongming Rao*, Guangyi Chen*, Jiwen Lu, Jie Zhou This repository contains PyTorch implementation for ICCV

Yongming Rao 89 Dec 18, 2022
Python3 to Crystal Translation using Python AST Walker

py2cr.py A code translator using AST from Python to Crystal. This is basically a NodeVisitor with Crystal output. See AST documentation (https://docs.

66 Jul 25, 2022
Train and use generative text models in a few lines of code.

blather Train and use generative text models in a few lines of code. To see blather in action check out the colab notebook! Installation Use the packa

Dan Carroll 16 Nov 07, 2022
An easy to use, user-friendly and efficient code for extracting OpenAI CLIP (Global/Grid) features from image and text respectively.

Extracting OpenAI CLIP (Global/Grid) Features from Image and Text This repo aims at providing an easy to use and efficient code for extracting image &

Jianjie(JJ) Luo 13 Jan 06, 2023
Download videos from YouTube/Twitch/Twitter right in the Windows Explorer, without installing any shady shareware apps

youtube-dl and ffmpeg Windows Explorer Integration Download videos from YouTube/Twitch/Twitter and more (any platform that is supported by youtube-dl)

Wolfgang 226 Dec 30, 2022
NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

source code for NeurIPS21 paper robabilistic Margins for Instance Reweighting in Adversarial Training

9 Dec 20, 2022
AudioCLIP Extending CLIP to Image, Text and Audio

AudioCLIP Extending CLIP to Image, Text and Audio This repository contains implementation of the models described in the paper arXiv:2106.13043. This

458 Jan 02, 2023
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。 本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。 本資料集從2,108篇

272 Dec 15, 2022
AMUSE - financial summarization

AMUSE AMUSE - financial summarization Unzip data.zip Train new model: python FinAnalyze.py --task train --start 0 --count how many files,-1 for all

1 Jan 11, 2022
Gold standard corpus annotated with verb-preverb connections for Hungarian.

Hungarian Preverb Corpus A gold standard corpus manually annotated with verb-preverb connections for Hungarian. corpus The corpus consist of the follo

RIL Lexical Knowledge Representation Research Group 3 Jan 27, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
A framework for implementing federated learning

This is partly the reproduction of the paper of [Privacy-Preserving Federated Learning in Fog Computing](DOI: 10.1109/JIOT.2020.2987958. 2020)

DavidChen 46 Sep 23, 2022
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Xinwei Geng 20 Dec 25, 2022
Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

hezw.tkcw 20 Dec 12, 2022
Source code of the "Graph-Bert: Only Attention is Needed for Learning Graph Representations" paper

Graph-Bert Source code of "Graph-Bert: Only Attention is Needed for Learning Graph Representations". Please check the script.py as the entry point. We

14 Mar 25, 2022
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022