Ecommerce product title recognition package

Last update: Mar 03, 2022

Overview

revizor

This package splits a product title string into components such as type, brand, model and article (also known as SKU or product code).
Think of it as classic named entity recognition, applied to product titles.

Install

revizor requires Python 3.8+ on Linux or macOS. Windows isn't supported yet, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

Actually, this is just the output from the flair training log:

Corpus: 138959 train + 15440 dev + 51467 test sentences
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789
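
As a sanity check, the aggregate scores can be recomputed from the per-class counts. This is a minimal sketch, assuming the standard definitions of precision, recall and F1; the `prf` helper and the `counts` dictionary are illustrative names, not part of revizor or flair:

```python
# Per-class counts copied from the training log above.
counts = {
    "ARTICLE": {"tp": 9893, "fp": 1899, "fn": 3268},
    "BRAND": {"tp": 47977, "fp": 2335, "fn": 514},
    "MODEL": {"tp": 35187, "fp": 11824, "fn": 9995},
    "TYPE": {"tp": 25044, "fp": 637, "fn": 443},
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Macro F1: unweighted mean of the per-class F1 scores.
per_class_f1 = {label: prf(**c)[2] for label, c in counts.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Micro F1: pool the counts over all classes first, then compute F1 once.
totals = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
micro_f1 = prf(**totals)[2]

print(f"micro: {micro_f1:.4f}, macro: {macro_f1:.4f}")
```

Micro F1 is dominated by the frequent BRAND and MODEL classes, while macro F1 gives each class equal weight, which is why the two numbers differ slightly.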

Dataset

The model was trained on an automatically annotated corpus. Since the corpus may be affected by DMCA, we won't publish it.
But we can give a hint on how to obtain it, can't we?
A dataset can be created by scraping any large marketplace, like goods, yandex.market or ozon.
We extract the product title and the table with product info, then parse the brand and model strings from that table.
Now we have the product title, brand and model. Then we can split the product title by the brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

# Split on the first occurrence of the brand only, and trim the spaces left around it
product_type, product_model_plus_some_random_info = (
    part.strip() for part in product_title.split(brand, 1)
)

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'
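
The same idea can be pushed one step further to produce token-level labels. The sketch below is an assumption about how such auto-annotation might look, not the actual pipeline used to build the corpus: `find_span` and `annotate_title` are hypothetical helpers, and whitespace tokenization and the BIO tagging scheme are illustrative choices.

```python
def find_span(tokens, phrase_tokens, start=0):
    """Return (begin, end) of the first occurrence of phrase_tokens, or None."""
    n = len(phrase_tokens)
    if n == 0:
        return None
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i, i + n
    return None


def annotate_title(product_title, brand, model):
    """Label whitespace tokens of a title with BIO tags (hypothetical helper)."""
    tokens = product_title.split()
    tags = ["O"] * len(tokens)

    def mark(begin, end, label):
        for i in range(begin, end):
            tags[i] = ("B-" if i == begin else "I-") + label

    brand_span = find_span(tokens, brand.split())
    if brand_span is None:
        # Brand not found in the title: leave everything unlabeled.
        return list(zip(tokens, tags))

    # Everything before the brand is treated as the product type.
    mark(0, brand_span[0], "TYPE")
    mark(brand_span[0], brand_span[1], "BRAND")

    # Look for the model only after the brand.
    model_span = find_span(tokens, model.split(), start=brand_span[1])
    if model_span is not None:
        mark(model_span[0], model_span[1], "MODEL")

    return list(zip(tokens, tags))


annotated = annotate_title(
    "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray", "Apple", "iPhone 12 Pro"
)
```

Tokens that match neither the type, brand nor model (here the storage size and color) keep the `O` tag. Titles where the brand string is missing are returned fully unlabeled so they can be filtered out.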

License

This package is licensed under the MIT license.

Owner

Bureaucratic Labs
