EmoBERT-MLOps

The goal of this repository is to build an end-to-end MLOps pipeline based on the MLOps course from Made with ML. This project differs from the course in its design, tools, and frameworks, with the aim of practicing the material and offering a different angle on, and implementation of, the original course.

This project uses a BERT model for emotion classification and is based on the GoEmotions dataset.

Content list

TODO

Dataset description

Taken from https://ai.googleblog.com/2021/10/goemotions-dataset-for-fine-grained.html

In “GoEmotions: A Dataset of Fine-Grained Emotions”, we describe GoEmotions, a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with 27 emotion categories. As the largest fully annotated English language fine-grained emotion dataset to date, we designed the GoEmotions taxonomy with both psychology and data applicability in mind. In contrast to the basic six emotions, which include only one positive emotion (joy), our taxonomy includes 12 positive, 11 negative, 4 ambiguous emotion categories and 1 “neutral”, making it widely suitable for conversation understanding tasks that require a subtle differentiation between emotion expressions.
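As a rough illustration of how this taxonomy could feed a classifier, the sketch below turns a comment's emotion labels into a multi-hot target vector. It assumes a multi-label setup (a GoEmotions comment can carry more than one emotion) and uses the label names from the published GoEmotions taxonomy. The helper names `encode_labels` and `decode_labels` are hypothetical and are not part of this repository.

```python
# Label names from the published GoEmotions taxonomy: 27 emotions plus "neutral".
EMOTIONS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]

# Map each label name to its position in the target vector.
LABEL_TO_INDEX = {name: i for i, name in enumerate(EMOTIONS)}


def encode_labels(labels):
    """Hypothetical helper: build a multi-hot vector with 1.0 at each label's index."""
    vector = [0.0] * len(EMOTIONS)
    for label in labels:
        vector[LABEL_TO_INDEX[label]] = 1.0  # raises KeyError on an unknown label
    return vector


def decode_labels(scores, threshold=0.5):
    """Hypothetical helper: return the labels whose score reaches the threshold."""
    return [name for name, score in zip(EMOTIONS, scores) if score >= threshold]


target = encode_labels(["joy", "love"])
print(len(target), decode_labels(target))
```

The 0.5 threshold in `decode_labels` is an arbitrary choice for the sketch; a real pipeline would likely tune it on validation data.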

Model description

TODO

Owner

Dimitre Oliveira
