Using BERT-based models for toxic span detection

Overview


SemEval-2021 Task 5: Toxic Spans Detection

Task:

The task asks systems to identify the character spans that make a post toxic. The SemEval-2021 Task 5 (Toxic Spans Detection) competition page is https://competitions.codalab.org/competitions/25623

References:

  1. https://huggingface.co/docs/transformers/training - To understand how to train a model.
  2. https://huggingface.co/docs/transformers/model_doc/roberta - To understand the RoBERTa model and its tokenizer.
  3. https://huggingface.co/docs/transformers/model_doc/distilbert - To understand the DistilBERT model and its tokenizer.
  4. https://github.com/huggingface/transformers/issues/14305 - To understand how to post-process predicted labels into spans.
  5. https://github.com/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb - The function tokenize_and_align_labels() is copied from this Hugging Face tutorial notebook, and its steps were followed to fine-tune the model on a custom dataset.
  6. https://github.com/ipavlopoulos/toxic_spans/blob/master/evaluation/metrics.py - The F1 score function provided by the competition, modified to accommodate our model output.
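
References 4 and 6 concern turning token-level predictions into character spans and scoring them with a character-offset F1. The following is a minimal sketch of that idea, not the repository's actual code. It assumes each token comes with a (start, end) character span, as a fast tokenizer's offset mapping would provide, and that a label of 1 marks a toxic token. The names labels_to_char_offsets and char_f1, and the example text, are hypothetical.

```python
from typing import List, Sequence, Tuple


def labels_to_char_offsets(
    offset_mapping: Sequence[Tuple[int, int]],
    token_labels: Sequence[int],
) -> List[int]:
    """Expand per-token labels (1 = toxic, an assumption) into sorted character offsets."""
    chars = set()
    for (start, end), label in zip(offset_mapping, token_labels):
        # Special tokens are assumed to carry an empty (0, 0) span, so they add nothing.
        if label == 1:
            chars.update(range(start, end))
    return sorted(chars)


def char_f1(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """F1 over sets of character offsets, in the spirit of the competition metric."""
    predicted_set, gold_set = set(predicted), set(gold)
    if not gold_set:
        # A post with no toxic span is scored 1.0 only for an empty prediction.
        return 1.0 if not predicted_set else 0.0
    if not predicted_set:
        return 0.0
    overlap = len(predicted_set & gold_set)
    return 2 * overlap / (len(predicted_set) + len(gold_set))


# Hypothetical example: the first and last entries stand in for special tokens.
text = "you are an idiot"
offsets = [(0, 0), (0, 3), (4, 7), (8, 10), (11, 16), (0, 0)]
labels = [0, 0, 0, 1, 1, 0]  # the model flags "an" and "idiot"

predicted = labels_to_char_offsets(offsets, labels)
gold = list(range(11, 16))  # only "idiot" is annotated as toxic
score = char_f1(predicted, gold)
print(predicted, round(score, 3))
```

Scoring on character offsets rather than tokens keeps the metric independent of the tokenizer, which matters when comparing RoBERTa and DistilBERT models that split text differently.
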
Owner
Ravika Nagpal
ML/AI/NLP enthusiast | Java/Scala/Python Developer| Ex- TCS/RBS | University of Alberta