The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

Last update: Nov 03, 2022

Related tags

Text Data & NLP analogy-language-model

Overview

BERT is to NLP what AlexNet is to CV

This is the official implementation of BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies? (the camera-ready version of the paper is here) which has been accepted by the ACL 2021 main conference. We evaluate pretrained language models (LM) on five analogy tests that follow SAT-style format as below.

QUERY word:language
OPTION
  (1) paint:portrait
  (2) poetry:rhythm 
  (3) note:music <-- the answer!
  (4) tale:story
  (5) week:year

We devise a new class of scoring functions, referred to as analogical proportion (AP) scores, to solve word analogies in an unsurpervised fashion and investigate the relational knowledge that LM learnt through pretraining.

Please see our paper for more information and discussion.

Get started

git clone https://github.com/asahi417/analogy-language-model
cd analogy-language-model
pip install -e .

Run Experiments

The following scripts reproduce our results in the paper.

# get result for our main AP score
python experiments/experiment_ppl_variants.py 
# get result for word embedding baseline
python experiments/experiment_word_embedding.py 
# get result for other scoring function such as vector difference, etc
python experiments/experiment_scoring_comparison.py

Here's the result summary that can be attained by running those scripts.

experimental results

Dataset

The datasets used in our experiments can be downloaded from the following link:

Analogy Datasets

Please see the Analogy Tool for more information about the dataset and baselines.

Citation

Please cite our reference paper if you use our data or code:

@inproceedings{ushio-etal-2021-bert,
    title = "{BERT} is to {NLP} what {A}lex{N}et is to {CV}: Can Pre-Trained Language Models Identify Analogies?",
    author = "Ushio, Asahi  and
      Espinosa Anke, Luis  and
      Schockaert, Steven  and
      Camacho-Collados, Jose",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.280",
    doi = "10.18653/v1/2021.acl-long.280",
    pages = "3609--3624",
    abstract = "Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as {``}eye is to seeing what ear is to hearing{''}, sometimes referred to as analogical proportions, shape how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.",
}

Please also cite the relevant reference papers if using any of the analogy datasets.

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

138 Dec 30, 2022

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Spanish Language Models 💃🏻 Corpora 📃 Corpora Number of documents Size (GB) BNE 201,080,084 570GB Models 🤖 RoBERTa-base BNE: https://huggingface.co

203 Dec 20, 2022

Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

27 Dec 12, 2022

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

BanglaBERT This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced i

197 Dec 25, 2022

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

GPT-NeoX An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hun

3.1k Jan 8, 2023

The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

Related tags

Overview

BERT is to NLP what AlexNet is to CV

Get started

Run Experiments

Dataset

Citation

You might also like...

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

The official repository of the ISBI 2022 KNIGHT Challenge

official ( API ) for the zAmericanEnglish app in [ Google play ] and [ App store ]

Official codebase for Can Wikipedia Help Offline Reinforcement Learning?

SAINT PyTorch implementation

An implementation of model parallel GPT-3-like models on GPUs, based on the DeepSpeed library. Designed to be able to train models in the hundreds of billions of parameters or larger.

Releases(0.0.0)

0.0.0(Apr 29, 2021)

Owner

Asahi Ushio

An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

This is a Prototype of an Ai ChatBot "Tea and Coffee Supplier" using python.

Stand-alone language identification system

A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

An open source framework for seq2seq models in PyTorch.

A simple implementation of N-gram language model.

Uncomplete archive of files from the European Nopsled Team

Chinese Grammatical Error Diagnosis

🗣️ NALP is a library that covers Natural Adversarial Language Processing.

MiCECo - Misskey Custom Emoji Counter

Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

Sequence-to-Sequence learning using PyTorch

The (extremely) naive sentiment classification function based on NBSVM trained on wisesight_sentiment

CDLA: A Chinese document layout analysis (CDLA) dataset

Gpt2-WebAPI - The objective of this API is to provide the 3 best possible responses to sentences that the user would input via http GET request as a parameter

Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

easySpeech is an open-source Python wrapper for google speech to text API that doesn't require PyAudio(So you especially windows user don't have to deal with the errors while installing PyAudio) and also works with hugging face transformers

A script that automatically creates a branch name using google translation api and jira api

A raytrace framework using taichi language