TF-IDF-based QA system for the AIO2 competition

Last update: Feb 19, 2022

AIO2 TF-IDF Baseline

This is a very simple question answering system, developed as a lightweight baseline for the AIO2 competition.

In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in the training dataset. In the inference stage, it answers an unseen question by computing dot-product scores between the question's TF-IDF features and those of every training question, then returning the answer of the most similar training question.

Therefore, in principle, the model cannot predict answers that do not appear in the training data.
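
The retrieval idea described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code in train.py or predict.py: the function names (`tokenize`, `build_index`, `predict`), the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and the toy English questions are all assumptions made for the example. The repository itself filters Japanese tokens by part of speech and stop words, which is not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization stands in for the morphological
    # analysis and POS filtering that the real system applies to Japanese.
    return text.lower().split()


def normalize(vec):
    # L2-normalize a sparse vector stored as {term: weight}.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def build_index(questions):
    # "Training": compute IDF weights and one TF-IDF vector per question.
    docs = [Counter(tokenize(q)) for q in questions]
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Assumption: smoothed IDF; other variants would work as well.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in doc.items()}) for doc in docs]
    return idf, vectors


def predict(question, idf, vectors, answers):
    # "Inference": score the input against every training question with a
    # dot product and return the answer of the best-scoring one.
    counts = Counter(t for t in tokenize(question) if t in idf)
    query = normalize({t: tf * idf[t] for t, tf in counts.items()})
    scores = [sum(w * vec.get(t, 0.0) for t, w in query.items()) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best]


# Toy data, invented for this example.
train_questions = [
    "what is the capital of japan",
    "who wrote the tale of genji",
    "what is the highest mountain in japan",
]
train_answers = ["Tokyo", "Murasaki Shikibu", "Mount Fuji"]

idf, vectors = build_index(train_questions)
print(predict("which mountain is the highest in japan", idf, vectors, train_answers))
```

Because `predict` can only return an element of `train_answers`, the sketch also makes the limitation concrete: a question whose correct answer never appears in the training data cannot be answered correctly.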

Steps to experiment with the model

Install requirements

$ pip install -r requirements.txt

Train

$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000

Predict

$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl

Building Docker image

$ docker build -t aio2-tfidf-baseline .

Test locally:

$ docker run --rm -v ":/app/input" -v ":/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl

Save the Docker image to a file:

$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz

License

The code in this repository is open-sourced under the MIT License.

Owner

Masatoshi Suzuki
