The aim of this task is to predict someone's English proficiency based on a text input.

Last update: Dec 13, 2021

Overview

English_proficiency_prediction_NLP

The data comes from the NICT JLE Corpus, available here: https://alaginrc.nict.go.jp/nict_jle/index_E.html

The corpus consists of transcripts of audio-recorded speech samples from 1,281 participants (1.2 million words, 300 hours in total) taking an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.

The goal is to build a machine learning algorithm for predicting the SST score of each participant based on their transcript.

Steps:

1 - Pre-process the dataset: extract each participant's transcript (all tags). Inside the participant transcript, you can remove all other tags and keep only the English words.

2 - Process the dataset: extract features with the Bag of Words (BoW) technique.

3 - Train a classifier to predict the SST score

4 - Compute the accuracy of your system (the fraction of participants classified correctly) and plot the confusion matrix.

5 - Try to improve your system (for example, by using GloVe embeddings instead of BoW).
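
Steps 1 to 4 can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the project's actual implementation: it assumes participant turns are wrapped in `<B>...</B>` (adjust the pattern to the real corpus markup), it uses a small hand-written multinomial Naive Bayes as a stand-in classifier, and the transcripts and scores in the toy data are invented. The names `extract_words`, `bag_of_words`, `NaiveBayes` and `evaluate` are hypothetical helpers.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: participant turns are wrapped in <B>...</B> in the transcript files.
PARTICIPANT_TURN = re.compile(r"<B>(.*?)</B>", re.DOTALL)
INNER_TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def extract_words(raw_transcript):
    """Step 1: keep participant turns, strip nested tags, return lowercase words."""
    turns = PARTICIPANT_TURN.findall(raw_transcript)
    text = INNER_TAG.sub(" ", " ".join(turns))
    return WORD.findall(text.lower())


def bag_of_words(words):
    """Step 2: map each word to its number of occurrences."""
    return Counter(words)


class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, bags, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, bag):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word, k in bag.items():
                if word in self.vocab:  # words unseen in training are ignored
                    score += k * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def evaluate(true_labels, predicted_labels, classes=range(1, 10)):
    """Step 4: accuracy and confusion matrix (rows = true SST, columns = predicted)."""
    classes = list(classes)
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    correct = sum(matrix[i][i] for i in range(len(classes)))
    return correct / len(true_labels), matrix


# Toy data invented for illustration; real input would be the corpus files.
train = [
    ("<A>Hello.</A><B>I <F>er</F> like dog</B>", 2),
    ("<A>Hi.</A><B>I like cat</B>", 2),
    ("<B>I would have preferred a quieter neighbourhood</B>", 8),
    ("<B>Frankly, I would rather discuss the economy</B>", 8),
]
model = NaiveBayes().fit(
    [bag_of_words(extract_words(text)) for text, _ in train],
    [score for _, score in train],
)
test_bag = bag_of_words(extract_words("<A>And you?</A><B>I like dog and cat</B>"))
predicted = model.predict(test_bag)
accuracy, matrix = evaluate([2], [predicted])
print(predicted, accuracy)
```

In practice, a library such as scikit-learn (`CountVectorizer`, `MultinomialNB`, `confusion_matrix`) would likely replace the hand-written parts, and the matrix returned by `evaluate` could be drawn with a plotting library such as matplotlib. On the real corpus, the accuracy should be measured on held-out participants rather than on the training set.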
