Contact Extraction with Question Answering.

Last update: Apr 20, 2022

Related tags

Text Data & NLP contactsQA

Overview

contactsQA

Extraction of contact entities from address blocks and imprints with Extractive Question Answering.

Goal

Input:

Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg

Output:

entities = {
  "city" : "Würzburg",
  "email" : "",
  "fax" : "",
  "firstName" : "Max",
  "lastName" : "Mustermann",
  "mobile" : "",
  "organization" : "",
  "phone" : "",
  "position" : "",
  "street" : "Hauptstraße 123",
  "title" : "Dr.",
  "website" : "",
  "zip" : "97070"
}

Getting started

Creating a dataset

Due to data protection reasons, no dataset is included in this repository. You need to create a dataset in the SQuAD format, see https://huggingface.co/datasets/squad. Create the dataset in the jsonl-format where one line looks like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'firstName',
        'answers': {
            'answer_start': [4],
            'text': ['Max']
        }
    }

Questions with no answers should look like this:

    {
        'id': '123',
        'title': 'mustermanns address',
        'context': 'Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.',
        'fixed': 'Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg',
        'question': 'phone',
        'answers': {
            'answer_start': [-1],
            'text': ['EMPTY']
        }
    }

Split the dataset into a train-, validation- and test-dataset and save them in a directory with the name crawl, email or expected, like this:

├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl

If you allow unanswerable questions like in SQuAD v2.0, add a -na behind the directory name, like this:

├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl

Training a model

Example command for training and evaluating a dataset inside the crawl-na directory:

python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;

Virtual Environment Setup

Create and activate the environment (the python version and the environment name can vary at will):

$ python3.9 -m venv .env
$ source .env/bin/activate

To install the project's dependencies, activate the virtual environment and simply run (requires poetry):

$ poetry install

Alternatively, use the following:

$ pip install -r requirements.txt

Deactivate the environment:

$ deactivate

Troubleshooting

Common error:

ModuleNotFoundError: No module named 'setuptools'

The solution is to upgrade setuptools and then run poetry install or poetry update afterwards:

pip install --upgrade setuptools

Contact Extraction with Question Answering.

Related tags

Overview

contactsQA

Goal

Getting started

Creating a dataset

Training a model

Virtual Environment Setup

Troubleshooting

Owner

Jan

Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

This repository collects together basic linguistic processing data for using dataset dumps from the Common Voice project

A BERT-based reverse dictionary of Korean proverbs

A collection of GNN-based fake news detection models.

AllenNLP integration for Shiba: Japanese CANINE model

Contains links to publicly available datasets for modeling health outcomes using speech and language.

Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.

Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Trained T5 and T5-large model for creating keywords from text

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

Implementation of ProteinBERT in Pytorch

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Train GPT-3 model on V100(16GB Mem) Using improved Transformer.

多语言降噪预训练模型MBart的中文生成任务

Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

A workshop with several modules to help learn Feast, an open-source feature store

ChatterBot is a machine learning, conversational dialog engine for creating chat bots

Understand Text Summarization and create your own summarizer in python

Python library for parsing resumes using natural language processing and machine learning

基于pytorch_rnn的古诗词生成