Contact Extraction with Question Answering.

Last update: Apr 20, 2022

Overview

contactsQA

Extraction of contact entities from address blocks and imprints with Extractive Question Answering.

Goal

Input:

Dr. Max Mustermann
Hauptstraße 123
97070 Würzburg

Output:

entities = {
  "city" : "Würzburg",
  "email" : "",
  "fax" : "",
  "firstName" : "Max",
  "lastName" : "Mustermann",
  "mobile" : "",
  "organization" : "",
  "phone" : "",
  "position" : "",
  "street" : "Hauptstraße 123",
  "title" : "Dr.",
  "website" : "",
  "zip" : "97070"
}
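
The mapping from an address block to the `entities` dictionary can be read as one extractive question per field. The following is a minimal sketch of that idea, not the repository's code: `FIELDS` is copied from the output above, while `extract_entities`, `answer_fn`, and `fake_answer_fn` are hypothetical names, and `fake_answer_fn` merely stands in for a trained QA model.

```python
from typing import Callable, Dict

# Field names copied from the example output above; each one doubles as the "question".
FIELDS = [
    "city", "email", "fax", "firstName", "lastName", "mobile", "organization",
    "phone", "position", "street", "title", "website", "zip",
]


def extract_entities(context: str, answer_fn: Callable[[str, str], str]) -> Dict[str, str]:
    """Ask one question per field; answer_fn(question, context) returns a span or ""."""
    return {field: answer_fn(field, context) for field in FIELDS}


def fake_answer_fn(question: str, context: str) -> str:
    """Stand-in for a trained model (assumption): looks answers up in a fixed table."""
    known = {
        "title": "Dr.", "firstName": "Max", "lastName": "Mustermann",
        "street": "Hauptstraße 123", "zip": "97070", "city": "Würzburg",
    }
    answer = known.get(question, "")
    # Only return spans that literally occur in the context (extractive QA).
    return answer if answer in context else ""


address = "Dr. Max Mustermann\nHauptstraße 123\n97070 Würzburg"
entities = extract_entities(address, fake_answer_fn)
print(entities)
```

Fields without an answer come back as empty strings, matching the output format shown above.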

Getting started

Creating a dataset

For data protection reasons, no dataset is included in this repository. You need to create your own dataset in the SQuAD format, see https://huggingface.co/datasets/squad. Store it in JSONL format, where each line holds one record like this (spread over several lines here for readability):

    {
        "id": "123",
        "title": "mustermanns address",
        "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
        "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
        "question": "firstName",
        "answers": {
            "answer_start": [4],
            "text": ["Max"]
        }
    }

Questions with no answers should look like this:

    {
        "id": "123",
        "title": "mustermanns address",
        "context": "Meine Adresse ist folgende: \n\nDr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg \n Schicken Sie mir bitte die Rechnung zu.",
        "fixed": "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg",
        "question": "phone",
        "answers": {
            "answer_start": [-1],
            "text": ["EMPTY"]
        }
    }
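
Records like the two above can be generated programmatically. The following is a minimal sketch; `make_record` is a hypothetical helper, not part of the repository. One assumption is worth stating loudly: in the example above, `answer_start` 4 is the position of "Max" within `fixed`, so the sketch measures offsets against `fixed`. If your pipeline expects offsets into `context` (as in standard SQuAD), change `offset_text` accordingly.

```python
import json
from typing import Optional


def make_record(record_id: str, title: str, context: str, fixed: str,
                question: str, answer: Optional[str]) -> dict:
    """Build one SQuAD-style record; answer=None marks an unanswerable question."""
    offset_text = fixed  # assumption: offsets refer to `fixed`, as in the example above
    if answer is None:
        start, text = -1, "EMPTY"  # no-answer convention shown above
    else:
        start, text = offset_text.find(answer), answer
        if start < 0:
            raise ValueError(f"answer {answer!r} not found in text")
    return {
        "id": record_id, "title": title, "context": context, "fixed": fixed,
        "question": question,
        "answers": {"answer_start": [start], "text": [text]},
    }


fixed = "Dr. Max Mustermann \nHauptstraße 123 \n97070 Würzburg"
context = ("Meine Adresse ist folgende: \n\n" + fixed
           + " \n Schicken Sie mir bitte die Rechnung zu.")
first_name = make_record("123", "mustermanns address", context, fixed, "firstName", "Max")
phone = make_record("124", "mustermanns address", context, fixed, "phone", None)

# One JSON object per line; ensure_ascii=False keeps umlauts readable.
lines = [json.dumps(record, ensure_ascii=False) for record in (first_name, phone)]
print("\n".join(lines))
```

`json.dumps` escapes the newlines inside `context`, so each record stays on a single line as JSONL requires.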

Split the dataset into train, validation, and test sets and save them in a directory named crawl, email, or expected, like this:

├── data
│   ├── crawl
│   │   ├── crawl-test.jsonl
│   │   ├── crawl-train.jsonl
│   │   ├── crawl-val.jsonl

If you allow unanswerable questions as in SQuAD v2.0, append -na to the directory name and to the file names, like this:

├── data
│   ├── crawl-na
│   │   ├── crawl-na-test.jsonl
│   │   ├── crawl-na-train.jsonl
│   │   ├── crawl-na-val.jsonl
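
The split and the naming scheme above can be sketched as follows. This is an illustration under assumptions, not the repository's tooling: `write_splits` is a hypothetical helper, and the 80/10/10 ratio and fixed seed are arbitrary choices.

```python
import json
import random
from pathlib import Path


def write_splits(records, data_dir, name, no_answers=False,
                 seed=0, val_frac=0.1, test_frac=0.1):
    """Shuffle records and write <dir>-train/val/test.jsonl under data_dir/<dir>."""
    dir_name = f"{name}-na" if no_answers else name  # "-na" suffix as described above
    out_dir = Path(data_dir) / dir_name
    out_dir.mkdir(parents=True, exist_ok=True)

    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    splits = {
        "test": shuffled[:n_test],
        "val": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }

    paths = {}
    for split, rows in splits.items():
        path = out_dir / f"{dir_name}-{split}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for row in rows:
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        paths[split] = path
    return paths
```

Calling `write_splits(records, "data", "crawl", no_answers=True)` writes `data/crawl-na/crawl-na-train.jsonl` and its `val` and `test` siblings. Since several records usually share one context, you may prefer to split by document rather than by record so that the same address block does not appear in both training and test data.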

Training a model

Example command for training and evaluating a model on the dataset in the crawl-na directory:

python app/qa-pipeline.py \
--batch_size 4 \
--checkpoint xlm-roberta-base \
--dataset_name crawl \
--dataset_path="../data/" \
--deactivate_map_caching \
--doc_stride 128 \
--epochs 3 \
--gpu_device 0 \
--learning_rate 0.00002 \
--max_answer_length 30 \
--max_length 384 \
--n_best_size 20 \
--n_jobs 8 \
--no_answers \
--overwrite_output_dir;
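
The `--max_length` and `--doc_stride` flags suggest that long contexts are split into overlapping windows, as in the Hugging Face question-answering examples. That reading is an assumption, since the script itself is not shown here. Under that assumption, the windowing can be sketched in plain Python; `sliding_windows` is a hypothetical name, and real tokenizers additionally reserve room for the question and special tokens.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence, max_length: int = 384,
                    doc_stride: int = 128) -> List[Sequence]:
    """Split tokens into windows of at most max_length tokens.

    Consecutive windows overlap by doc_stride tokens, so an answer that
    straddles a window boundary is still fully contained in one window
    as long as it is no longer than doc_stride tokens.
    """
    if not 0 <= doc_stride < max_length:
        raise ValueError("doc_stride must satisfy 0 <= doc_stride < max_length")
    step = max_length - doc_stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # the current window already reaches the end
        start += step
    return windows


windows = sliding_windows(list(range(1000)))
print([(w[0], w[-1]) for w in windows])
```

A short context that fits into `max_length` yields a single window, so the overlap only matters for long inputs.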

Virtual Environment Setup

Create and activate the environment (the Python version and the environment name are up to you):

$ python3.9 -m venv .env
$ source .env/bin/activate

To install the project's dependencies, activate the virtual environment and run (requires Poetry):

$ poetry install

Alternatively, install the dependencies with pip:

$ pip install -r requirements.txt

Deactivate the environment:

$ deactivate

Troubleshooting

Common error:

ModuleNotFoundError: No module named 'setuptools'

The solution is to upgrade setuptools and then run poetry install or poetry update again:

pip install --upgrade setuptools
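
To check up front whether the active environment is affected, a small test like the following may help. It is a sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("setuptools"):
    print("setuptools is missing; run: pip install --upgrade setuptools")
```

Run it with the interpreter of the activated virtual environment, since a system-wide Python may report a different result.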

Owner

Jan
