This is the Assignment 1 code for the Web Data Processing System.

Last update: Dec 04, 2022

Overview

First Assignment - Entity Linking

Web Data Processing System Assignment 1 - 2021 - Group 26

Zhining Bai
Bowen Lyu
Tianshi Chen
Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. We recognize entities in web pages and link them to a knowledge base (Wikidata). The pipeline is as follows:

Read WARC

Use pyspark to read large-scale WARC files, so the program supports parallel computing.
Extract the text from each HTML page using BeautifulSoup.
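As a rough illustration of this step, the sketch below splits a WARC stream into records, reads the page ID from a configurable header, and strips the HTML down to visible text. It is not the project's code: it uses only the Python standard library (html.parser stands in for BeautifulSoup, and plain string splitting stands in for the pyspark reader), and the names TextExtractor, html_to_text and split_records are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string (stdlib stand-in for BeautifulSoup)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_records(warc_text, key_name):
    """Yield (page_id, text) for every record that carries the key_name header.

    Assumption: records are delimited by the literal version line "WARC/1.0".
    """
    for record in warc_text.split("WARC/1.0"):
        page_id = None
        for line in record.splitlines():
            if line.startswith(key_name + ":"):
                page_id = line.split(":", 1)[1].strip()
                break
        if page_id is None:
            continue  # no page ID header, e.g. a warcinfo record
        start = record.lower().find("<html")
        if start == -1:
            continue  # record has no HTML payload
        yield page_id, html_to_text(record[start:])


sample = (
    "WARC/1.0\n"
    "WARC-Type: response\n"
    "WARC-TREC-ID: clueweb12-0000tw-00-00001\n"
    "\n"
    "HTTP/1.1 200 OK\n"
    "\n"
    "<html><head><title>T</title><script>var x=1;</script></head>"
    "<body><p>Hello <b>world</b></p></body></html>\n"
)
records = list(split_records(sample, "WARC-TREC-ID"))
```

Because split_records works on one chunk of text at a time, the same logic could in principle be applied per partition in a parallel job; how the project actually distributes the work is not shown here.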

Named entity recognition

Extract entities using the recognize_entities_bert pretrained model from Spark NLP (sparknlp).

Disambiguation and NIL

To disambiguate, we consider both the popularity of each candidate page and the semantic similarity between the sentence containing the entity and the candidate's description.

Popularity: Rank candidates using the Elasticsearch score and the number of properties the candidate has in the knowledge graph.
Sentence similarity: Measure the difference between the sentence text and the candidate description using the Levenshtein distance.

NIL: Only results with a distance < 40 are retained; mentions with no such candidate are left unlinked.
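The selection and NIL rule can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the README does not say how popularity and distance are combined, so the sketch assumes the smallest distance wins and popularity only breaks ties. The pure-Python levenshtein function stands in for the python-Levenshtein package, and link_entity, NIL_THRESHOLD and the candidate IDs are hypothetical names.

```python
def levenshtein(a, b):
    """Edit distance between two strings, computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


NIL_THRESHOLD = 40  # from the README: keep results with distance < 40


def link_entity(sentence, candidates, threshold=NIL_THRESHOLD):
    """Pick a candidate for the mention, or return None (NIL).

    candidates: iterable of (entity_id, description, popularity).
    Assumption: lowest distance wins; higher popularity breaks ties.
    """
    best = None
    for entity_id, description, popularity in candidates:
        distance = levenshtein(sentence, description)
        if distance >= threshold:
            continue  # too dissimilar, treated as NIL for this candidate
        key = (distance, -popularity)
        if best is None or key < best[0]:
            best = (key, entity_id)
    return None if best is None else best[1]


sentence = "Paris is the capital of France"
candidates = [
    ("cand-1", "capital of France", 10.0),
    ("cand-2", "Paris Hilton, American media personality and businesswoman "
               "from New York City", 5.0),
]
chosen = link_entity(sentence, candidates)
```

Note that a raw edit distance grows with the length difference between the two strings, so a fixed threshold such as 40 favours short descriptions and short sentences.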

Prerequisites

The code runs on the DAS cluster at /var/scratch/wdps2106/wdps_2126, where result1 is a conda virtual environment that has already been created. The packages below must be installed to run the assignment.

# To install the packages with pip (pip for Python 3; Python version 3.8), use the following commands
pip install pyspark==3.1.2
pip install spark-nlp==3.3.3
pip install beautifulsoup4
pip install python-Levenshtein
pip install elasticsearch

# To install the packages with conda (recommended), use the following commands
conda create -n <env_name> python=3.8
conda install pyspark
conda install bs4
conda install elasticsearch
pip install python-Levenshtein
pip install spark-nlp==3.3.3

Run

To run the program, use the command below. The Keyname parameter is the name of the page ID field in the WARC files, such as WARC_TREC_ID, and must be given explicitly. Be aware that the result file will be renamed to result.tsv.

sh run.sh /path/to/warc/file.warc.gz /path/to/result/ Keyname

If you use the DAS cluster, you also need to run this command first:

export OPENBLAS_NUM_THREADS=10

To check the score of the result file, use the command below.

python3 score.py /sample/annotation/file/sample.tsv /generated/result/file/result.tsv

Result

We tested our entity linking code on sample.warc.gz. Since sample_annotations.tsv only contains entities whose page_id is less than 92, our test results only include entity links with page_id <= 92. The F1 score on the sample data is 0.1122.

Metric	Value
Gold	500
Predicted	480
Correct	55
Precision	0.1145
Recall	0.11
F1 Score	0.1122
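The metrics in the table follow from the three counts by the usual definitions: precision = Correct / Predicted, recall = Correct / Gold, and F1 is their harmonic mean. The short sketch below recomputes them; the function name precision_recall_f1 is hypothetical and this is not the project's score.py.

```python
def precision_recall_f1(gold, predicted, correct):
    """Compute precision, recall and F1 from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the table above.
p, r, f1 = precision_recall_f1(gold=500, predicted=480, correct=55)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

With these counts the precision is 55/480 ≈ 0.11458, so the table's 0.1145 appears to be truncated rather than rounded; recall and F1 match the table.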


Releases(wdps)

wdps(Jun 1, 2022)

This is a release test.
