This is the Assignment 1 code for the Web Data Processing System.

Last update: Dec 04, 2022

Overview

First Assignment - Entity Linking

Web Data Processing System Assignment 1 - 2021 - Group 26

Zhining Bai
Bowen Lyu
Tianshi Chen
Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. It recognizes entities in web pages and links them to a knowledge base (Wikidata). The pipeline is as follows:

Read WARC

Use pyspark to read large-scale WARC files, so the program supports parallel computing.
Extract the text from the HTML pages using beautifulsoup.
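
The two steps above can be sketched in plain Python. This is a minimal single-process sketch under stated assumptions, not the project's code: it stands in for the pyspark read by splitting an already-decompressed string on the WARC/1.0 version line, and it swaps beautifulsoup for the standard library's html.parser so the example has no dependencies. The names iter_pages, html_to_text and RECORD_DELIMITER are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: every record starts with this version line.
RECORD_DELIMITER = "WARC/1.0"


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def iter_pages(warc_text, key_name):
    """Yield (page_id, text) for each record that carries the key_name header."""
    for record in warc_text.split(RECORD_DELIMITER):
        page_id = None
        for line in record.splitlines():
            if line.startswith(key_name + ":"):
                page_id = line.split(":", 1)[1].strip()
                break
        if page_id is None:
            continue  # e.g. warcinfo records without a page ID
        start = record.find("<html")
        if start == -1:
            continue  # no HTML payload in this record
        yield page_id, html_to_text(record[start:])
```

Because each record is processed independently, a function like iter_pages maps naturally onto a parallel collection, which is what makes the pyspark-based read scale.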

Named entity recognition

Extract entities using the recognize_entities_bert model from sparknlp.

Disambiguation and NIL

To disambiguate, we consider both the popularity of each candidate page and the similarity between the sentence containing the entity and the candidate's description.

Popularity: Rank candidates using the Elasticsearch scoring algorithm and the number of properties the candidate entity has in the knowledge graph.
Sentence similarity: Measure the difference between the sentence and the description using the Levenshtein distance.

NIL: Retain only results with a distance < 40; a mention with no such candidate is treated as NIL.
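
The disambiguation and NIL rules can be illustrated with a short sketch. It is not the project's implementation: the pure-Python levenshtein function stands in for the python-Levenshtein package, the popularity value is assumed to be precomputed, and the way the two signals are combined here (smallest distance first, higher popularity as a tie-breaker) is an assumption, since the text does not state the exact combination. The names link_entity and NIL_THRESHOLD are hypothetical.

```python
NIL_THRESHOLD = 40  # from the text: keep only results with distance < 40


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def link_entity(sentence, candidates):
    """Pick a candidate ID, or None (NIL) if every candidate is too distant.

    candidates: list of (entity_id, description, popularity) tuples.
    Assumed ranking: smallest distance wins, popularity breaks ties.
    """
    best = None
    for entity_id, description, popularity in candidates:
        distance = levenshtein(sentence.lower(), description.lower())
        if distance >= NIL_THRESHOLD:
            continue  # too dissimilar to count as a link
        key = (distance, -popularity)
        if best is None or key < best[0]:
            best = (key, entity_id)
    return None if best is None else best[1]
```

A fixed threshold on a raw edit distance favours short sentences and short descriptions, so the cut-off of 40 is best read as a value tuned on the sample data.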

Prerequisites

The code runs on the DAS cluster at /var/scratch/wdps2106/wdps_2126, where result1 is a conda virtual environment that has already been created. The packages below must be installed to run the assignment.

# To install the packages with pip (pip for Python 3), use the following commands (Python 3.8):
pip install pyspark==3.1.2
pip install spark-nlp==3.3.3
pip install beautifulsoup4
pip install python-Levenshtein
pip install elasticsearch

# To install the packages with conda (recommended), use the following commands:
conda create -n <env_name> python=3.8
conda install pyspark
conda install bs4
conda install elasticsearch
pip install python-Levenshtein
pip install spark-nlp

Run

To run the program, use the command below. The Keyname parameter is the name of the page ID field in the WARC files, such as WARC_TREC_ID, and must be passed explicitly. Note that the result file will be named result.tsv.

sh run.sh /path/to/warc/file.warc.gz /path/to/result/ Keyname

If you use the DAS cluster, also run this command first:

export OPENBLAS_NUM_THREADS=10

To check the score of the result file, use the command below.

python3 score.py /sample/annotation/file/sample.tsv /generated/result/file/result.tsv

Result

We tested our entity linking code on sample.warc.gz. Since sample_annotations.tsv only contains entities for pages with page_id <= 92, our test results only include entity links for those pages. The F1 score on the sample data is 0.1122.

Metric	Value
Gold	500
Predicted	480
Correct	55
Precision	0.1145
Recall	0.11
F1 Score	0.1122
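
The metrics in the table follow from the three counts. The sketch below recomputes them using the standard definitions; it is an illustration and not the code of score.py, and the function name precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(gold, predicted, correct):
    """Standard precision, recall and F1 from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the table above.
precision, recall, f1 = precision_recall_f1(gold=500, predicted=480, correct=55)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts, precision is 55/480, recall is 55/500, and F1 simplifies to 2 * 55 / (500 + 480), which matches the reported values up to rounding in the last digit.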


Releases(wdps)

wdps(Jun 1, 2022)

This is a release test.