Language-Agnostic Website Embedding and Classification

Last update: Dec 27, 2022

Homepage2Vec

Language-Agnostic Website Embedding and Classification based on Curlie labels https://arxiv.org/pdf/2201.03677.pdf

Homepage2Vec is a pre-trained model that supports the classification and embedding of websites starting from their homepage.

Left: Projection in two dimensions with t-SNE of the embedding of 5K random samples of the testing set. Colors represent the 14 classes. Right: The projection with t-SNE of some popular websites shows that embedding vectors effectively capture website topics.

Curated Curlie Dataset

We release the full training dataset obtained from Curlie. The dataset includes the websites that were online in April 2021 and whose URL is recognized as a homepage. It contains the original labels, the labels aligned to English, and the fetched HTML pages.

Get it here: https://doi.org/10.6084/m9.figshare.16621669

Getting started with the library

Installation:

Install the library with pip:

pip install homepage2vec

Usage:

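import logging
from homepage2vec.model import WebsiteClassifier

# Set the root logger to DEBUG to see the library's log messages.
logging.getLogger().setLevel(logging.DEBUG)

# Load the pre-trained classifier.
model = WebsiteClassifier()

# Fetch the homepage of the website to classify.
website = model.fetch_website('epfl.ch')

# scores: probability per class; embeddings: embedding vector of the website.
scores, embeddings = model.predict(website)

print("Classes probabilities:", scores)
print("Embedding:", embeddings)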

Result:

Classes probabilities: {'Arts': 0.3674524128437042, 'Business': 0.0720655769109726,
 'Computers': 0.03488553315401077, 'Games': 7.529282356699696e-06, 
 'Health': 0.02021787129342556, 'Home': 0.0005890956381335855, 
 'Kids_and_Teens': 0.3113572597503662, 'News': 0.0079914266243577, 
 'Recreation': 0.00835705827921629, 'Reference': 0.931416392326355, 
 'Science': 0.959597110748291, 'Shopping': 0.0010162043618038297, 
 'Society': 0.23374591767787933, 'Sports': 0.00014659571752417833}
 
Embedding: [-4.596550941467285, 1.0690114498138428, 2.1633379459381104,
 0.1665923148393631, -4.605356216430664, -2.894961357116699, 0.5615459084510803, 
 1.6420538425445557, -1.918184757232666, 1.227172613143921, 0.4358430504798889, 
 ...]
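
The scores in the example output do not sum to 1, so each class appears to be scored independently and a website can carry several labels at once. The following is a minimal sketch of how such a dictionary could be turned into a ranked list and a set of predicted labels. The helper names `top_classes` and `predicted_labels` and the 0.5 threshold are illustrative assumptions, not part of the library; the values are rounded copies of the output above.

```python
# Scores copied (and rounded) from the example output above.
scores = {
    "Arts": 0.3675, "Business": 0.0721, "Computers": 0.0349,
    "Games": 0.0000075, "Health": 0.0202, "Home": 0.0006,
    "Kids_and_Teens": 0.3114, "News": 0.0080, "Recreation": 0.0084,
    "Reference": 0.9314, "Science": 0.9596, "Shopping": 0.0010,
    "Society": 0.2337, "Sports": 0.0001,
}


def top_classes(scores, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def predicted_labels(scores, threshold=0.5):
    """Return, in alphabetical order, the labels scoring at least `threshold`.

    The 0.5 default is an assumption made for this sketch.
    """
    return sorted(label for label, p in scores.items() if p >= threshold)


print(top_classes(scores))
print(predicted_labels(scores))
```

For epfl.ch, this keeps Reference and Science as labels, with Arts as the next-highest class below the threshold.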

The library automatically downloads the pre-trained Homepage2Vec and XLM-R models on first use.
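
The embedding returned by `model.predict` is a plain list of floats, so two websites can be compared with any vector similarity. The following is a sketch using cosine similarity in pure Python. The choice of cosine similarity is an assumption made for illustration, and the short vectors are made-up placeholders, not real Homepage2Vec embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity is undefined for a zero vector; report 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for the embeddings of three websites.
site_a = [1.0, 2.0, 0.5]
site_b = [2.0, 4.0, 1.0]    # same direction as site_a
site_c = [-1.0, 0.5, 0.0]   # orthogonal to site_a

print(cosine_similarity(site_a, site_b))
print(cosine_similarity(site_a, site_c))
```

In practice, `site_a` and `site_b` would be replaced by the second value returned from `model.predict` for two different websites.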

Using visual features

To include visual features in the prediction, Homepage2Vec needs to take a screenshot of the website, which requires a working installation of Selenium and the Chrome browser. Note that, as reported in the reference paper, the performance improvement from visual features is limited.

Install the Selenium Chrome web driver and add its folder to the system $PATH variable. You also need a local copy of the Chrome browser (see Getting started).
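
Before enabling visual features, it can help to confirm that the web driver is actually reachable through $PATH. The sketch below uses only the Python standard library. The executable name `chromedriver` is an assumption: it is the usual name of the Chrome web driver binary, but it may differ on your platform.

```python
import shutil


def find_on_path(executable):
    """Return the full path of `executable` if it is on PATH, else None."""
    return shutil.which(executable)


def report(executables):
    """Map each executable name to its resolved path (or None if missing)."""
    return {name: find_on_path(name) for name in executables}


if __name__ == "__main__":
    # "chromedriver" is the assumed binary name; adjust it for your setup.
    for name, path in report(["chromedriver"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```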

Getting involved

We invite contributions to Homepage2Vec! Please open a pull request if you have any suggestions.

Original publication

Language-Agnostic Website Embedding and Classification

Sylvain Lugeon, Tiziano Piccardi, Robert West

Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset with more than 1M websites in 92 languages with relative labels collected from Curlie, the largest multilingual crowdsourced Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and can generate embeddings representation. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources.

https://arxiv.org/pdf/2201.03677.pdf

Dataset License

Creative Commons Attribution 3.0 Unported License - Curlie

Learn more about how to contribute: https://curlie.org/docs/en/about.html
