Semantic Scholar's Author Disambiguation Algorithm & Evaluation Suite

Related tags

Deep LearningS2AND
Overview

S2AND

This repository provides access to the S2AND dataset and S2AND reference model described in the paper S2AND: A Benchmark and Evaluation System for Author Name Disambiguation by Shivashankar Subramanian, Daniel King, Doug Downey, Sergey Feldman (https://arxiv.org/abs/2103.07534).

The reference model will be live on semanticscholar.org later this year, but the trained model is available now as part of the data download (see below).

Installation

To install this package, run the following:

git clone https://github.com/allenai/S2AND.git
cd S2AND
conda create -y --name s2and python==3.7
conda activate s2and
pip install -r requirements.in
pip install -e .

To obtain the training data, run this command after the package is installed (from inside the S2AND directory):
[Expected download size is: 50.4 GiB]

aws s3 sync --no-sign-request s3://ai2-s2-research-public/s2and-release data/

If you run into cryptic errors about GCC on macOS while installing the requirments, try this instead:

CFLAGS='-stdlib=libc++' pip install -r requirements.in

Configuration

Modify the config file at data/path_config.json. This file should look like this

{
    "main_data_dir": "absolute path to wherever you downloaded the data to",
    "internal_data_dir": "ignore this one unless you work at AI2"
}

As the dummy file says, main_data_dir should be set to the location of wherever you downloaded the data to, and internal_data_dir can be ignored, as it is used for some scripts that rely on unreleased data, internal to Semantic Scholar.

How to use S2AND for loading data and training a model

Once you have downloaded the datasets, you can go ahead and load up one of them:

from os.path import join
from s2and.data import ANDData

dataset_name = "pubmed"
parent_dir = "data/pubmed/
dataset = ANDData(
    signatures=join(parent_dir, f"{dataset_name}_signatures.json"),
    papers=join(parent_dir, f"{dataset_name}_papers.json"),
    mode="train",
    specter_embeddings=join(parent_dir, f"{dataset_name}_specter.pickle"),
    clusters=join(parent_dir, f"{dataset_name}_clusters.json"),
    block_type="s2",
    train_pairs_size=100000,
    val_pairs_size=10000,
    test_pairs_size=10000,
    name=dataset_name,
    n_jobs=8,
)

This may take a few minutes - there is a lot of text pre-processing to do.

The first step in the S2AND pipeline is to specify a featurizer and then train a binary classifier that tries to guess whether two signatures are referring to the same person.

We'll do hyperparameter selection with the validation set and then get the test area under ROC curve.

Here's how to do all that:

from s2and.model import PairwiseModeler
from s2and.featurizer import FeaturizationInfo
from s2and.eval import pairwise_eval

featurization_info = FeaturizationInfo()
# the cache will make it faster to train multiple times - it stores the features on disk for you
train, val, test = featurize(dataset, featurization_info, n_jobs=8, use_cache=True)
X_train, y_train = train
X_val, y_val = val
X_test, y_test = test

# calibration fits isotonic regression after the binary classifier is fit
# monotone constraints help the LightGBM classifier behave sensibly
pairwise_model = PairwiseModeler(
    n_iter=25, calibrate=True, monotone_constraints=featurization_info.lightgbm_monotone_constraints
)
# this does hyperparameter selection, which is why we need to pass in the validation set.
pairwise_model.fit(X_train, y_train, X_val, y_val)

# this will also dump a lot of useful plots (ROC, PR, SHAP) to the figs_path
pairwise_metrics = pairwise_eval(X_test, y_test, pairwise_model.classifier, figs_path='figs/', title='example')
print(pairwise_metrics)

The second stage in the S2AND pipeline is to tune hyperparameters for the clusterer on the validation data and then evaluate the full clustering pipeline on the test blocks.

We use agglomerative clustering as implemented in fastcluster with average linkage. There is only one hyperparameter to tune.

from s2and.model import Clusterer, FastCluster
from hyperopt import hp

clusterer = Clusterer(
    featurization_info,
    pairwise_model,
    cluster_model=FastCluster(linkage="average"),
    search_space={"eps": hp.uniform("eps", 0, 1)},
    n_iter=25,
    n_jobs=8,
)
clusterer.fit(dataset)

# the metrics_per_signature are there so we can break out the facets if needed
metrics, metrics_per_signature = cluster_eval(dataset, clusterer)
print(metrics)

For a fuller example, please see the transfer script: scripts/transfer_experiment.py.

How to use S2AND for predicting with a saved model

Assuming you have a clusterer already fit, you can dump the model to disk like so

import pickle

with open("saved_model.pkl", "wb") as _pkl_file:
    pickle.dump(clusterer, _pkl_file)

You can then reload it, load a new dataset, and run prediction

import pickle

with open("saved_model.pkl", "rb") as _pkl_file:
    clusterer = pickle.load(_pkl_file)

anddata = ANDData(
    signatures=signatures,
    papers=papers,
    specter_embeddings=paper_embeddings,
    name="your_name_here",
    mode="inference",
    block_type="s2",
)
pred_clusters, pred_distance_matrices = clusterer.predict(anddata.get_blocks(), anddata)

Our released models are in the s3 folder referenced above, and are called production_model.pickle and full_union_seed_*.pickle. They can be loaded the same way, except that the pickled object is a dictionary, with a clusterer key.

Incremental prediction

There is a also a predict_incremental function on the Clusterer, that allows prediction for just a small set of new signatures. When instantiating ANDData, you can pass in cluster_seeds, which will be used instead of model predictions for those signatures. If you call predict_incremental, the full distance matrix will not be created, and the new signatures will simply be assigned to the cluster they have the lowest average distance to, as long as it is below the model's eps, or separately reclustered with the other unassigned signatures, if not within eps of any existing cluster.

Reproducibility

The experiments in the paper were run with the python (3.7.9) package versions in paper_experiments_env.txt. You can install these packages exactly by running pip install pip==21.0.0 and then pip install -r paper_experiments_env.txt --use-feature=fast-deps --use-deprecated=legacy-resolver. Rerunning on the branch s2and_paper should produce the same numbers as in the paper (we will udpate here if this becomes not true).

Licensing

The code in this repo is released under the Apache 2.0 license (license included in the repo. The dataset is released under ODC-BY (included in S3 bucket with the data). We would also like to acknowledge that some of the affiliations data comes directly from the Microsoft Academic Graph (https://aka.ms/msracad).

Citation

@misc{subramanian2021s2and, title={S2AND: A Benchmark and Evaluation System for Author Name Disambiguation}, author={Shivashankar Subramanian and Daniel King and Doug Downey and Sergey Feldman}, year={2021}, eprint={2103.07534}, archivePrefix={arXiv}, primaryClass={cs.DL} }

Comments
  • Find some wrong labels in dataset?

    Find some wrong labels in dataset?

    For example, in Pubmed dataset, in "clusters.json" file, There is a cluster “PM_352”: ['18834', '18835', '18836', '18837', '18838', '18839', '18840', '18841']. But I checked from "signatures.json", since '18834' in given_block "z zhang" while '18836' is in given_block "d zhang", how could they be in a same cluster? Is anything I misunderstand?

    opened by hapoyige 15
  • Add extra name incompatibility check

    Add extra name incompatibility check

    This PR attempts prevent new name incompatibilities from being added to a cluster. So if a claimed cluster contains S Govender and Sharlene Govender, s2and might break that claimed cluster up into two, and then attach Suendharan Govender to the S Govender piece, and then we we remerge, we have a cluster with S Govender, Sharlene Govender, and Suendharan Govender. I suspect this is the issue behind https://github.com/allenai/scholar/issues/27801#issuecomment-847397953, but did not verify that.

    opened by dakinggg 5
  • Question: Can predictions run in multi-core

    Question: Can predictions run in multi-core

    I see that the current implementation of the prediction using the production model will run on a single-core which is very slow when working with larger datasets. I was wondering if there is some already explored way of doing this using multiple cores if not a GPU?

    opened by jinamshah 4
  • global_dataset trick not working?

    global_dataset trick not working?

    @dakinggg I've got a branch going to make S2AND work for paper deduplication. I haven't really messed with your global_dataset trick (I think), but now it stopped working if n_jobs > 1. Works fine for when it's in serial.

    Test fails with FAILED tests/test_featurizer.py::TestData::test_featurizer - NameError: name 'global_dataset' is not defined

    Did you run into this when making it work originally? Any ideas?

    opened by sergeyf 2
  • Question: How does one go about converting their own dataset to the one used for training

    Question: How does one go about converting their own dataset to the one used for training

    Hello, I understand that this is not technically an issue but I just want to understand how to convert a dataset of my own ( that has information like the research paper name, the authors' details like name,affiliation, email id etc) to a dataset that can be consumed for training from scratch.

    opened by jinamshah 2
  • No cluster.json in the medline dataset.

    No cluster.json in the medline dataset.

    I found that medline dataset does not contain "medline_cluster.json" file which prevents me to reproduce the results. Please add the cluster.json file to S2AND.

    opened by skojaku 1
  • Link to evaluation dataset

    Link to evaluation dataset

    Thank you for this excellent open AND-algorithm and data!

    I followed the link from the paper to this repository, but I was not able to find the S2AND dataset. Could you add some help to the readme, please?

    opened by tomthe 1
  • Update readme.md

    Update readme.md

    I added intro language and it includes reference to the saved models. Are these uploaded already? If so, can you add a commit somewhere in the readme about it and maybe a short example about how to load them?

    opened by sergeyf 1
  • Thanks for the reply! Actually, I am using the dataset to do a clustering task so the divided block and cluster labels matter. Here comes another confusion for me, which is, I thought

    Thanks for the reply! Actually, I am using the dataset to do a clustering task so the divided block and cluster labels matter. Here comes another confusion for me, which is, I thought "block" came from original source data and "given_block" is the modified version of S2AND since the number of statistics matches the #Block in TableⅡ in the S2AND paper. Any suggestions?

    Thanks for the reply! Actually, I am using the dataset to do a clustering task so the divided block and cluster labels matter. Here comes another confusion for me, which is, I thought "block" came from original source data and "given_block" is the modified version of S2AND since the number of statistics matches the #Block in TableⅡ in the S2AND paper. Any suggestions?

    Originally posted by @hapoyige in https://github.com/allenai/S2AND/issues/25#issuecomment-1046418074

    opened by hapoyige 0
  • Be more explicit about use_cache to avoid

    Be more explicit about use_cache to avoid

    Zhipeng and the SPECTER+ team missed the cache specification and were debugging for a long time. These changes should hopefully make the cache easier to understand and notice.

    opened by sergeyf 0
  • Incremental bug

    Incremental bug

    Fixes an issue with the incremental code clustering where we were not splitting claimed profiles properly to align with the expected s2and output. The result was the incompatible clusters resulting from claims remained incompatible, and new mentions could not be assigned to them.

    opened by dakinggg 0
  • Future improvements

    Future improvements

    • [ ] Unify the set of languages between cld2 and fasttext (see unify_lang branch for a start)
    • [ ] Audit the list of name pairs (noticed (maria, mary), (kathleen, katherine))
    • [ ] Generally improve language detection on titles (would require a whole model)
    • [ ] if a person has two very disjoint "personas", they will end up as two clusters. Probably not resolvable, but putting here anyway
    • [ ] somehow do better with low information papers (e.g. no abstract, venue, affiliation, references)
    opened by dakinggg 0
Releases(v1.1_no_refs)
공공장소에서 눈만 돌리면 CCTV가 보인다는 말이 과언이 아닐 정도로 CCTV가 우리 생활에 깊숙이 자리 잡았습니다.

ObsCare_Main 소개 공공장소에서 눈만 돌리면 CCTV가 보인다는 말이 과언이 아닐 정도로 CCTV가 우리 생활에 깊숙이 자리 잡았습니다. CCTV의 대수가 급격히 늘어나면서 관리와 효율성 문제와 더불어, 곳곳에 설치된 CCTV를 개별 관제하는 것으로는 응급 상

5 Jul 07, 2022
Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition - NeurIPS2021

Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition Project Page | Video | Paper Implementation for Neural-PIL. A novel method wh

Computergraphics (University of Tübingen) 64 Dec 29, 2022
A distributed deep learning framework that supports flexible parallelization strategies.

FlexFlow FlexFlow is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization stra

528 Dec 25, 2022
DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models

DSEE Codes for [Preprint] DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models Xuxi Chen, Tianlong Chen, Yu Cheng, Weizhu Ch

VITA 4 Dec 27, 2021
Code for Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

🍐 quince Code for Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding 🍐 Installation $ git clone

Andrew Jesson 19 Jun 23, 2022
Unified file system operation experience for different backend

megfile - Megvii FILE library Docs: http://megvii-research.github.io/megfile megfile provides a silky operation experience with different backends (cu

MEGVII Research 76 Dec 14, 2022
This is a five-step framework for the development of intrusion detection systems (IDS) using machine learning (ML) considering model realization, and performance evaluation.

AB-TRAP: building invisibility shields to protect network devices The AB-TRAP framework is applicable to the development of Network Intrusion Detectio

Lab-C2DC - Laboratory of Command and Control and Cyber-security 17 Jan 04, 2023
Trafffic prediction analysis using hybrid models - Machine Learning

Hybrid Machine learning Model Clone the Repository Create a new Directory as assests and download the model from the below link Model Link To Start th

1 Feb 08, 2022
You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors

You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors In this paper, we propose a novel local descriptor-based fra

Haiping Wang 80 Dec 15, 2022
Perturb-and-max-product: Sampling and learning in discrete energy-based models

Perturb-and-max-product: Sampling and learning in discrete energy-based models This repo contains code for reproducing the results in the paper Pertur

Vicarious 2 Mar 14, 2022
Deep Federated Learning for Autonomous Driving

FADNet: Deep Federated Learning for Autonomous Driving Abstract Autonomous driving is an active research topic in both academia and industry. However,

AIOZ AI 12 Dec 01, 2022
Azua - build AI algorithms to aid efficient decision-making with minimum data requirements.

Project Azua 0. Overview Many modern AI algorithms are known to be data-hungry, whereas human decision-making is much more efficient. The human can re

Microsoft 197 Jan 06, 2023
SW components and demos for visual kinship recognition. An emphasis is put on the FIW dataset-- data loaders, benchmarks, results in summary.

FIW Data Development Kit Table of Contents Introduction Families In the Wild Database Publications Organization To Do License Getting Involved Introdu

Joseph P. Robinson 12 Jun 04, 2022
Two types of Recommender System : Content-based Recommender System and Colaborating filtering based recommender system

Recommender-Systems Two types of Recommender System : Content-based Recommender System and Colaborating filtering based recommender system So the data

Yash Kumar 0 Jan 20, 2022
Source code of "Hold me tight! Influence of discriminative features on deep network boundaries"

Hold me tight! Influence of discriminative features on deep network boundaries This is the source code to reproduce the experiments of the NeurIPS 202

EPFL LTS4 19 Dec 10, 2021
Multimodal Descriptions of Social Concepts: Automatic Modeling and Detection of (Highly Abstract) Social Concepts evoked by Art Images

MUSCO - Multimodal Descriptions of Social Concepts Automatic Modeling of (Highly Abstract) Social Concepts evoked by Art Images This project aims to i

0 Aug 22, 2021
A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

Eugenio Herrera 175 Dec 29, 2022
Pytorch implementation of PCT: Point Cloud Transformer

PCT: Point Cloud Transformer This is a Pytorch implementation of PCT: Point Cloud Transformer.

Yi_Zhang 265 Dec 22, 2022
Process text, including tokenizing and representing sentences as vectors and Applying some concepts like RNN, LSTM and GRU to create a classifier can detect the language in which a sentence is written from among 17 languages.

Language Identifier What is this ? The goal of this project is to create a model that is able to predict a given sentence language through text proces

Hossam Asaad 9 Dec 15, 2022
Reproduction of Vision Transformer in Tensorflow2. Train from scratch and Finetune.

Vision Transformer(ViT) in Tensorflow2 Tensorflow2 implementation of the Vision Transformer(ViT). This repository is for An image is worth 16x16 words

sungjun lee 42 Dec 27, 2022