Dataset and Source code of paper 'Enhancing Keyphrase Extraction from Academic Articles with their Reference Information'.

Overview

Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and code for paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

The research content of this project is to analyze the impact of the introduction of reference title in scientific literature on the effect of keyword extraction. This project uses three datasets: SemEval-2010, PubMed and LIS-2000, which are located in the dataset folder. At the same time, we use two unsupervised methods: TF-IDF and TextRank, and three supervised learning methods: NaiveBayes, CRF and BiLSTM-CRF. The first four are traditional keywords extraction methods, located in the folder ML, and the last one is deep learning method, located in the folder DL.

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run deep learning model
│  ml.bat:                            Batch commands to run traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of CRF algorithm implementation(Use pytorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation.
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation.
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of CRF algorithm implementation(Use CRF++ Toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of naivebayes algorithm implementation(Use KEA-3.0 Toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit

Dataset Description

The dataset includes the following three json files:

  • SemEval-2010: SemEval-2010 Task 5 dataset, it contains 244 scientific papers and can be visited at: https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

    Each line of the json file includes:

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): references list and only the title of each reference is provided.
  • keywords (K): the keywords of the paper and these keywords were annotated manually.

    Quick Start

    In order to facilitate the reproduction of the experimental results, the project uses bat batch command to run the program uniformly (only in Windows Environment). The dl.bat file is the batch command to run the deep learning model, and the ml.bat file is the batch command to run the traditional algorithm.

    How does it work?

    In the Windows environment, use the key combination Win + R and enter cmd to open the DOS command box, and switch to the project's root directory (Keyphrase_Extraction). Then input dl.bat, that is, run deep learning model to get the result of keyword extraction; Enter ml.bat to run traditional algorithm to get keywords Extract the results.

    Experimental results

    The following figures show that the influence of reference information on keyphrase extraction results of TF*IDF, TextRank, NB, CRF and BiLSTM-CRF.

    Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of SemEval-2010 Table1

    Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of PubMed Table2

    Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the dataset of LIS-2000 Table3

    Note: The yellow, green and blue bold fonts in the table represent the largest of the P, R and F1 value obtained from different corpora using the same model, respectively.

    Dependency packages

    Before running this project, check that the following Python packages are included in your runtime environment.

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2

    Citation

    Please cite the following paper if you use this codes and dataset in your work.

    Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]

  • Owner
    Professor at iSchool of Nanjing University of Science and Technology
    📚 A collection of all the Deep Learning Metrics that I came across which are not accuracy/loss.

    📚 A collection of all the Deep Learning Metrics that I came across which are not accuracy/loss.

    Rahul Vigneswaran 1 Jan 17, 2022
    CSD: Consistency-based Semi-supervised learning for object Detection

    CSD: Consistency-based Semi-supervised learning for object Detection (NeurIPS 2019) By Jisoo Jeong, Seungeui Lee, Jee-soo Kim, Nojun Kwak Installation

    80 Dec 15, 2022
    A simple and extensible library to create Bayesian Neural Network layers on PyTorch.

    Blitz - Bayesian Layers in Torch Zoo BLiTZ is a simple and extensible library to create Bayesian Neural Network Layers (based on whats proposed in Wei

    Pi Esposito 722 Jan 08, 2023
    Rethinking Transformer-based Set Prediction for Object Detection

    Rethinking Transformer-based Set Prediction for Object Detection Here are the code for the ICCV paper. The code is adapted from Detectron2 and AdelaiD

    Zhiqing Sun 62 Dec 03, 2022
    pytorch implementation of dftd2 & dftd3

    torch-dftd pytorch implementation of dftd2 [1] & dftd3 [2, 3] Install # Install from pypi pip install torch-dftd # Install from source (for developer

    33 Nov 28, 2022
    This repository is based on Ultralytics/yolov5, with adjustments to enable polygon prediction boxes.

    Polygon-Yolov5 This repository is based on Ultralytics/yolov5, with adjustments to enable polygon prediction boxes. Section I. Description The codes a

    xinzelee 226 Jan 05, 2023
    [ICML 2021] Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data

    Break-It-Fix-It: Learning to Repair Programs from Unlabeled Data This repo provides the source code & data of our paper: Break-It-Fix-It: Unsupervised

    Michihiro Yasunaga 86 Nov 30, 2022
    A setup script to generate ITK Python Wheels

    ITK Python Package This project provides a setup.py script to build ITK Python binary packages and infrastructure to build ITK external module Python

    Insight Software Consortium 59 Dec 14, 2022
    Rate-limit-semaphore - Semaphore implementation with rate limit restriction for async-style (any core)

    Rate Limit Semaphore Rate limit semaphore for async-style (any core) There are t

    Yan Kurbatov 4 Jun 21, 2022
    Official Implementation of 'UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers' ICLR 2021(spotlight)

    UPDeT Official Implementation of UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers (ICLR 2021 spotlight) The

    hhhusiyi 96 Dec 22, 2022
    Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

    DynaBOA Code repositoty for the paper: Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation Shanyan Guan, Jingwei Xu, Michell

    197 Jan 07, 2023
    EfficientMPC - Efficient Model Predictive Control Implementation

    efficientMPC Efficient Model Predictive Control Implementation The original algo

    Vin 8 Dec 04, 2022
    A JAX-based research framework for writing differentiable numerical simulators with arbitrary discretizations

    jaxdf - JAX-based Discretization Framework Overview | Example | Installation | Documentation ⚠️ This library is still in development. Breaking changes

    UCL Biomedical Ultrasound Group 65 Dec 23, 2022
    Roger Labbe 13k Dec 29, 2022
    A real world application of a Recurrent Neural Network on a binary classification of time series data

    What is this This is a real world application of a Recurrent Neural Network on a binary classification of time series data. This project includes data

    Josep Maria Salvia Hornos 2 Jan 30, 2022
    Physics-informed Neural Operator for Learning Partial Differential Equation

    PINO Physics-informed Neural Operator for Learning Partial Differential Equation Abstract: Machine learning methods have recently shown promise in sol

    107 Jan 02, 2023
    Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021)

    Bridging the Gap between Label- and Reference based Synthesis(ICCV 2021) Tensorflow implementation of Bridging the Gap between Label- and Reference-ba

    huangqiusheng 8 Jul 13, 2022
    PyTorch implementation of Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy

    Anomaly Transformer in PyTorch This is an implementation of Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. This pape

    spencerbraun 160 Dec 19, 2022
    Cross-media Structured Common Space for Multimedia Event Extraction (ACL2020)

    Cross-media Structured Common Space for Multimedia Event Extraction Table of Contents Overview Requirements Data Quickstart Citation Overview The code

    Manling Li 49 Nov 21, 2022
    Neural network pruning for finding a sparse computational model for controlling a biological motor task.

    MothPruning Scientific Overview Originally inspired by biological nervous systems, deep neural networks (DNNs) are powerful computational tools for mo

    Olivia Thomas 0 Dec 14, 2022