Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and source code for the paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

This project analyzes how adding the titles of cited references to the text of scientific articles affects keyphrase extraction performance. It uses three datasets, SemEval-2010, PubMed and LIS-2000, which are located in the Dataset folder. Five methods are evaluated: two unsupervised methods, TF-IDF and TextRank, and three supervised methods, Naive Bayes, CRF and BiLSTM-CRF. The first four are traditional keyphrase extraction methods and are located in the ML folder; the last is a deep learning method and is located in the DL folder.
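
For orientation, the sketch below shows the basic TF-IDF scoring idea behind the unsupervised baseline. It is a toy illustration only, not the repository's implementation in ML/tf_idf.py or DL/tf_idf.py; the corpus and tokenization here are made up for the example.

    import math
    from collections import Counter

    # Toy corpus; in the project, documents are built from different logical
    # structures of a paper (title, abstract, references, etc.).
    documents = [
        "keyphrase extraction from academic articles",
        "reference information improves keyphrase extraction",
        "deep learning models for sequence labeling",
    ]

    def tf_idf_scores(doc_tokens, corpus_tokens):
        """Score each term of one document by term frequency * inverse document frequency."""
        tf = Counter(doc_tokens)
        n_docs = len(corpus_tokens)
        scores = {}
        for term, count in tf.items():
            df = sum(1 for doc in corpus_tokens if term in doc)
            idf = math.log(n_docs / (1 + df)) + 1
            scores[term] = (count / len(doc_tokens)) * idf
        return scores

    corpus_tokens = [doc.split() for doc in documents]
    scores = tf_idf_scores(corpus_tokens[0], corpus_tokens)
    print(sorted(scores, key=scores.get, reverse=True)[:3])  # top-3 candidate terms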

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run deep learning model
│  ml.bat:                            Batch commands to run traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of CRF algorithm implementation (uses the PyTorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation.
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation.
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of CRF algorithm implementation (uses the CRF++ Toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of Naive Bayes algorithm implementation (uses the KEA-3.0 Toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit

Dataset Description

The Dataset folder includes the following three JSON files:

  • SemEval-2010: The SemEval-2010 Task 5 dataset; it contains 244 scientific papers and is available at: https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

Each line of a JSON file includes the following fields (a minimal loading sketch follows the list):

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): The reference list of the paper; only the title of each reference is provided.
  • keywords (K): The keywords of the paper, annotated manually.
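
Below is a minimal loading sketch, assuming each line of a dataset file is a standalone JSON object with the fields listed above; the file name Dataset/PubMed.json used here is illustrative and may differ from the actual file names in the repository.

    import json

    def load_papers(path):
        """Load a dataset file in which each line is one JSON-encoded paper."""
        papers = []
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    papers.append(json.loads(line))
        return papers

    # Illustrative path; adjust to the actual file in the Dataset folder.
    papers = load_papers("Dataset/PubMed.json")
    paper = papers[0]
    print(paper["title"])       # title (T)
    print(paper["references"])  # reference titles (R)
    print(paper["keywords"])    # manually annotated keyphrases (K)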

Quick Start

To make the experimental results easy to reproduce, the project is run through Windows batch files (Windows only). The dl.bat file runs the deep learning model, and the ml.bat file runs the traditional algorithms.

How does it work?

In Windows, press Win + R and enter cmd to open a command prompt, then switch to the project's root directory (Keyphrase_Extraction). Enter dl.bat to run the deep learning model and obtain its keyphrase extraction results, or enter ml.bat to run the traditional algorithms and obtain theirs.

Experimental results

The following tables show the influence of reference information on the keyphrase extraction results of TF-IDF, TextRank, NB, CRF and BiLSTM-CRF.

Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the SemEval-2010 dataset

Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the PubMed dataset

Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the LIS-2000 dataset

Note: The yellow, green and blue bold fonts in the tables mark the largest P, R and F1 values obtained from different corpora using the same model, respectively.
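
Here P, R and F1 denote precision, recall and F1 over the extracted keyphrases. As a point of reference, a minimal exact-match evaluation sketch (an illustration only, not the repository's ML/evaluate.py) could look like this:

    def precision_recall_f1(predicted, gold):
        """Exact-match precision, recall and F1 for one paper's keyphrases."""
        predicted = {p.lower().strip() for p in predicted}
        gold = {g.lower().strip() for g in gold}
        correct = len(predicted & gold)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    # Toy example with made-up keyphrases: 2 of 3 predictions match the gold set.
    print(precision_recall_f1(
        ["keyphrase extraction", "reference information", "deep learning"],
        ["keyphrase extraction", "reference information", "academic articles"],
    ))  # -> approximately (0.667, 0.667, 0.667)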

Dependency packages

Before running this project, make sure the following Python packages are installed in your environment:

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2

Citation

Please cite the following paper if you use this code or dataset in your work.

Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]
