Fluency ENhanced Sentence-bert Evaluation (FENSE), a metric for audio caption evaluation, and the benchmark datasets AudioCaps-Eval and Clotho-Eval.

Last update: Dec 23, 2022

Overview

FENSE

Fluency ENhanced Sentence-bert Evaluation (FENSE) is a metric for audio caption evaluation, proposed in the paper "Can Audio Captions Be Evaluated with Image Caption Metrics?"

The main branch contains an easy-to-use interface for fast evaluation of an audio captioning system.

An online demo is available at https://share.streamlit.io/blmoistawinde/fense/main/streamlit_demo/app.py

To get the datasets (AudioCaps-Eval and Clotho-Eval) and the code to reproduce the experiments, please refer to the experiment-code branch.

Installation

Clone the repository and pip install it.

git clone https://github.com/blmoistawinde/fense.git
cd fense
pip install -e .

Usage

Single Sentence

To get the detailed scores of each component for a single sentence:

from fense.evaluator import Evaluator

print("----Using tiny models----")
evaluator = Evaluator(device='cpu', sbert_model='paraphrase-MiniLM-L6-v2', echecker_model='echecker_clotho_audiocaps_tiny')

eval_cap = "An engine in idling and a man is speaking and then"
ref_cap = "A machine makes stitching sounds while people are talking in the background"

score, error_prob, penalized_score = evaluator.sentence_score(eval_cap, [ref_cap], return_error_prob=True)

print("Cand:", eval_cap)
print("Ref:", ref_cap)
print(f"SBERT sim: {score:.4f}, Error Prob: {error_prob:.4f}, Penalized score: {penalized_score:.4f}")
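The penalized score combines the two components printed above. The following is a minimal sketch of one way such a combination can work, not the package's own implementation: it assumes the similarity is scaled down by a fixed penalty whenever the error probability exceeds a threshold. The function name `penalize` and the default values of `threshold` and `penalty` are assumptions made for illustration.

```python
def penalize(sbert_sim: float, error_prob: float,
             threshold: float = 0.9, penalty: float = 0.9) -> float:
    """Scale a similarity score down when a fluency error is likely.

    Hypothetical helper: `threshold` and `penalty` are assumed values,
    not read from the FENSE package.
    """
    has_error = error_prob > threshold
    return sbert_sim * (1.0 - penalty) if has_error else sbert_sim


# A caption judged fluent keeps its similarity score unchanged.
print(penalize(0.60, 0.05))
# A caption flagged as disfluent keeps only (1 - penalty) of its score.
print(penalize(0.60, 0.99))
```

With this kind of rule, a caption that is semantically close to the references but ends mid-sentence is ranked well below a fluent caption with the same similarity.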

System Score

To get a system's overall score on a dataset by averaging sentence-level FENSE scores, use eval_system.py with your system outputs prepared in the same format as test_data/audiocaps_cands.csv or test_data/clotho_cands.csv.

For the AudioCaps test set:

python eval_system.py --device cuda --dataset audiocaps --cands_dir ./test_data/audiocaps_cands.csv

For the Clotho Eval set:

python eval_system.py --device cuda --dataset clotho --cands_dir ./test_data/clotho_cands.csv
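The system-level score is an average of sentence-level scores. A minimal sketch of that averaging follows; it is independent of eval_system.py and does not reflect the script's actual interface. The function name `system_score`, the dict-based input layout, and the toy scorer are all assumptions for illustration; `score_fn` stands in for any sentence-level scorer, for example a wrapper around `evaluator.sentence_score`.

```python
from statistics import mean
from typing import Callable, Dict, List


def system_score(
    cands: Dict[str, str],
    refs: Dict[str, List[str]],
    score_fn: Callable[[str, List[str]], float],
) -> float:
    """Average a sentence-level score over every candidate caption.

    Hypothetical helper: `cands` maps an audio id to the system caption,
    `refs` maps the same id to its list of reference captions.
    """
    missing = set(cands) - set(refs)
    if missing:
        raise KeyError(f"no references for: {sorted(missing)}")
    return mean(score_fn(cap, refs[key]) for key, cap in cands.items())


# Toy scorer used only to make the sketch runnable: word overlap (Jaccard)
# with the best-matching reference. Replace it with a real sentence scorer.
def toy_score(cand: str, ref_list: List[str]) -> float:
    cand_words = set(cand.lower().split())
    return max(
        len(cand_words & set(r.lower().split()))
        / len(cand_words | set(r.lower().split()))
        for r in ref_list
    )


cands = {"a": "a dog barks", "b": "rain falls"}
refs = {"a": ["a dog barks"], "b": ["wind blows", "rain falls hard"]}
print(system_score(cands, refs, toy_score))
```

Passing the scorer as an argument keeps the averaging logic separate from model loading, so the same loop can be reused with different SBERT and error-checker combinations.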

Performance Benchmark

We benchmark FENSE with different choices of SBERT model and error detector on the two benchmark datasets, AudioCaps-Eval and Clotho-Eval. (*) marks the combination reported in the paper.

AudioCaps-Eval

SBERT	echecker	HC	HI	HM	MM	total
paraphrase-MiniLM-L6-v2	none	62.1	98.8	93.7	75.4	80.4
paraphrase-MiniLM-L6-v2	tiny	57.6	94.7	89.5	82.6	82.3
paraphrase-MiniLM-L6-v2	base	62.6	98	82.5	85.4	85.5
paraphrase-TinyBERT-L6-v2	none	64	99.2	92.5	73.6	79.6
paraphrase-TinyBERT-L6-v2	tiny	58.6	95.1	88.3	82.2	82.1
paraphrase-TinyBERT-L6-v2	base	64.5	98.4	91.6	84.6	85.3(*)
paraphrase-mpnet-base-v2	none	63.1	98.8	94.1	74.1	80.1
paraphrase-mpnet-base-v2	tiny	58.1	94.3	90	83.2	82.7
paraphrase-mpnet-base-v2	base	63.5	98	92.5	85.9	85.9

Clotho-Eval

SBERT	echecker	HC	HI	HM	MM	total
paraphrase-MiniLM-L6-v2	none	59.5	95.1	76.3	66.2	71.3
paraphrase-MiniLM-L6-v2	tiny	56.7	90.6	79.3	70.9	73.3
paraphrase-MiniLM-L6-v2	base	60	94.3	80.6	72.3	75.3
paraphrase-TinyBERT-L6-v2	none	60	95.5	75.9	66.9	71.8
paraphrase-TinyBERT-L6-v2	tiny	59	93	79.7	71.5	74.4
paraphrase-TinyBERT-L6-v2	base	60.5	94.7	80.2	72.8	75.7(*)
paraphrase-mpnet-base-v2	none	56.2	96.3	77.6	65.2	70.7
paraphrase-mpnet-base-v2	tiny	54.8	91.8	80.6	70.1	73
paraphrase-mpnet-base-v2	base	57.1	95.5	81.9	71.6	74.9

Reference

If you use FENSE in your research, please cite:

@misc{zhou2021audio,
      title={Can Audio Captions Be Evaluated with Image Caption Metrics?}, 
      author={Zelin Zhou and Zhiling Zhang and Xuenan Xu and Zeyu Xie and Mengyue Wu and Kenny Q. Zhu},
      year={2021},
      eprint={2110.04684},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Releases(V0.1)

V0.1(Oct 2, 2021)

Owner

Zhiling Zhang
