An Explainable Leaderboard for NLP

Overview

ExplainaBoard: An Explainable Leaderboard for NLP

Introduction | Website | Download | Backend | Paper | Video | Bib

Introduction

ExplainaBoard is an interpretable, interactive, and reliable leaderboard with seven (so far) new features (F) compared with generic leaderboards.

  • F1: Single-system Analysis: What is a system good or bad at?
  • F2: Pairwise Analysis: Where is one system better (worse) than another?
  • F3: Data Bias Analysis: What are the characteristics of different evaluated datasets?
  • F5: Common errors: What are the common mistakes that the top-5 systems make?
  • F6: Fine-grained errors: Where will errors occur?
  • F7: System Combination: Is there potential complementarity between different systems?

Website

We deploy ExplainaBoard as a Web toolkit, which includes 9 NLP tasks, 40 datasets and 300 systems. Detailed information is as follows.

Task

Task                      | Sub-task          | Dataset | Model | Attribute
Text Classification      | Sentiment         | 8       | 40    | 2
Text Classification      | Topics            | 4       | 18    | 2
Text Classification      | Intention         | 1       | 3     | 2
Text-Span Classification | Aspect Sentiment  | 4       | 20    | 4
Text-pair Classification | NLI               | 2       | 6     | 7
Sequence Labeling        | NER               | 3       | 74    | 9
Sequence Labeling        | POS               | 3       | 14    | 4
Sequence Labeling        | Chunking          | 3       | 14    | 9
Sequence Labeling        | CWS               | 7       | 64    | 7
Structure Prediction     | Semantic Parsing  | 4       | 12    | 4
Text Generation          | Summarization     | 2       | 36    | 7

Download System Outputs

We haven't released the datasets or the corresponding system outputs that require licenses. However, if you have the licenses, please fill in this form and we will send them to you privately. (A description of the output format can be found here.) If these system outputs are useful for you, please cite our work.

Test Your Results

pip install -r requirements.txt

Description of Each Directory

  • task-[task_name]: fine-grained analysis for each task, aiming to generate fine-grained analysis results in JSON format. For example, task-mlqa calculates the fine-grained F1 scores for different systems and outputs the corresponding JSON files to task-mlqa/output/ .

  • meta-eval is a sort of controller, which can be used to start the fine-grained analysis of all tasks and to analyze the output JSON files.

    • calculate fine-grained results for all tasks: ./meta-eval/run-allTasks.sh
        cd ./meta-eval/
        ./run-allTasks.sh
    • merge the JSON files of all tasks into a CSV file, which is useful for further SQL import: ./meta-eval/genCSV/json2csv.py (a minimal sketch of this merge is shown after this list)
        cd ./meta-eval/genCSV/
        python json2csv.py > explainaboard.csv
  • src stores some auxiliary code.
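
For reference, the JSON-to-CSV merge could look roughly like the following minimal sketch. The directory layout, file names, and flattening scheme here are assumptions for illustration, not the actual json2csv.py implementation.

    # Hypothetical sketch: collect per-task JSON analysis files and flatten them
    # into one CSV row per file. Paths and field names are assumptions.
    import csv
    import glob
    import json
    import sys

    rows = []
    for path in glob.glob("../task-*/output/*.json"):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        flat = {}
        for key, value in record.items():
            if isinstance(value, dict):
                # Flatten one level of nesting into "outer.inner" columns.
                for sub_key, sub_value in value.items():
                    flat[f"{key}.{sub_key}"] = sub_value
            else:
                flat[key] = value
        rows.append(flat)

    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(sys.stdout, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

Running python json2csv.py > explainaboard.csv would then redirect this CSV to a file, as in the command above.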

Submit Your Results

You can submit your system's outputs via this form, following the format description.

Acknowledgement

We thank all the authors who shared their system outputs with us: Ikuya Yamada, Stefan Schweter, Colin Raffel, Yang Liu, and Li Dong. We also thank Vijay Viswanathan, Yiran Chen, and Hiroaki Hayashi for useful discussion and feedback about ExplainaBoard.

Comments
  • Is the current applicable condition of t-test correct?

    opened by tetsuok 22
  • Allowed specification of the metric #dimensions

    This PR loosens the restriction that sufficient statistics must be a vector, and allows them to be a tensor whose number of dimensions equals Metric.stats_ndim().

    It also demonstrates how this works on the NLGMetaEvaluation metric.

    @pfliu-nlp and @odashi : could you please check this PR as a potential solution to the discussion in https://github.com/neulab/ExplainaBoard/pull/527 ?

    (sorry, after sending the review request I made a change of naming from dim->ndim, which I think is more in line with the naming in numpy)

    opened by neubig 12
  • test_generate_system_analysis in integration_tests.summarization_test.SummarizationTest is too slow

    commit 8c514c3d81a079d967d208f8bc330c2f202620bb (#437) increases the execution time of integration_tests.summarization_test.SummarizationTest. When I measured it on my GCP VM, the running time of the test increased by 430 seconds (from 6 seconds to 436 seconds), which is too slow to run as an automated test on pull requests. Slow tests need to be removed or replaced with more focused, faster tests. In general, slow tests drain productivity: updating pull requests takes longer, developers tend to pile large commits into pull requests to work around slow CI times, and pull requests become expensive to review, which makes it difficult to identify bugs or design flaws in code review.

    Repro steps

    rm -rf ~/.cache/explainaboard
    time python -m unittest -v integration_tests.summarization_test.SummarizationTest
    

    Output

    test_datalab_loader (integration_tests.summarization_test.SummarizationTest) ... skipped 'time consuming'
    test_default_features_dont_modify_condgen (integration_tests.summarization_test.SummarizationTest) ... ok
    test_generate_system_analysis (integration_tests.summarization_test.SummarizationTest) ... WARNING:datalabs.load:Couldn't find a directory or a dataset named 'cnn_dailymail' in this version. It was picked from the master branch on github instead.
    WARNING:datalabs.builder:No config specified, defaulting to: cnn_dailymail/3.0.0
    WARNING:datalabs.builder:Reusing dataset cnn_dailymail (/home/t/.cache/expressai/datalab/cnn_dailymail/3.0.0/3.0.0/6e2f5d689f0225c4f22eb78d11ba7a21399810c5cb853edafe39b1d006a1ff95)
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287113/287113 [06:20<00:00, 755.03it/s]
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 287113/287113 [00:29<00:00, 9616.19it/s]
    INFO:explainaboard:caching stats for cnn_dailymail None
    calculating example-level features: 3it [00:00, 51.88it/s]
    calculating token-level features: 3it [00:00, 139.83it/s]
    /home/t/explainaboard-fork/explainaboard/metrics/metric.py:336: DeprecationWarning: Use of keyword argument `alpha` for method `interval` is deprecated. Use first positional argument or keyword argument `confidence` instead.
      return stats_t.interval(
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 349.50it/s]
    ok
    test_generate_system_human_eval (integration_tests.summarization_test.SummarizationTest) ... skipped 'Not yet fixed in v0.11'
    test_load_tsv (integration_tests.summarization_test.SummarizationTest) ... ok
    
    ----------------------------------------------------------------------
    Ran 5 tests in 438.659s
    
    OK (skipped=2)
    python -m unittest -v integration_tests.summarization_test.SummarizationTest  434.35s user 2.58s system 98% cpu 7:22.46 total
    
    opened by tetsuok 12
  • Use 'confidence' instead of deprecated 'alpha' for scipy.stats.t.interval

    Reducing heavy logging uncovered buried DeprecationWarnings in the tests. We get the following DeprecationWarning in the tests that invoke the scipy.stats.t.interval method:

    test_hits (explainaboard.tests.test_metric.TestMetric) ... /home/runner/work/ExplainaBoard/ExplainaBoard/explainaboard/metrics/metric.py:338: DeprecationWarning: Use of keyword argument `alpha` for method `interval` is deprecated. Use first positional argument or keyword argument `confidence` instead.
    

    This PR fixes the warning as the warning suggests.
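
    Illustratively, the change amounts to the following (the values here are placeholders, not the actual ones used in metric.py):

    # Hedged sketch of the keyword rename; numbers are placeholders.
    from scipy.stats import t as stats_t

    dof, mean, se = 9, 0.5, 0.05
    # Deprecated spelling (emits the DeprecationWarning above):
    #   low, high = stats_t.interval(alpha=0.95, df=dof, loc=mean, scale=se)
    # Preferred spelling since SciPy 1.9:
    low, high = stats_t.interval(confidence=0.95, df=dof, loc=mean, scale=se)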

    opened by tetsuok 12
  • Cache pip dependencies to speed up CI

    This PR attempts to speed up both the unit-tests and integration-tests CI jobs. Every CI job spends about 2 minutes installing pip packages. This step dominates about 90% of the total time of unit-tests and about 30% of the total time of integration-tests. The step can be skipped by creating virtual environments and caching the installed packages in those environments using actions/cache. Note that actions/[email protected] doesn't support caching installed packages; it only avoids re-downloading by caching the packages downloaded from PyPI under ~/.cache/pip.

    Dependencies listed in setup.py are moved to requirements.txt. This makes it possible to generate lock files for every Python version from requirements.txt. The generated lock files are used as cache keys so that the caches are properly invalidated when dependencies are updated. Unless dependencies change, every CI job should be reproducible (with respect to installing pip dependencies). Making the CI jobs reproducible and faster comes at the expense of periodically updating these lock files. Maintaining lock files for dependencies is pretty common in other programming languages such as JS and Rust. The update can be done by running cicd/gen_requirements_lock.sh.

    opened by tetsuok 12
  • Refactor/loaders

    1. Commit 1: refactored Loader.__init__()
    • made data a required argument
    • all loaders now call the __init__ method of the base loader
    2. Commit 2: implemented file-specific loaders to simplify the task-specific loaders
    • implements TSVFileLoader, JSONFileLoader, DatalabFileLoader and CoNLLFileLoader, which know how to load a certain type of file given the fields
    • refactored all the existing loaders to use these file-specific loaders instead
    • QAMultipleChoiceLoader and KgLinkTailPredictionLoader still use custom load() methods because they support user-defined features. The way they load these extra features is different, so I decided to leave them for now. It will be easy to incorporate user-defined features into the file loaders (we just need to update the fields based on self.user_defined_features_configs)
    • hellaswag was removed in https://github.com/neulab/ExplainaBoard/commit/4b93b9542b714754eb91d718cd82b98ab706d11c
    • This refactor makes it easier to do #141 in the future: we just need two sets of file loaders for each task-specific loader, one for the (input, reference_output) file and one for the predictions file.

    Please let me know what you think! Thanks!
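
    As a toy illustration of the file-specific loader idea (not the actual ExplainaBoard classes or signatures; all names here are assumptions), the design could look roughly like this:

    # Toy sketch only: field names, class shapes, and signatures are illustrative.
    import csv
    import json
    from dataclasses import dataclass

    @dataclass
    class Field:
        src: object        # column index (TSV) or key name (JSON) in the source file
        target: str        # canonical field name used by the task-specific loader

    class TSVFileLoader:
        def load(self, path, fields):
            with open(path, encoding="utf-8") as f:
                rows = csv.reader(f, delimiter="\t")
                return [{fld.target: row[fld.src] for fld in fields} for row in rows]

    class JSONFileLoader:
        def load(self, path, fields):
            with open(path, encoding="utf-8") as f:
                examples = json.load(f)
            return [{fld.target: ex[fld.src] for fld in fields} for ex in examples]

    A task-specific loader would then only declare its fields and pick a file loader, instead of re-implementing the parsing logic.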

    opened by lyuyangh 12
  • Potential issue with spearman R bootstrapping

    We observed the following test failure when integrating another PR:

    ======================================================================
    FAIL: test_sample_level_spearmanr_bootstrap (integration_tests.meta_eval_wmt_da_test.MetaEvalNLGCITest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/runner/work/ExplainaBoard/ExplainaBoard/integration_tests/meta_eval_wmt_da_test.py", line 191, in test_sample_level_spearmanr_bootstrap
        self.assertAlmostEqual(ci[0], 0.6488, 2)
    AssertionError: 0.7325904563487001 != 0.6488 within 2 places (0.08379045634870008 difference)
    
    ----------------------------------------------------------------------
    

    We are not sure whether this is an issue with the test or with the underlying code, but as a temporary measure we reduced the sensitivity of the test. We should go back and check whether this is just due to bootstrapping variance or due to a bug in the test itself.
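
    For context, a minimal sketch of a paired bootstrap of Spearman's rho (synthetic data and a fixed seed; this is not the failing test's setup):

    # Illustrative only: resample (x, y) pairs with replacement and collect the
    # Spearman correlation of each resample to form a percentile interval.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = x + rng.normal(scale=0.5, size=200)

    stats = []
    for _ in range(1000):
        idx = rng.integers(0, len(x), size=len(x))
        stats.append(spearmanr(x[idx], y[idx]).correlation)

    low, high = np.percentile(stats, [2.5, 97.5])

    Because the percentile endpoints themselves vary from run to run, bootstrap variance is one plausible (non-bug) explanation worth checking for the flaky assertion above.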

    opened by neubig 10
  • Implement CalibrationAnalysis

    Calibration measures whether a system's confidence is well correlated with whether the system actually got the answer right. It would be nice if we could do analyses related to calibration, such as calculating the expected calibration error: https://arxiv.org/abs/1706.04599
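
    For reference, a minimal sketch of expected calibration error with equal-width confidence bins (illustrative only, not a proposed ExplainaBoard implementation; the binning scheme is an assumption):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        # Weighted average gap between per-bin mean confidence and accuracy.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if not mask.any():
                continue
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of examples in the bin
        return ece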

    I think this should probably be implemented as an additional variety of analysis, which would be simple and self-contained: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/analyses.py#L45

    good first issue new-analysis 
    opened by neubig 10
  • Correct training set feature field names

    Previously, calculation of training set features would fail if the datalab dataset used unconventional column names.

    This does the following things:

    1. Makes it an option to use Loader to load only datasets without system outputs, by setting output_data to None
    2. Changes _statistics_func to simply take in the samples and system info and return the statistics (in contrast to the previous use of the datalab aggregating() functionality)
    3. Loads the data used in calculating training set features through Loader, so that the appropriate field mapping is performed

    Fixes https://github.com/neulab/ExplainaBoard/issues/416

    Notably, @pfliu-nlp, "2." may require some discussion; here are the pros and cons of doing it this new way:

    Pros

    • it makes the statistics code self-contained and not reliant on an external library. Honestly, even though I'm very familiar with explainaboard, I was always a bit confused about what was actually going on here because the aggregating() decorator was a bit mysterious to me
    • statistics_func can now be called on any set of samples, so it could be called on a non-datalab dataset. This may be useful if we want to, for example, calculate training set features for custom datasets

    Cons

    • the datalab aggregating operator may implement parallelism, so this aggregation of statistics might be faster there, but I'm not sure whether that's actually the case in practice
    • something else I'm missing?
    opened by neubig 9
  • Unsafe en_core_web_sm downloading in setup.py

    Currently setup.py executes an external command, python -m spacy download en_core_web_sm, to install a spaCy model during setup. This approach has several issues regarding system consistency:

    • spaCy models are intentionally not registered on PyPI, and PyPI does not allow libraries to depend on external requirements.
    • The command is just a system command, which could possibly break the system or fail to work correctly.

    Since there is no recommended way to add spaCy models to install_requires, we need to take one of the following approaches:

    • Download the model programmatically when spacy.load() fails (a minimal sketch follows this list).
    • Bundle the model file into this repository.
    • Ask users to download the appropriate models separately.
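
    For instance, the first option could look roughly like this (a common spaCy pattern, shown here only as a sketch, not the repository's code):

    import spacy

    def load_spacy_model(name: str = "en_core_web_sm"):
        try:
            return spacy.load(name)
        except OSError:
            # The model is not installed yet: download it once, then retry.
            from spacy.cli import download
            download(name)
            return spacy.load(name)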
    opened by odashi 9
  • How to name metrics when registering them

    There are two ways to name metrics:

    (1)

    
    @dataclass
    @metric_config_registry.register("AccuracyConfig")
    class AccuracyConfig(MetricConfig):
        def to_metric(self):
            return Accuracy(self)
    
    

    (2)

    @dataclass
    @metric_config_registry.register("Accuracy")
    class AccuracyConfig(MetricConfig):
        def to_metric(self):
            return Accuracy(self)
    
    

    Currently, we are using (1), which, however, is inconsistent with how the Processor names them. For example:

    https://github.com/neulab/ExplainaBoard/blob/cd54c1b61e490295db8c1cfee8460aff4cce1880/explainaboard/processors/text_classification.py#L132

    Which one do you prefer?

    If we go with (2), this code should be modified to avoid a naming bug: https://github.com/neulab/ExplainaBoard/blob/cd54c1b61e490295db8c1cfee8460aff4cce1880/explainaboard/metrics/registry.py#L11

    config_cls = metric_config_registry.get_type(dikt["name"]) # instead of type
    

    I could send a PR for this.

    opened by pfliu-nlp 8
  • add tests for meval to replicate paper results

    Overview

    This PR adds tests to verify whether our implemented meta-evaluation processor is able to replicate reported results from existing published papers.

    Relevant issue: https://github.com/inspired-co/taskboard/issues/180

    Details

    • Collect system outputs for two metrics (rouge1 and bartscore) from this repo
    • Use ExplainaBoard to process these outputs and compare the results with those reported in the above repo.

    References

    • Paper: BARTScore: Evaluating Generated Text as Text Generation
    • Code: https://github.com/neulab/BARTScore
    opened by pfliu-nlp 0
  • `TypeError: 'type' object is not subscriptable` when attempt to import or use CLI

    How did I install it?

    pip install explainaboard
    or
    pip install -U --force-reinstall explainaboard
    

    Both cause the same problem.

    Version: 0.12.3

    When trying to import explainaboard or run explainaboard from the CLI, I get the same error:

    Python 3.8.15 (default, Nov 24 2022, 15:19:38) 
    [GCC 11.2.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import explainaboard
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/__init__.py", line 6, in <module>
        from explainaboard.loaders import DatalabLoaderOption, get_loader_class
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/loaders/__init__.py", line 5, in <module>
        from explainaboard.loaders import file_loader, loader_factory
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/loaders/file_loader.py", line 18, in <module>
        from explainaboard.analysis.analyses import Analysis
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/analysis/analyses.py", line 14, in <module>
        from explainaboard.analysis.bucketing import get_bucketing_method
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/analysis/bucketing.py", line 13, in <module>
        from explainaboard.serialization.types import SerializableData
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/serialization/__init__.py", line 8, in <module>
        from explainaboard.serialization.types import Serializable
      File "/home/cpu12595/miniconda3/envs/nlppytorch/lib/python3.8/site-packages/explainaboard/serialization/types.py", line 21, in <module>
        list["PrimitiveData"],  # type: ignore
    TypeError: 'type' object is not subscriptable
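
    For context (not part of the original report): subscripting the built-in list type at runtime, as in list["PrimitiveData"], is only supported from Python 3.9 onward, so on the Python 3.8 interpreter shown above it raises exactly this TypeError. A minimal, hedged reproduction:

    import sys

    PrimitiveData = int  # stand-in for the real alias in serialization/types.py

    if sys.version_info >= (3, 9):
        Alias = list[PrimitiveData]    # built-in generics work on 3.9+
    else:
        from typing import List
        Alias = List[PrimitiveData]    # on 3.8, list[...] itself raises the TypeError above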
    
    
    opened by ttpro1995 0
  • Bump mypy version to 0.990

    Since mypy 0.990 was released yesterday (blog post), it would be better to bump the mypy version to 0.990 to take advantage of the new features and bug fixes. It seems some effort will be needed to adopt the new version: running mypy 0.990 on the explainaboard codebase produces the output below from pre-commit run mypy --color=never --all-files.

    mypy.....................................................................Failed
    - hook id: mypy
    - exit code: 1
    
    explainaboard/utils/spacy_loader.py:5: error: Cannot find implementation or library stub for module named "spacy"  [import]
    explainaboard/utils/spacy_loader.py:6: error: Cannot find implementation or library stub for module named "spacy.language"  [import]
    explainaboard/utils/agreement.py:5: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/sum_attribute.py:8: error: Cannot find implementation or library stub for module named "nltk"  [import]
    explainaboard/analysis/sum_attribute.py:10: error: Cannot find implementation or library stub for module named "nltk.util"  [import]
    explainaboard/utils/async_eaas.py:10: error: Cannot find implementation or library stub for module named "eaas"  [import]
    explainaboard/third_party/text_to_sql_test_suit_eval/parse.py:7: error: Cannot find implementation or library stub for module named "sqlparse"  [import]
    explainaboard/third_party/text_to_sql_test_suit_eval/parse.py:8: error: Cannot find implementation or library stub for module named "sqlparse.sql"  [import]
    explainaboard/third_party/text_to_sql_test_suit_eval/parse.py:9: error: Cannot find implementation or library stub for module named "sqlparse.tokens"  [import]
    setup.py:3: error: Skipping analyzing "setuptools": module is installed, but missing library stubs or py.typed marker  [import]
    explainaboard/metrics/auxiliary/qa_table_text_hybrid_auxiliary.py:16: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/auxiliary/qa_table_text_hybrid_auxiliary.py:17: error: Cannot find implementation or library stub for module named "scipy.optimize"  [import]
    explainaboard/utils/logging.py:9: error: Library stubs not installed for "tqdm"  [import]
    explainaboard/utils/logging.py:9: note: Hint: "python3 -m pip install types-tqdm"
    explainaboard/utils/logging.py:9: note: (or run "mypy --install-types" to install all missing stub packages)
    explainaboard/utils/logging.py:16: error: Incompatible default for argument "desc" (default has type "None", argument has type "str")  [assignment]
    explainaboard/utils/logging.py:16: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
    explainaboard/utils/logging.py:16: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
    explainaboard/visualizers/bar_chart.py:8: error: Cannot find implementation or library stub for module named "matplotlib"  [import]
    explainaboard/visualizers/bar_chart.py:9: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/bucketing.py:10: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/feature.py:239: error: Incompatible types in assignment (expression has type "Dict[str, FeatureType]", target has type "SerializableData")  [assignment]
    explainaboard/utils/agreement_test.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/utils/typing_utils_test.py:10: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/serialization/serializers.py:53: error: Incompatible return value type (got "Union[List[Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]], Tuple[Union[None, bool, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable], ...]]", expected "Union[None, bool, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]")  [return-value]
    explainaboard/serialization/serializers.py:53: error: Generator has incompatible item type "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"; expected "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [misc]
    explainaboard/serialization/serializers.py:89: error: Incompatible return value type (got "Union[List[Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]], Tuple[Union[None, bool, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]], ...]]", expected "Union[None, bool, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]")  [return-value]
    explainaboard/serialization/serializers.py:89: error: Generator has incompatible item type "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [misc]
    explainaboard/utils/tensor_analysis.py:12: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/metric.py:10: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/metric.py:11: error: Cannot find implementation or library stub for module named "scipy.stats"  [import]
    explainaboard/metrics/metric.py:178: error: Dict entry 0 has incompatible type "str": "Dict[str, MetricValue]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/metrics/metric.py:196: error: Argument 1 to "MetricResult" has incompatible type "Dict[str, Union[None, bool, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]]"; expected "Dict[str, MetricValue]"  [arg-type]
    explainaboard/third_party/text_to_sql_test_suit_eval/process_sql.py:30: error: Cannot find implementation or library stub for module named "nltk"  [import]
    explainaboard/utils/tokenizer.py:15: error: Cannot find implementation or library stub for module named "sacrebleu.tokenizers"  [import]
    explainaboard/utils/tokenizer.py:16: error: Cannot find implementation or library stub for module named "sacrebleu.tokenizers.tokenizer_intl"  [import]
    explainaboard/utils/tokenizer.py:17: error: Cannot find implementation or library stub for module named "sacrebleu.tokenizers.tokenizer_ja_mecab"  [import]
    explainaboard/utils/tokenizer.py:18: error: Cannot find implementation or library stub for module named "sacrebleu.tokenizers.tokenizer_zh"  [import]
    explainaboard/metrics/continuous.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/metric_test.py:10: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/external_eval.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/meta_evaluation.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/meta_evaluation.py:9: error: Cannot find implementation or library stub for module named "scipy"  [import]
    explainaboard/analysis/feature_test.py:69: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/feature_test.py:134: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/feature_test.py:205: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:230: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:231: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:232: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:233: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Collection[str]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:234: error: List item 0 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:234: error: List item 1 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:234: error: List item 2 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:235: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Tuple[Dict[str, object], Dict[str, object], Dict[str, object]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:237: error: Dict entry 0 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/serialization/serializers_test.py:237: error: Dict entry 1 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/serialization/serializers_test.py:237: error: Dict entry 2 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/serialization/serializers_test.py:240: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:241: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:242: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:243: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Collection[str]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:244: error: List item 0 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:244: error: List item 1 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:244: error: List item 2 has incompatible type "Dict[str, object]"; expected "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [list-item]
    explainaboard/serialization/serializers_test.py:245: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Tuple[Dict[str, object], Dict[str, object], Dict[str, object]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/serialization/serializers_test.py:247: error: Dict entry 0 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/serialization/serializers_test.py:247: error: Dict entry 1 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/serialization/serializers_test.py:247: error: Dict entry 2 has incompatible type "str": "Dict[str, object]"; expected "str": "Union[None, int, float, str, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]"  [dict-item]
    explainaboard/metrics/eaas.py:9: error: Cannot find implementation or library stub for module named "eaas.async_client"  [import]
    explainaboard/metrics/eaas.py:10: error: Cannot find implementation or library stub for module named "eaas.config"  [import]
    explainaboard/metrics/eaas.py:11: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/eaas.py:12: error: Cannot find implementation or library stub for module named "sacrebleu"  [import]
    explainaboard/metrics/eaas.py:13: error: Cannot find implementation or library stub for module named "sacrebleu.metrics.base"  [import]
    explainaboard/metrics/eaas.py:13: error: Cannot find implementation or library stub for module named "sacrebleu.metrics"  [import]
    explainaboard/metrics/ranking.py:9: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/performance.py:51: error: Dict entry 1 has incompatible type "str": "List[int]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/performance.py:52: error: Dict entry 2 has incompatible type "str": "Dict[str, MetricResult]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/performance.py:72: error: Argument 1 to "float" has incompatible type "Union[str, None, int, float, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"; expected "Union[SupportsFloat, SupportsIndex, str, bytes, bytearray, memoryview, array[Any], mmap, _CData, PickleBuffer]"  [arg-type]
    explainaboard/analysis/performance.py:73: error: Argument 1 to "float" has incompatible type "Union[str, None, int, float, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"; expected "Union[SupportsFloat, SupportsIndex, str, bytes, bytearray, memoryview, array[Any], mmap, _CData, PickleBuffer]"  [arg-type]
    explainaboard/metrics/log_prob.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/accuracy.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/external_eval_test.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/performance_test.py:219: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/performance_test.py:241: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/metrics/qa_table_text_hybrid.py:10: error: Cannot find implementation or library stub for module named "numpy"  [import]
    integration_tests/meta_eval_nlg_test.py:5: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/accuracy_test.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/analyses.py:12: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/analyses.py:245: error: Dict entry 0 has incompatible type "str": "List[BucketPerformance]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/analyses.py:446: error: Dict entry 0 has incompatible type "str": "List[BucketPerformance]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/analyses.py:563: error: Argument "bucket_setting" to "__call__" of "BucketingFn" has incompatible type "List[Tuple[float, float]]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses.py:563: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses.py:563: note: Consider using "Sequence" instead, which is covariant
    explainaboard/analysis/analyses.py:658: error: Dict entry 2 has incompatible type "str": "List[int]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/analyses.py:722: error: Dict entry 1 has incompatible type "str": "List[ComboOccurence]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/analyses.py:841: error: Dict entry 1 has incompatible type "str": "Dict[str, FeatureType]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/analyses.py:842: error: Dict entry 2 has incompatible type "str": "Dict[str, MetricConfig]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/metrics/extractive_qa.py:11: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/analysis/analyses_test.py:90: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Collection[str]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:237: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[BucketPerformance]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses_test.py:237: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses_test.py:237: note: Consider using "Sequence" instead, which is covariant
    explainaboard/analysis/analyses_test.py:266: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:280: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:321: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[ComboOccurence]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses_test.py:321: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses_test.py:321: note: Consider using "Sequence" instead, which is covariant
    explainaboard/analysis/analyses_test.py:328: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Union[Sequence[str], None, int, float, List[PrimitiveData], Tuple[PrimitiveData, ...], Dict[str, PrimitiveData]]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:350: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:477: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[BucketPerformance]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses_test.py:477: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses_test.py:477: note: Consider using "Sequence" instead, which is covariant
    explainaboard/analysis/analyses_test.py:507: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, object]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/analyses_test.py:518: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "Dict[str, FeatureType]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses_test.py:518: note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses_test.py:518: note: Consider using "Mapping" instead, which is covariant in the value type
    explainaboard/analysis/analyses_test.py:519: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "Dict[str, MetricConfig]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/analyses_test.py:519: note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/analyses_test.py:519: note: Consider using "Mapping" instead, which is covariant in the value type
    explainaboard/analysis/result.py:33: error: Dict entry 0 has incompatible type "str": "Dict[str, Dict[str, MetricResult]]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/analysis/result.py:34: error: Dict entry 1 has incompatible type "str": "List[AnalysisResult]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/loaders/file_loader.py:15: error: Cannot find implementation or library stub for module named "datalabs"  [import]
    explainaboard/loaders/file_loader.py:16: error: Cannot find implementation or library stub for module named "datalabs.features.features"  [import]
    explainaboard/loaders/file_loader.py:212: error: Incompatible default for argument "fields" (default has type "None", argument has type "List[FileLoaderField]")  [assignment]
    explainaboard/loaders/file_loader.py:212: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
    explainaboard/loaders/file_loader.py:212: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
    explainaboard/loaders/file_loader.py:475: error: Incompatible default for argument "fields" (default has type "None", argument has type "List[FileLoaderField]")  [assignment]
    explainaboard/loaders/file_loader.py:475: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
    explainaboard/loaders/file_loader.py:475: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
    explainaboard/loaders/file_loader.py:522: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/analysis/result_test.py:35: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Dict[str, MetricResult]]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/result_test.py:36: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[AnalysisResult]"; expected "SerializableData"  [arg-type]
    explainaboard/analysis/result_test.py:36: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/analysis/result_test.py:36: note: Consider using "Sequence" instead, which is covariant
    explainaboard/third_party/text_to_sql_test_suit_eval/exec_eval.py:11: error: Library stubs not installed for "tqdm"  [import]
    explainaboard/info.py:186: error: Dict entry 11 has incompatible type "str": "List[AnalysisLevel]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/info.py:187: error: Dict entry 12 has incompatible type "str": "List[Analysis]"; expected "str": "Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]"  [dict-item]
    explainaboard/info.py:260: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Union[None, int, float, str, List[SerializableData], Tuple[SerializableData, ...], Dict[str, SerializableData], Serializable]]"; expected "PrimitiveData"  [arg-type]
    explainaboard/analysis/feature_funcs.py:8: error: Cannot find implementation or library stub for module named "lexicalrichness"  [import]
    explainaboard/analysis/feature_funcs.py:8: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
    explainaboard/analysis/feature_funcs.py:9: error: Cannot find implementation or library stub for module named "sacrebleu"  [import]
    explainaboard/meta_analyses/ranking.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/meta_analyses/ranking.py:9: error: Cannot find implementation or library stub for module named "pandas"  [import]
    explainaboard/metrics/f1_score.py:9: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/processors/processor.py:9: error: Cannot find implementation or library stub for module named "eaas.async_client"  [import]
    explainaboard/processors/processor.py:10: error: Cannot find implementation or library stub for module named "eaas.config"  [import]
    explainaboard/processors/sequence_labeling.py:43: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/processors/argument_pair_extraction.py:34: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/processors/qa_tat.py:7: error: Cannot find implementation or library stub for module named "datalabs"  [import]
    explainaboard/processors/language_modeling.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/processors/conditional_generation.py:9: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/processors/cloze_generative.py:8: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/processors/summarization.py:8: error: Cannot find implementation or library stub for module named "datalabs.operations.featurize.plugins.summarization.sum_attribute"  [import]
    integration_tests/summarization_test.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    integration_tests/meta_eval_wmt_da_test.py:7: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/text_to_sql.py:11: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/metrics/f1_score_test.py:7: error: Cannot find implementation or library stub for module named "sklearn.metrics"  [import]
    explainaboard/visualizers/draw_charts.py:24: error: Cannot find implementation or library stub for module named "matplotlib"  [import]
    explainaboard/visualizers/draw_charts.py:25: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/info_test.py:116: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[AnalysisLevel]"; expected "SerializableData"  [arg-type]
    explainaboard/info_test.py:116: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/info_test.py:116: note: Consider using "Sequence" instead, which is covariant
    explainaboard/info_test.py:117: error: Argument 1 to "serialize" of "PrimitiveSerializer" has incompatible type "List[Analysis]"; expected "SerializableData"  [arg-type]
    explainaboard/info_test.py:117: note: "List" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
    explainaboard/info_test.py:117: note: Consider using "Sequence" instead, which is covariant
    explainaboard/info_test.py:160: error: Argument 1 to "deserialize" of "PrimitiveSerializer" has incompatible type "Dict[str, Union[Collection[str], None, int, float, List[PrimitiveData], Tuple[PrimitiveData, ...]]]"; expected "PrimitiveData"  [arg-type]
    integration_tests/metric_test.py:6: error: Cannot find implementation or library stub for module named "eaas"  [import]
    integration_tests/metric_test.py:7: error: Cannot find implementation or library stub for module named "eaas.async_client"  [import]
    integration_tests/metric_test.py:9: error: Cannot find implementation or library stub for module named "numpy"  [import]
    explainaboard/explainaboard_main.py:10: error: Cannot find implementation or library stub for module named "eaas.endpoint"  [import]
    explainaboard/explainaboard_main.py:10: error: Cannot find implementation or library stub for module named "eaas"  [import]
    explainaboard/explainaboard_main.py:89: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:90: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:91: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:92: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:93: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:94: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:364: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:365: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:367: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:368: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:369: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:370: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:371: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:390: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:401: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:402: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:403: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:404: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:405: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:406: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:407: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:408: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    explainaboard/explainaboard_main.py:499: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
    integration_tests/cli_test.py:10: error: Cannot find implementation or library stub for module named "datalabs"  [import]
    Found 141 errors in 59 files (checked 231 source files)
    
    opened by tetsuok 0
  • add_tasks.md is out of date

    It seems add_tasks.md is out of date. add_tasks.md mentions tasks.py in the three places below:

    • https://github.com/neulab/ExplainaBoard/blame/fcedd5d7aab172b943c6b0025685b09744f149fd/docs/add_new_tasks.md#L6
    • https://github.com/neulab/ExplainaBoard/blame/fcedd5d7aab172b943c6b0025685b09744f149fd/docs/add_new_tasks.md#L12
    • https://github.com/neulab/ExplainaBoard/blame/fcedd5d7aab172b943c6b0025685b09744f149fd/docs/add_new_tasks.md#L133

    but the Python script was removed in #373. add_tasks.md needs to be updated accordingly.

    opened by tetsuok 0
  • Add system metadata class

    Processor.process() takes metadata, which is used to directly initialize SysOutputInfo. However, these are essentially different data (in particular, "metadata" is a subset of SysOutputInfo, not equal to it), and the current implementation causes some confusion around this:

    The most significant abuse of this behavior is that FileLoaderMetadata is implicitly converted into SysOutputInfo. This shouldn't work without an explicit conversion: https://github.com/neulab/ExplainaBoard/blob/4cec0a01cbe2617e9a67a440be25ee4252f792b2/integration_tests/ner_test.py#L148-L154

    To this end, we need:

    • A struct defining the system metadata (a toy sketch follows this list).
    • Change the behavior of Processor to take the system metadata, not a dict.
    • Either:
      • A conversion method between system metadata and FileLoaderReturn/SysOutputInfo
      • Include system metadata as a direct member of FileLoaderReturn/SysOutputInfo
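
    As a toy sketch of the first bullet (field names here are purely illustrative, not a concrete proposal from this issue):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SystemMetadata:
        # Illustrative container for user-supplied system information.
        system_name: str
        task_name: str
        dataset_name: Optional[str] = None
        metric_names: List[str] = field(default_factory=list)

    Processor.process() could then accept such a struct and convert it explicitly into (part of) SysOutputInfo, rather than accepting a loosely typed dict.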
    opened by odashi 3
  • Reconsider default number of buckets

    Currently the default number of buckets is 4: https://github.com/neulab/ExplainaBoard/blob/38db95801cbd15e2e9b2db7b60c40bd7173e1deb/explainaboard/analysis/analyses.py#L117

    But this is probably too few when we're doing discrete bucketing. It would probably be better to keep the default at 4 for continuous bucketing and use more buckets (maybe 10) for discrete bucketing.

    opened by neubig 0
Releases (v0.8.5)
  • v0.8.5(Apr 2, 2022)

    This release:

    • Refactors the metrics class and the report structure.
    • Adds significance tests to all metrics.
    • Makes major code style improvements and adds type checking.
    • Fixes several bugs.
Owner
NeuLab
Graham Neubig's Lab at LTI/CMU
NeuLab
NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор СоАвтор is a platform and an open set of tools for newsrooms and freelance journalists, designed to make the process of creating content

38 Jan 04, 2023
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Ελληνικά νέα (Python script) / Greek News Feed (Python script)

Ελληνικά νέα (Python script) / Greek News Feed (Python script) Ελληνικά English In 2017 I had implemented a Python script to display the current

Loren Kociko 1 Jun 14, 2022
The first online catalogue for Arabic NLP datasets.

Masader The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each datas

ARBML 94 Dec 26, 2022
NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

Victor Dibia 220 Dec 11, 2022
Code for ACL 2021 main conference paper "Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances".

Conversations are not Flat: Modeling the Intrinsic Information Flow between Dialogue Utterances This repository contains the code and pre-trained mode

ICTNLP 90 Dec 27, 2022
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
A toolkit for document-level event extraction, containing some SOTA model implementations

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker Source code for ACL-IJCNLP 2021 Long paper: Document-le

84 Dec 15, 2022
A complete NLP guideline for enthusiasts

NLP-NINJA A complete guide for Natural Language Processing in Python Table of Contents S.No. Topic Level Meaning 1 Tokenization 🤍 Beginner 2 Stemming

MAINAK CHAUDHURI 22 Dec 27, 2022
MEDIALpy: MEDIcal Abbreviations Lookup in Python

A small python package that allows the user to look up common medical abbreviations.

Aberystwyth Systems Biology 7 Nov 09, 2022
PUA Programming Language written in Python.

pua-lang PUA Programming Language written in Python. Installation git clone https://github.com/zhaoyang97/pua-lang.git cd pua-lang pip install . Try

zy 4 Feb 19, 2022
Smart discord chatbot integrated with Dialogflow

academic-NLP-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization This repo is for our paper "Enhanced Seq2Seq Autoencode

Rachel Zheng 14 Nov 01, 2022
Natural language computational chemistry command line interface.

nlcc Install pip install nlcc Must have Open-AI Codex key: export OPENAI_API_KEY=your key here then nlcc key bindings ctrl-w copy to clipboard (Note

Andrew White 37 Dec 14, 2022
Use the power of GPT3 to execute any function inside your programs just by giving some doctests

gptrun Don't feel like coding today? Use the power of GPT3 to execute any function inside your programs just by giving some doctests. How is this diff

Roberto Abdelkader Martínez Pérez 11 Nov 11, 2022
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
A telegram bot to translate 100+ Languages

🔥 GOOGLE TRANSLATER 🔥 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🔰 Deploy To Railway 🔰 • • ✅ OFF

Aɴᴋɪᴛ Kᴜᴍᴀʀ 5 Dec 20, 2021
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation

Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation Official Code Repository for the paper "Unsupervised Documen

NLP*CL Laboratory 2 Oct 26, 2021