PromptSource

Toolkit for collecting and applying templates of prompting instances.

WIP

Setup

  1. Download the repo
  2. Navigate to root directory of the repo
  3. Install requirements with pip install -r requirements.txt

Running

From the root directory of the repo, you can launch the editor with

streamlit run promptsource/promptsource.py

Writing Templates

A prompt template is expressed in Jinja.

It is rendered using an example from the corresponding Hugging Face datasets library (a dictionary). The separator ||| should appear once to divide the template into prompt and output. Generally, the prompt should provide information on the desired behavior, e.g., text passage and instructions, and the output should be a desired response.

Here's an example for AG News:

{{text}}
Is this a piece of news regarding world politics, sports, business, or technology? |||
{{ ["World politics", "Sport", "Business", "Technology"][label] }}

Contributing

This is very much a work in progress, and help is needed and appreciated. Anyone wishing to contribute code can contact Steve Bach for commit access, or submit PRs from forks. Some particular places you could start:

  1. Try to express things! Explore a dataset and tell us what's hard about creating the templates you want
  2. Look in the literature. Are there prompt creation methods that do/do not fit well right now?
  3. Scalability testing. Streamlit is lightweight, and we're reading and writing all prompts on refresh.

See also the design doc.

Before submitting a PR or pushing a new commit, please run style formatting and quality checks so that your newly added files look nice:

make style
make quality

Known Issues

Warning or Error about Darwin on OS X: Try downgrading PyArrow to 3.0.0.
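
For example:

pip install pyarrow==3.0.0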

Comments
  • Bias/fairness quantitative measurements

    Bias/fairness quantitative measurements

    Two goals:

    • have quantitative measurements for the paper's Broader impact section
    • reflect these in the model cards when we release checkpoints

    4 held out evaluation sets identified by the Evaluation WG:

    • jigsaw_toxicity_pred
    • crows_pairs
    • winogender (AXG in SuperGLUE)
    • winobias

    As of now, crows_pairs and winobias are prompted, jigsaw_toxicity_pred has an open PR (#451) that we need to check, and winogender still needs to be prompted.

    Workflow:

    • [x] prompt those that were not prompted yet
    • [x] making sure these were actually cached (@VictorSanh can have this caching step done fairly quickly)
    • [x] evaluation (normal and score rank evaluation) or maybe they have some special evaluation?
    • [x] when we know which checkpoints exactly to eval, get the final numbers to report
    opened by VictorSanh 22
  • remove language restrictions in tydiqa + add arabic prompts

    remove language restrictions in tydiqa + add arabic prompts

    • [x] Remove the if statement to allow English prompts to work across the Dataset.
    • [x] Add Arabic prompts for primary task subset with if statement to include only Arabic set.
    • [x] Add Arabic prompts for secondary task subset.
    opened by KhalidAlt 21
  • Tracking trainings and evals

    Tracking trainings and evals

    Main runs

    • D4 only (finetune-t5-xxl-lm-d4-091621-512)
      • [x] Training
      • [x] Eval
        • [x] Last
          • [x] Normal
          • [x] Rank
        • [x] 1'112'200 (half fine-tuning) (cf #456)
          • [x] Normal
          • [x] Rank
        • [x] 1'124'700 (second to last checkpoint)
          • [x] Normal
          • [x] Rank
      • [ ] SuperGLUE test (1380485) - RUNNING
      • [x] BigBench - see #469
    • D4 + GPT (finetune-t5-xxl-lm-d4-gpt-091621/)
      • [x] Training
      • [x] Eval
        • [x] Last
          • [x] Normal
          • [x] Rank
        • [x] 1'112'200 (half fine-tuning)
          • [x] Normal
          • [x] Rank
      • [x] BigBench - see #469 - RUNNING
    • D4 + GPT + Sglue (finetune-t5-xxl-lm-d4-all-091621)
      • [X] Training
      • [X] Eval
        • [x] Last
          • [x] Normal
          • [x] Rank
        • [x] 1'112'000 (half fine-tuning)
          • [x] Normal
          • [x] Rank
      • [x] BigBench - see #469 - RUNNING

    Ablations

    • Nb Template - 1 OG Template (finetune-t5-xxl-lm-d4-og-091621)
      • [X] Training
      • [X] Eval
        • [x] Last
          • [x] Normal
          • [x] Rank
        • [x] 1'112'000 (half fine-tuning)
          • [x] Normal
          • [x] Rank
    • all OG Templates (finetune-t5-xxl-lm-d4-og-all-091621) #465
      • [x] Training - RUNNING
      • [ ] Eval
        • [ ] Last
          • [ ] Normal
          • [ ] Rank
        • [x] 1'112'000 (half fine-tuning)
          • [ ] Normal
          • [x] Rank
    • Size XL (finetune-t5-xl-lm-d4-091621)
      • [X] Training
      • [X] Eval
        • [x] Last
          • [x] Normal
          • [x] Rank
        • [x] 1'112'000 (half fine-tuning)
          • [x] Normal
          • [x] Rank
    • D4 only duplicate (finetune-t5-xxl-lm-d4-091621/)
      • [x] Training
      • [ ] Eval
        • [ ] Last
          • [ ] Normal
          • [ ] Rank
        • [X] 1'112'000 (half fine-tuning)
          • [X] Normal
          • [X] Rank

    Baseline

    • T5 zero-shot baseline
      • [x] Eval - RUNNING under finetune-t5-xxl-lm-d4-091621
        • [x] Normal 1384362 - RUNNING
        • [x] Rank 1384360 - RUNNING
    opened by VictorSanh 18
  • Special eval metrics and scripts

    Special eval metrics and scripts

    Since we have the string outputs of all tasks, in principle we should be able to run arbitrary metrics, especially for datasets that require fancy metrics. Lintang has imported the official eval scripts for ReCoRD, SQuAD v2, Natural Questions, TriviaQA, and DROP.

    Update: Even when using Lintang's eval scripts, all extractive QAs and closed-book (generative) QAs still have abnormally low numbers, namely:

    • [ ] ReCoRD
    • [x] SQuAD2 (contains unanswerables)
    • [x] DROP
    • [ ] CoQA (contains unanswerables, multiple questions per example)
    • [ ] Quac (contains unanswerables, multiple questions per example)
    • [ ] Natural Questions
    • [x] TriviaQA
    • [x] WebQuestions

    I think all evals of extractive QA tasks from the training mixture also failed.

    (Note that ARC is closed-book, but its performance is fine because it's multiple-choice. A good case in point that machine task categories depend far more on format than on human skill/knowledge.)
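
    For context, these official scripts typically normalize strings before computing exact match; a minimal sketch of the standard SQuAD-style normalization (for illustration only, not the imported scripts themselves):

    import re
    import string

    def normalize_answer(s):
        # SQuAD-style normalization: lowercase, strip punctuation and articles,
        # and collapse whitespace before comparing prediction and gold strings.
        s = s.lower()
        s = "".join(ch for ch in s if ch not in set(string.punctuation))
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def exact_match(prediction, gold):
        return int(normalize_answer(prediction) == normalize_answer(gold))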

    Others with issues to keep an eye on:

    • [x] HellaSwag
    • [x] Lambada
    • [x] Winogrande
    evaluations 
    opened by awebson 14
  • Fix rendering: use simple `st.text` instead of `st.markdown`

    Fix rendering: use simple `st.text` instead of `st.markdown`

    The rendering is messed up in a bunch of places. A few issues that are symptomatic: #355 #326

    Replace st.markdown by st.text and extensively test that the rendering is better.
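
    As a minimal illustration of the proposed change (hypothetical snippet, not the actual app code):

    import streamlit as st

    rendered = "{{text}} ||| {{ answer_choices[label] }}"
    st.text(rendered)        # displays the string literally
    # st.markdown(rendered)  # would interpret the string as Markdown and can mangle it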

    opened by VictorSanh 13
  • Templates for `ncbi_disease`

    Templates for `ncbi_disease`

    Compared to the others, this one has been a real pain, but I think the templates are quite interesting. I'm not sure the Jinja style is top notch, but it seems functional on the example sets I checked.

    opened by drugilsberg 12
  • Creating unique identifier in the template.yaml

    Creating unique identifier in the template.yaml

    For now, it looks like we can sort of uniquely identify each template using a combination of template name and dataset name, but I'm expecting potential collisions when a lot of people start contributing. Besides, naming each template might not be useful (like if we end up with names like template1 template2 etc...), and it would help contributors if they don't have to add a name/check conflicts on the naming part before merging their template.yaml.

    I was thinking that we could add an ID to each entry by hashing the timestamp + dataset + the string of the prompt's Python function or Jinja template? That should be more than enough to prevent collisions.
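
    A minimal sketch of that idea (hypothetical helper, not an actual promptsource field):

    import hashlib
    import time

    def make_template_id(dataset_name, template_str):
        # Hash of timestamp + dataset name + template text, as proposed above.
        payload = f"{time.time()}|{dataset_name}|{template_str}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    template_id = make_template_id("ag_news", "{{text}} ||| {{label}}")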

    enhancement 
    opened by arnaudstiegler 12
  • Add prompts for DiaBLa dataset

    Add prompts for DiaBLa dataset

    Added a range of prompts for different tasks:

    • sentence-level translation
    • analogy-based translation
    • contextual translation (with context in different languages, automatically vs. manually translated)
    • detection of errors in translations
    • identification of machine translated output compared to human-produced translations
    opened by rbawden 11
  • AttributeError: 'NoneType' object has no attribute 'session_id'

    AttributeError: 'NoneType' object has no attribute 'session_id'

    Hi,

    Is anyone facing this issue while running streamlit run promptsource/app.py ?

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
        exitcode = _main(fd)
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 114, in _main
        prepare(preparation_data)
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 225, in prepare
        _fixup_main_from_path(data['init_main_from_path'])
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
        run_name="__mp_main__")
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 263, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "C:\Users\I355109\Anaconda3\envs\python37\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\Users\I355109\promptsource\promptsource\app.py", line 59, in <module>
        state = _get_state()
      File "c:\users\i355109\promptsource\promptsource\session.py", line 84, in _get_state
        session = _get_session()
      File "c:\users\i355109\promptsource\promptsource\session.py", line 74, in _get_session
        session_id = get_report_ctx().session_id
    AttributeError: 'NoneType' object has no attribute 'session_id'
    
    opened by manandey 11
  • Add conll2003 ner,pos,chunk task.

    Add conll2003 ner,pos,chunk task.

    Prompt Description:

    1. flat_question_with_label : Regular task. Labels are normalized in the case of POS tagging.
    2. flat_question_with_random_label : Users are not expected to always provide labels strictly from the dataset; they may provide a subset of labels. So here we provide a subset of labels. If a gold label in the sample is not available in the subset, we replace it with "O". When choosing random tags, we always include the "O" tag for NER, POS, and chunk labels (see the sketch after this list).
    3. flat_question_without_label : Regular task. No labels are provided.
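
    A minimal sketch of the relabeling idea in item 2 (hypothetical helper; the actual prompt logic lives in the Jinja template):

    import random

    def random_label_subset(gold_labels, all_labels, k=3):
        # Sample a subset of labels, always keeping "O", and map any gold label
        # outside the subset to "O".
        candidates = [label for label in all_labels if label != "O"]
        subset = set(random.sample(candidates, k)) | {"O"}
        relabeled = [label if label in subset else "O" for label in gold_labels]
        return sorted(subset), relabeled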

    POS label Normalization

    Both the NER and chunk tasks contain "O" tags, but POS doesn't contain an "O" tag.

    In the case of part-of-speech tags, a few labels look odd in a natural-language sense. For example, see a prompt with all POS labels:

    Generate parts of speech from the following sentence. The parts of speech tags are ", '', #, $, (, ), ,, ., :, ``, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNP, NNPS, NNS, NN|SYM, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB
    

    Here the first 9 labels are normalized to the "O" tag.
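
    A sketch of that normalization (the exact tag set is an assumption read off the prompt above):

    # Punctuation-like POS tags are collapsed to the "O" tag.
    PUNCTUATION_POS_TAGS = {'"', "''", "#", "$", "(", ")", ",", ".", ":", "``"}

    def normalize_pos_label(tag):
        return "O" if tag in PUNCTUATION_POS_TAGS else tag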

    Earlier Pull

    Earlier, zip was not available, so I wrote brute-force code with O(n^2) complexity. Now that zip is available, I rewrote the code with simpler notation and a loop with O(n) complexity. While merging, I messed up the previous pull https://github.com/bigscience-workshop/promptsource/pull/170, so I closed it and created this new pull.

    opened by sbmaruf 11
  • Some tweaks to the Editor

    Some tweaks to the Editor

    • Default sort by popularity and include number of prompts


    • Global progress table


    • Separated out new prompts and select, and the columns (I think this is okay, but might break)


    • Added text wrapping so that you can view long prompts

    • Added a template viewer section


    opened by srush 11
  • Merging xP3 & eval-hackathon

    Merging xP3 & eval-hackathon

    I would like to get all xP3 prompts merged into eval-hackathon & then have eval-hackathon be merged into main once & for all, but I'm not sure if you want that? cc @VictorSanh @stephenbach

    It would include adding:

    • Normal English prompts for various new & existing datasets (many PRs already open; will clean them up & request reviews when ready if someone gives me rights to do so / merge if I can & tests pass)
    • Long prompts https://github.com/bigscience-workshop/promptsource/pull/823/files ; Would request reviews once ready
    • Human-translated & Machine-translated prompts (Always suffixed with ht or mt, roughly 20K lines; see https://github.com/Muennighoff/promptsource/pull/47/files)

    Lmk if you're okay with that & I'll get started 🙂

    opened by Muennighoff 0
  • Regenerate all templates in unicode

    Regenerate all templates in unicode

    @KhalidAlt made a great suggestion to change

    yaml.dump(self.format_for_dump(), open(self.yaml_path, "w"))
    

    to

    yaml.dump(self.format_for_dump(), open(self.yaml_path, "w"), allow_unicode=True)
    

    in templates.py so that the YAML files store Unicode characters directly (rather than escaped code points). We should definitely do this, but we should do it when the corpus is relatively stable and we can regenerate (read and write) all the YAML in one commit.
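
    For example, with PyYAML:

    import yaml

    entry = {"prompt": "Quelle est la réponse?"}
    print(yaml.dump(entry))                      # é is written as an escaped code point
    print(yaml.dump(entry, allow_unicode=True))  # é is written as the character itself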

    opened by stephenbach 0
  • Fix field names for TREC

    Fix field names for TREC

    Changes to the field names (coarse_label and fine_label) were introduced here: https://huggingface.co/datasets/trec/commit/1f97567bdd2adedefe8abdaa9bd6ee0e6725b458

    opened by VictorSanh 0