decile-team/spear

Semi-Supervised Data Programming for Data Efficient Machine Learning

SPEAR is a library for data programming with semi-supervision. The package implements several recent data programming approaches, including facilities for programmatically labeling and building training data.

Pipeline

Design labeling functions (LFs)
Generate a pickle file containing labels by passing raw data to the LFs
Use one of the label aggregators (LAs) to get the final labels
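The three pipeline steps can be sketched in plain Python. This is a minimal illustration that uses only the standard library and does not use SPEAR's own API: the names `ABSTAIN`, `apply_lfs`, and `majority_vote` are hypothetical, and majority voting stands in for a label aggregator purely as an assumption for the example.

```python
import pickle
from collections import Counter

ABSTAIN = None   # hypothetical marker: the LF does not apply to this instance
SPAM, HAM = 1, 0

# Step 1: design labeling functions (simple keyword heuristics here).
def lf_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_win(text):
    return SPAM if "win" in text.lower() else ABSTAIN

def lf_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

LFS = [lf_free, lf_win, lf_greeting]

def apply_lfs(texts, lfs=LFS):
    """Return a label matrix: one row per instance, one column per LF."""
    return [[lf(t) for lf in lfs] for t in texts]

def majority_vote(row, default=ABSTAIN):
    """Aggregate one row of LF outputs; ties and all-abstain rows get `default`."""
    votes = Counter(v for v in row if v is not ABSTAIN)
    if not votes:
        return default
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]

texts = ["Win a FREE phone now", "Hello, are we still meeting today?", "See you at 5"]

# Step 2: pass raw data through the LFs and serialize the labels with pickle.
label_matrix = apply_lfs(texts)
blob = pickle.dumps(label_matrix)  # pickle.dump(label_matrix, fh) would write a file instead

# Step 3: load the stored labels and aggregate them into final labels.
final_labels = [majority_vote(row) for row in pickle.loads(blob)]
```

The aggregators implemented in SPEAR learn how much to trust each LF rather than counting votes equally; the sketch only shows how data flows between the three steps.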

SPEAR provides functionality such as

development of LFs/rules/heuristics for quick labeling
comparison against several data programming approaches
comparison against semi-supervised data programming approaches
subset selection to make the best use of annotation effort
storing and saving data in pickle files
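To illustrate what subset selection means in this setting, the sketch below greedily picks a small, representative set of instances to send for annotation. It is a plain-Python illustration, not Submodlib's API: the function name `facility_location_greedy` is hypothetical, and the choice of the facility-location objective (with non-negative similarities) is an assumption made for the example.

```python
def facility_location_greedy(similarity, budget):
    """Greedily pick up to `budget` indices that maximise
    sum_i max_{j in S} similarity[i][j], assuming non-negative similarities."""
    n = len(similarity)
    selected = []
    best_cover = [0.0] * n  # best similarity of each point to the selected set so far
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves the coverage of every point.
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]
    return selected

# Two clusters of three points each; points 1 and 4 are the cluster centres.
sim = [
    [1.0,  0.75, 0.5,  0.0,  0.0, 0.0],
    [0.75, 1.0,  0.75, 0.0,  0.0, 0.0],
    [0.5,  0.75, 1.0,  0.0,  0.0, 0.0],
    [0.0,  0.0,  0.0,  1.0,  0.5, 0.25],
    [0.0,  0.0,  0.0,  0.5,  1.0, 0.5],
    [0.0,  0.0,  0.0,  0.25, 0.5, 1.0],
]
to_annotate = facility_location_greedy(sim, budget=2)  # one centre per cluster
```

With a budget of two, the greedy pass selects one central point from each cluster, so the annotation effort covers both groups instead of being spent twice on the same one.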

Labelling Functions (LFs)

discrete LFs - users can define LFs that return discrete labels
continuous LFs - users can define LFs that return a continuous score/confidence along with the assigned label
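The difference between the two kinds of LF can be sketched as follows. This is a plain-Python illustration and not SPEAR's API: `ABSTAIN`, `TRIGGERS`, and both function names are hypothetical, and the convention of returning a `(label, score)` pair from a continuous LF is an assumption made for the example.

```python
ABSTAIN = None  # hypothetical marker: the LF does not apply to this instance
SPAM = 1
TRIGGERS = {"free", "win", "prize"}  # hypothetical trigger words

def discrete_lf(text):
    """Discrete LF: returns a label, or abstains."""
    words = set(text.lower().split())
    return SPAM if words & TRIGGERS else ABSTAIN

def continuous_lf(text):
    """Continuous LF: returns (label, score), where the score in [0, 1]
    is the fraction of trigger words found in the text."""
    words = set(text.lower().split())
    score = len(words & TRIGGERS) / len(TRIGGERS)
    return (SPAM, score) if score > 0 else (ABSTAIN, 0.0)
```

Both functions fire on the same messages; the continuous variant additionally reports how strongly the heuristic matched, which an aggregator can use as a confidence signal.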

Approaches Implemented

You can read this paper to learn about the approaches below:

Only-L
Learning to Reweight
Posterior Regularization
Imply Loss
CAGE
Joint Learning

The data folder for SMS and TREC can be found here. To run the notebooks or examples, place this folder in the same directory as the notebooks folder.

The zip file can also be downloaded directly from the command line using the gdown library:

pip install gdown
gdown 1CJZ73nNa7Ho0BOSDgGx9CRvXoepVSpet

Installation

Install the Submodlib library:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ submodlib

In case of installation issues with Submodlib, please consult the Submodlib GitHub repository.

Method 1

To install the latest version of the SPEAR package from PyPI:

pip install decile-spear

Method 2

SPEAR requires Python 3.6 or later. First install submodlib. Then install SPEAR:

git clone https://github.com/decile-team/spear.git
cd spear
pip install -r requirements/requirements.txt

Citation

@inproceedings{abhishek-etal-2022-spear,
    title = "{SPEAR} : Semi-supervised Data Programming in Python",
    author = "Abhishek, Guttu  and
      Ingole, Harshad  and
      Laturia, Parth  and
      Dorna, Vineeth  and
      Maheshwari, Ayush  and
      Ramakrishnan, Ganesh  and
      Iyer, Rishabh",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-demos.12",
    pages = "121--127",
}

Acknowledgment

SPEAR takes inspiration from, builds upon, and uses pieces of code from several open-source codebases, including Snorkel, Snuba, and Imply Loss. SPEAR also uses SUBMODLIB for subset selection, which is likewise provided by DECILE.

Team

SPEAR is created and maintained by Ayush, Abhishek, Vineeth, Harshad, Parth, Pankaj, Rishabh Iyer, and Ganesh Ramakrishnan. We look forward to making SPEAR more community driven. Please use it and contribute to it for your research, and feel free to use it in your commercial projects. We will add the major contributors here.

Related Publications

Divya Jyoti Bajpai, Ayush Maheshwari, Manjesh Kumar Hanawal, Ganesh Ramakrishnan (2024). FAIR: Filtering of Automatically Induced Rules. In EACL, 2024.
Akshat Gautam, Anurag Shandilya, Akshit Srivastava, Venkatapathy Subramanian, Ganesh Ramakrishnan, Kshitij Jadhav (2024). INSITE: labelling medical images using submodular functions and semi-supervised data programming. In ISBI, 2024.
Dhruv Kudale, Badri Vishal Kasuba, Venkatapathy Subramanian, Parag Chaudhuri, Ganesh Ramakrishnan. TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming. In WACV, 2024.
Ayush Maheshwari, Piyush Sharma, Preethi Jyothi, Ganesh Ramakrishnan (2023). DICTDIS: Dictionary Constrained Disambiguation for Improved NMT.
Abhishek Singh, Venkatapathy Subramanian, Ayush Maheshwari, Pradeep Narayan, Devi Prasad Shetty, Ganesh Ramakrishnan (2023). EIGEN: Expert-Informed Joint Learning Aggregation for High-Fidelity Information Extraction from Document Images. In Proceedings of ML4Health Conference, 2023 (co-located with NeurIPS).
Ayush Maheshwari, Ajay Ravindran, Venkatapathy Subramanian, Ganesh Ramakrishnan (2023). UDAAN - Machine Learning based Post-Editing tool for Document Translation. Best Paper Award in CODS-COMAD 2023.
Durga S, Ayush Maheshwari, Pradeep Shenoy, Prathosh AP, Ganesh Ramakrishnan (2022). Reweighing auxiliary losses in supervised learning. In AAAI 2023.
Maheshwari et al. Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming, In Findings of ACL (Long Paper) 2022.
Maheshwari, Ayush, et al. Data Programming using Semi-Supervision and Subset Selection, In Findings of ACL (Long Paper) 2021.
Sahay, Atul, et al. Rule augmented unsupervised constituency parsing, In Findings of ACL (Short Paper) 2021.
Chatterjee, Oishik, Ganesh Ramakrishnan, and Sunita Sarawagi. Data Programming using Continuous and Quality-Guided Labeling Functions, In AAAI 2020.
