Contains links to publicly available datasets for modeling health outcomes using speech and language.

Overview

speech-nlp-datasets

Contains links to publicly available datasets for modeling various health outcomes using speech and language.

Speech-based Corpora

TalkBank Project

  • [Corpus] CHILDES Database
    Contains speech of children with different conditions (e.g. Autism, Down's syndrome, hearing impairment) and across different languages (e.g. English, Dutch, Greek, Mandarin).
    MacWhinney, B. (2014). The CHILDES project: Tools for analyzing talk, Volume II: The database. Psychology Press.

  • [Corpus] DementiaBank (from TalkBank)
    Contains recordings of individuals with dementia across different languages. Includes around 400 subjects, most notable in size and containing control subjects is:

    • English Pitt: Longitudinal neuropsychological assessments of 319 subjects (dementia + control) performing Cookie Theft, Word Fluency, Story Recall, and Sentence Construction task. (Becker et al., 1994)
  • [Corpus] Clinical TalkBank
    In addition to DementiaBank, TalkBank contains:

    • RHDBank individuals with Right-Hemisphere Disorder
    • TBIBank individuals with Traumatic Brain Injury
    • AphasiaBank a communication disorder affecting ability to speak, write, and understand language due to some trauma to language parts of the brain.
    • FluencyBank contains individuals with language disfluencies due to being a second language learner, or due to stuttering.

Text-based Corpora

  • [Corpus] Reddit Self-reported Depression Diagnosis (RSDD) dataset
    Contains Reddit posts for ~9,000 users with a claim to depression and ~107,000 control users. (Yates et al., (2017))

  • [Corpus] MIMIC III (Medical Information Mart for Intensive Care)
    Contains medical details and outcomes of 40,000+ patients (e.g. demographics, vital signs, laboratory tests, medications) as well as 2M+ free-text written medical notes from medical personnel (e.g. physicians, nurses, etc.). (Johnson et al., (2016)).

  • i2b2/UTHealth NLP Task (contact authors for corpus?)
    Contains emergency medical records for 296 patients at Partners HealthCare and medical discharge and correspondance notes between medical personnel. Kumar et al., (2014) describes how the data was processed, and Stubbs et al. (2014) describes the 2014 task of identifying risk factors for heart disease over time.

  • Nun Study (contact authors for corpus?)
    Diaries of 93 nuns to used to evaluate cognitive impairment (Alzheimer's disease) in later life. Also contains neuropsychology tests and autopsy information. Study was authored by (Snowdon et al.,(1996))

Owner
Tuka Alhanai
Building technology to improve quality of life.
Tuka Alhanai
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
Search for documents in a domain through Google. The objective is to extract metadata

MetaFinder - Metadata search through Google _____ __ ___________ .__ .___ / \

Josué Encinar 85 Dec 16, 2022
FastFormers - highly efficient transformer models for NLU

FastFormers FastFormers provides a set of recipes and methods to achieve highly efficient inference of Transformer models for Natural Language Underst

Microsoft 678 Jan 05, 2023
Mlcode - Continuous ML API Integrations

mlcode Basic APIs for ML applications. Django REST Application Contains REST API

Sujith S 1 Jan 01, 2022
Unlimited Call - Text Bombing Tool

FastBomber Unlimited Call - Text Bombing Tool Installation On Termux

Aryan 6 Nov 10, 2022
基于pytorch+bert的中文事件抽取

pytorch_bert_event_extraction 基于pytorch+bert的中文事件抽取,主要思想是QA(问答)。 要预先下载好chinese-roberta-wwm-ext模型,并在运行时指定模型的位置。

西西嘛呦 31 Nov 30, 2022
AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

AI Assistant for Building Reliable, High-performing and Fair Multilingual NLP Systems

Microsoft 37 Nov 29, 2022
Grover is a model for Neural Fake News -- both generation and detectio

Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.

Rowan Zellers 856 Dec 24, 2022
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022
Pytorch version of BERT-whitening

BERT-whitening This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval". BERT-whitening is

Weijie Liu 255 Dec 27, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
Understanding the Difficulty of Training Transformers

Admin Understanding the Difficulty of Training Transformers Guided by our analyses, we propose Adaptive Model Initialization (Admin), which successful

Liyuan Liu 300 Dec 29, 2022
Transformers and related deep network architectures are summarized and implemented here.

Transformers: from NLP to CV This is a practical introduction to Transformers from Natural Language Processing (NLP) to Computer Vision (CV) Introduct

Ibrahim Sobh 138 Dec 27, 2022
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

34 Nov 24, 2022
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
Journey is a NLP-Powered Developer assistant

Journey Journey is a NLP-Powered Developer assistant Using on the powerful Natural Language Processing library Mindmeld, this projects aims to assist

Christian Eilers 21 Dec 11, 2022
Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

Michael Gale 1 Jan 06, 2022