Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

Last update: Dec 06, 2022

Related tags

Overview

anlp21

Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

Notebook	Description
1.words/EvaluateTokenizationForSentiment	The impact of tokenization choices on sentiment classification.
1.words/ExploreTokenization	Different methods for tokenizing texts (whitespace, NLTK, spacy, regex)
1.words/TokenizePrintedBooks	Design a better tokenizer for printed books
1.words/Text_Complexity	Implement type-token ratio and Flesch-Kincaid Grade Level scores for text
2.compare/ChiSquare, Mann-Whitney Tests	Explore two tests for finding distinctive terms
2.compare/Log-odds ratio with priors	Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior
3.dictionaries/DictionaryTimeSeries	Plot sentiment over time using human-defined dictionaries
3.dictionaries/Empath	Explore using Empath dictionaries to characterize texts
4.embeddings/DistributionalSimilarity	Explore distributional hypothesis to build high-dimensional, sparse representations for words
4.embeddings/WordEmbeddings	Explore word embeddings using Gensim
4.embeddings/Semaxis	Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold),
4.embeddings/BERT	Explore the basics of token representations in BERT and use it to find token nearest neighbors
4.embedings/SequenceEmbeddings	Use sequence embeddings to find TV episode summaries most similar to a short description
5.eda/WordSenseClustering	Inferring distinct word senses using KMeans clustering over BERT representations
5.eda/Haiku KMeans	Explore text representation in clustering by trying to group haiku and non-haiku poems into two distinct clusters

Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

Related tags

Overview

anlp21

Owner

David Bamman

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

Understand Text Summarization and create your own summarizer in python

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

🏖 Easy training and deployment of seq2seq models.

An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch

Recognition of 38 speech commands in russian. Based on Yandex Cup 2021 ML Challenge: ASR

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

The RWKV Language Model

lightweight, fast and robust columnar dataframe for data analytics with online update

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Use fastai-v2 with HuggingFace's pretrained transformers

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

Deal or No Deal? End-to-End Learning for Negotiation Dialogues

texlive expressions for documents

HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

kochat

NLP command-line assistant powered by OpenAI