Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

Last update: Dec 06, 2022

anlp21

Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley). Syllabus: http://people.ischool.berkeley.edu/~dbamman/info256.html

Notebook	Description
1.words/EvaluateTokenizationForSentiment	The impact of tokenization choices on sentiment classification.
1.words/ExploreTokenization	Different methods for tokenizing texts (whitespace, NLTK, spaCy, regex)
1.words/TokenizePrintedBooks	Design a better tokenizer for printed books
1.words/Text_Complexity	Implement type-token ratio and Flesch-Kincaid Grade Level scores for text
2.compare/ChiSquare, Mann-Whitney Tests	Explore two tests for finding distinctive terms
2.compare/Log-odds ratio with priors	Implement the log-odds ratio with an informative (and uninformative) Dirichlet prior
3.dictionaries/DictionaryTimeSeries	Plot sentiment over time using human-defined dictionaries
3.dictionaries/Empath	Explore using Empath dictionaries to characterize texts
4.embeddings/DistributionalSimilarity	Explore the distributional hypothesis to build high-dimensional, sparse representations for words
4.embeddings/WordEmbeddings	Explore word embeddings using Gensim
4.embeddings/Semaxis	Implement SemAxis for scoring terms along a user-defined axis (e.g., positive-negative, concrete-abstract, hot-cold)
4.embeddings/BERT	Explore the basics of token representations in BERT and use it to find token nearest neighbors
4.embeddings/SequenceEmbeddings	Use sequence embeddings to find TV episode summaries most similar to a short description
5.eda/WordSenseClustering	Infer distinct word senses using KMeans clustering over BERT representations
5.eda/Haiku KMeans	Explore text representation in clustering by trying to group haiku and non-haiku poems into two distinct clusters
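
As a rough illustration of what the 2.compare/Log-odds ratio with priors notebook covers, here is a minimal sketch of a log-odds ratio with a Dirichlet prior, in the style of Monroe, Colaresi and Quinn's "Fightin' Words" formulation. It is not taken from the course notebook: the function name `log_odds_with_prior`, the `prior` and `alpha` parameters, and the toy counts are all assumptions made for this example. With no `prior`, every word gets the same small pseudo-count (an uninformative prior); passing per-word pseudo-counts gives an informative one.

```python
import math
from collections import Counter


def log_odds_with_prior(counts_a, counts_b, prior=None, alpha=0.01):
    """Return {word: z-score}; positive scores lean toward corpus A.

    Hypothetical helper for illustration only.
    counts_a, counts_b: word -> count in each corpus.
    prior: optional word -> positive pseudo-count (informative prior).
    alpha: pseudo-count used for any word missing from `prior`.
    """
    prior = prior or {}
    vocab = set(counts_a) | set(counts_b) | set(prior)
    # Dirichlet pseudo-counts; the vocabulary needs at least two words.
    a = {w: prior.get(w, alpha) for w in vocab}
    a0 = sum(a.values())
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())

    scores = {}
    for w in vocab:
        y_a = counts_a.get(w, 0) + a[w]
        y_b = counts_b.get(w, 0) + a[w]
        # Difference of smoothed log-odds between the two corpora.
        delta = (math.log(y_a / (n_a + a0 - y_a))
                 - math.log(y_b / (n_b + a0 - y_b)))
        # Approximate variance of that difference.
        variance = 1.0 / y_a + 1.0 / y_b
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy counts, invented for this example.
corpus_a = Counter({"cat": 3, "the": 5})
corpus_b = Counter({"dog": 3, "the": 5})
scores = log_odds_with_prior(corpus_a, corpus_b)
for word, z in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{z:.3f}")
```

Dividing by the estimated standard deviation is what keeps very rare words from dominating the ranking, which is the usual motivation for preferring this score over a raw log-odds ratio.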

Owner

David Bamman
