Crosslingual Segmental Language Model

This repository contains the code for "Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages" (2021; C.M. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral). The code is a modified version of the repository from the original MSLM paper. The mslm package can be used to train and apply Segmental Language Models (SLMs).

In this repository, we additionally make available our preparation of the AmericasNLP 2021 multilingual dataset (see Data/AmericasNLP) and the target K'iche' data (Data/GlobalClassroom).

Paper Results

The results from the accompanying paper can be found in the Output directory. *.csv files contain statistics from the training run, *.out files contain the model output for the entire corpus, and *.score files contain the segmentation scores of the model output.
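To browse the results by file type, a small helper along the following lines may be convenient. This is only a sketch: the function name group_outputs and the category labels are illustrative and not part of the mslm package, and it assumes nothing about the layout of Output beyond the three file suffixes described above.

```python
from collections import defaultdict
from pathlib import Path

# Labels are illustrative; only the three suffixes come from the README.
KINDS = {
    ".csv": "training_stats",
    ".out": "model_output",
    ".score": "segmentation_scores",
}


def group_outputs(output_dir):
    """Group result files found under output_dir by their suffix."""
    grouped = defaultdict(list)
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix in KINDS:
            grouped[KINDS[path.suffix]].append(path)
    return dict(grouped)
```

Calling group_outputs("Output") from the repository root returns a dictionary mapping each label to the matching paths; files with other suffixes are ignored.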

The results from the October 2021 pre-print (which we will refer to as Experiment Set A) are reproducible on commit 2b89575. We will consider this the official commit of the October 2021 pre-print.

Usage

The top-level scripts for training and experimentation can be found in RunScripts. Almost all functionality is run through the __main__.py script in the mslm package, which can either train a model or evaluate and apply one. The PyTorch modules for building SLMs can be found in mslm.segmental_lm, modules for the span-masking Transformer are in mslm.segmental_transformer, and modules for sequence lattice-based computations are in mslm.lattice. The main script takes a configuration object that sets most parameters for model training and use (see mslm.mslm_config). To list the arguments to the main script, run:

python -m mslm --help
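For readers unfamiliar with lattice-based computations over segmentations, the sketch below illustrates the general idea in plain Python. It is not taken from mslm.lattice: it assumes the standard forward and Viterbi recursions used with segmental models, and seg_logprob is a hypothetical callback standing in for whatever model scores a candidate segment.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_marginal(seq_len, max_seg_len, seg_logprob):
    """Log-probability of a sequence, summed over all segmentations.

    seg_logprob(start, end) is a hypothetical callback returning the
    log-probability of the segment covering positions start..end-1.
    """
    alpha = [-math.inf] * (seq_len + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for end in range(1, seq_len + 1):
        starts = range(max(0, end - max_seg_len), end)
        alpha[end] = logsumexp([alpha[s] + seg_logprob(s, end) for s in starts])
    return alpha[seq_len]


def best_segmentation(seq_len, max_seg_len, seg_logprob):
    """Viterbi decoding: the highest-scoring list of (start, end) segments."""
    best = [-math.inf] * (seq_len + 1)
    back = [0] * (seq_len + 1)
    best[0] = 0.0
    for end in range(1, seq_len + 1):
        for start in range(max(0, end - max_seg_len), end):
            score = best[start] + seg_logprob(start, end)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow back-pointers from the end of the sequence to recover segments.
    segments, end = [], seq_len
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    return segments[::-1]
```

Both functions visit each position once and consider at most max_seg_len segment starts per position, so the cost is linear in sequence length for a fixed maximum segment length.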

Environment setup

pip install -r requirements.txt

This code requires Python 3.6 or later.
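A quick way to confirm that the active interpreter meets this requirement before installing dependencies is sketched below. The helper name python_is_supported is illustrative and is not part of this repository.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```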

Training

./RunScripts/run_mslm.sh

python -m mslm --input_file <input_file> \
    --model_path <model_path> \
    --mode train \
    --config_file <config_file> \
    --dev_file <dev_file> \
    [--preexisting]
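When launching training runs from Python rather than the shell, the same invocation can be assembled as an argument list. The sketch below uses only the flags shown above; the function names and any paths passed to them are hypothetical, and the meaning of --preexisting is left to the output of python -m mslm --help.

```python
import subprocess
import sys


def build_train_command(input_file, model_path, config_file, dev_file,
                        preexisting=False):
    """Assemble the argument list for a training run of the mslm package."""
    cmd = [
        sys.executable, "-m", "mslm",
        "--input_file", str(input_file),
        "--model_path", str(model_path),
        "--mode", "train",
        "--config_file", str(config_file),
        "--dev_file", str(dev_file),
    ]
    if preexisting:
        cmd.append("--preexisting")  # optional flag, shown in brackets above
    return cmd


def run_training(*args, **kwargs):
    """Run the assembled command and raise if the process exits with an error."""
    return subprocess.run(build_train_command(*args, **kwargs), check=True)
```

Passing a list to subprocess.run avoids shell quoting issues when paths contain spaces or apostrophes.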

Evaluation

./RunScripts/eval_mslm.sh

Evaluation also takes a text file containing all of the words from the training set.
