
Composition Model

This library constructs syntactic compositional representations for text in an unsupervised manner. Covered areas include interpretability, text encoders, and generative language models.

Milestones

"R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling" (ACL2021), R2D2

We propose an unsupervised structured encoder that composes low-level constituents into high-level constituents without gold trees. The learned trees are highly consistent with human-annotated ones. The backbone of the encoder is a neural inside algorithm with heuristic pruning, so both time and space complexity are linear (an illustrative sketch of the inside pass is given below, after this list).

"Fast-R2D2: A Pretrained Recursive Neural Network based on Pruned CKY for Grammar Induction and Text Representation". (EMNLP2022),Fast_r2d2

We replace the heuristic pruning module used in R2D2 with model-based pruning.

"A Multi-Grained Self-Interpretable Symbolic-Neural Model For Single/Multi-Labeled Text Classification".(ICLR 2023), self-interpretable classification

We explore the interpretability of the structured encoder and find that the induced alignment between labels and spans is highly consistent with human rationales.

"Augmenting Transformers with Recursively Composed Multi-Grained Representations". (ICLR 2024) current branch.

We reduce the space complexity of the deep inside-outside algorithm from cubic to linear and further reduce the parallel time complexity to approximately log N thanks to the new pruning algorithm proposed in this paper. Furthermore, we find that joint pre-training of Transformers and composition models can enhance a variety of NLP downstream tasks.

"Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale". (preprint)

We propose GPST, a syntactic language model that can be pre-trained efficiently on raw text without any human-annotated trees. When GPST and GPT-2 are both pre-trained on OpenWebText from scratch, GPST outperforms GPT-2 on various downstream tasks. Moreover, it significantly surpasses previous methods on generative grammar induction tasks, exhibiting a high degree of consistency with human syntax. The code will be released soon.
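
For intuition about the inside algorithm mentioned in the R2D2 entry above, the following is a minimal, illustrative sketch of an unpruned CKY-style inside pass over span representations; it is not the repository's implementation. Here compose stands in for the learned Transformer-based composition function and toy_compose is purely a placeholder. For simplicity the sketch keeps only the best split per span, whereas the papers combine splits with learned weights. Without pruning the chart costs O(n^3) compositions; the heuristic (and later model-based) pruning described above restricts which cells are composed so that time and space stay linear.

import numpy as np

def inside_pass(leaves, compose):
    # Illustration only, not the repository's implementation.
    # leaves: list of n token representations (e.g., numpy vectors)
    # compose(left, right) -> (representation, score) for merging two adjacent constituents
    n = len(leaves)
    chart = {(i, i): (leaves[i], 0.0) for i in range(n)}  # single-token spans
    for length in range(2, n + 1):                        # enumerate span lengths bottom-up
        for i in range(n - length + 1):
            j = i + length - 1
            best = None
            for k in range(i, j):                         # enumerate split points
                left_rep, left_score = chart[(i, k)]
                right_rep, right_score = chart[(k + 1, j)]
                rep, score = compose(left_rep, right_rep)
                total = left_score + right_score + score
                if best is None or total > best[1]:
                    best = (rep, total)
            chart[(i, j)] = best                          # keep only the best split per span
    return chart[(0, n - 1)]                              # representation and score of the root

# Purely illustrative composition function standing in for the learned one:
toy_compose = lambda l, r: ((l + r) / 2.0, -float(np.linalg.norm(l - r)))
root_rep, root_score = inside_pass([np.random.randn(8) for _ in range(5)], toy_compose)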

Setup

Compile the C++ code.

python setup.py build_ext --inplace

Dataset preprocessing

Dataset: WikiText-103

Before pre-training, we preprocess the corpus by splitting raw texts into sentences, tokenizing them, and converting them into numpy format.

Split texts into sentences

python utils/data_processor.py --corpus_path PATH_TO_YOUR_CORPUS --task_type split --output_path PATH_TO_SPLIT_CORPUS
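
As a rough sketch of what the split step produces, the snippet below writes one sentence per line using NLTK's sent_tokenize as a stand-in splitter; the actual utils/data_processor.py may split differently, and the file paths are placeholders.

# Illustrative only: approximate the "split" step with NLTK's sentence tokenizer.
# The repository's utils/data_processor.py may use a different splitting strategy.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

with open("PATH_TO_YOUR_CORPUS", encoding="utf-8") as f_in, \
     open("PATH_TO_SPLIT_CORPUS", "w", encoding="utf-8") as f_out:
    for line in f_in:
        line = line.strip()
        if not line:
            continue
        for sentence in sent_tokenize(line):
            f_out.write(sentence + "\n")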

Tokenize raw texts and convert them into numpy format.

python utils/dataset_builder.py
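
For reference, here is a minimal sketch of what this step amounts to, assuming a HuggingFace BERT tokenizer and a plain .npy dump; utils/dataset_builder.py defines its own arguments and binary layout, so every name and path below is a placeholder.

# Illustrative only: tokenize one sentence per line and store the ids in numpy format.
# The actual utils/dataset_builder.py defines its own arguments and on-disk layout.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder vocab

ids, offsets = [], [0]
with open("PATH_TO_SPLIT_CORPUS", encoding="utf-8") as f:
    for sentence in f:
        sentence = sentence.strip()
        if not sentence:
            continue
        ids.extend(tokenizer.encode(sentence, add_special_tokens=False))
        offsets.append(len(ids))

np.save("corpus_ids.npy", np.asarray(ids, dtype=np.int32))          # flat token ids
np.save("corpus_offsets.npy", np.asarray(offsets, dtype=np.int64))  # sentence boundaries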

Pre-training

cd trainer

torchrun --standalone --nnodes=1 --nproc_per_node=1 r2d2+_mlm_pretrain.py --config_path ../data/en_config/r2d2+_config.json --model_type cio --parser_lr 1e-3 --corpus_path ../../corpus/PATH_TO_PREPROCESSED_CORPUS --input_type bin --vocab_path ../data/en_config --epochs 10 --output_dir ../PRETRAIN_MODEL_SAVE_DIR --min_len 2 --log_step 10 --batch_size 64 --max_batch_len 512 --save_step 2000 --cache_dir ../pretrain_cache --coeff_decline 0.00 --ascending

Downstream tasks

We have conducted downstream experiments on span-level tasks using OntoNotes 5.0, sentence-level tasks using GLUE, and structure analysis tasks using PTB. Please refer to the scripts/ directory for more details.

For scripts under scripts/span_tasks/transformer, the argument passed indicates the number of span attention layers in the Transformer; we experimented with 6 and 9.

Contact

aaron.hx@antgroup.com