Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

Overview

FREE_7773

Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to achieve and, especially what is NOT, please refer to the slides in the relevant folder.

Last update: December 2021.

Notes:

  • for unforseen issues with user permissions in the AWS Academy, the original serverless deployment we explained for MLSys could not be used. While the code is still in this repo for someone who wants to try with their own account, a local Flask app serving a model is provided as an alternative in the project folder.

Prequisites: Dependencies

Different sub-projects may have different requirements, as specified in the requirements.txt files to be found in the various folders. We recommend using virtualenv to keep environments isolated, i.e. creating a new environment:

python3 -m venv venv

then activating it and installing the required dependencies:

source venv/bin/activate

pip install -r requirements.txt

Repo Structure

The repo is organized by folder: each folder contains either resources - e.g. text corpora or slides - or Python programs, divided by type.

As far as ML is concerned, language-related topics are typically covered through notebooks, MLSys-related concepts are covered through Python scripts (not surprisingly!).

Data

The folder contains some ready-made text files to experiment with some NLP techniques: these corpora are just examples, and everything can be pretty much run in the same fashion if you swap these files (and change the appropriate variables) with other textual data you like better.

MLSys

This folder contains script covering MLSys concepts: how to organize a ML project, how to publish a model in the cloud etc.. In particular:

  • serverless_101 contains a vanilla AWS Lambda endpoint computing explicitely the Y value of a regression model starting from an X input provided by the client.
  • serverless_sagemaker contains an AWS Lambda endpoint which uses a Sagemaker internal endpoint to serve a scikit-model, previously trained (why two endpoints? Check the slides!).
  • training: contains a sequence of scripts taking a program training a regression model and progressively refactoring to follow industry best-practices (i.e. using Metaflow!).

For more info on each of these topics, please see the slides and the sub-sections below; make sure you run Metaflow tutorial first if you are not familiar with Metaflow.

Training scripts

Progression of scripts training the same regression model on synthetica dataset in increasingly better programs, starting from a monolithic implementation and ending with a functionally equivalent DAG-based implementation. In particular:

  • you can run create_fake_dataset.py to generate a X,Y dataset, regression_dataset;
  • monolith.py performs all operation in a long function;
  • composable.py breaks up the monolith in smaller functions, one per core functionality, so that now composable_script acts as a high-level routine explicitely displaying the logical flow of the program;
  • small_flow.py re-factores the functional components of composable.py into steps for a Metaflow DAG, which can be run with the usual MF syntax python small_flow.py run. Please note that imports of non-standard packages now happen at the relevant steps: since MF decouples code from computation, we want to make sure all steps are as self-contained as possible, dependency-wise.
  • small_flow_sagemaker.py is the same as small_flow.py, but with an additional step, deploy_model_to_sagemaker, showing how the learned model can be first stored to S3, then used to spin up a Sagemaker endpoint, that is an internal AWS endpoint hosting automatically for us the model we just created. Serving this model is more complex than what happens in Serverless 101 (see below), so a second Serverless folder hosts the Sagemaker-compatible version of AWS lambda.

Serverless 101

The folder is a self-contained AWS Lambda that can use regression parameters learned with any of the training scripts to serve predictions from the cloud:

  • handler.py contains the business logic, inside the simple_regression function. After converting a query parameter into a new x, we calculate y using the regression equation, reading the relevant parameters from the environment (see below).
  • serverless.yml is a standard Serverless configuration file, which defines the GET endpoint we are asking AWS to create and run for us, and use environment variables to store the beta and intercept learned from training a regression model.

To deploy succeessfully, make sure to have installed Serverless, configured with your AWS credentials. Then:

  • run small_flow.py in the training folder to obtain values for BETA and INTERCEPT (or whatever linear regression you may want to run on your dataset);
  • change BETA and INTERCEPT in serverless.yml with the values just learned;
  • cd into the folder and run: serverless deploy --aws-profile myProfile
  • when deployment / update is completed, the terminal will show the cloud url where our model can be reached.

Serverless Sagemaker

The folder is a self-contained AWS Lambda that can use a model hosted on Sagemaker, such as the one deployed with small_flow_sagemaker.py, to serve prediction from the cloud. Compared to Serverless 101, the handler.py file here is not using environment variables and an explicit equation, but it is simply "passing over" the input received by the client to the internal Sagemaker endpoint hosting the model (get_response_from_sagemaker).

Also in this case you need Serverless installed and configured to be able to deploy the lambda as a cloud endpoint: once small_flow_sagemaker.py is run and the Sagemaker endpoint is live, deploying the lambda itself is done with the usual commands.

Note: Sagemaker endpoints are pretty expensive - if you are not using credits, make sure to delete the endpoint when you are done with your experiments.

Notebooks

This folder contains Python notebooks that illustrate in Python concepts discussed during the lectures. Please note that notebooks are inherently "exploratory" in nature, so they are good for interactivity and speed but they are not always the right tool for rigorous coding.

Note: most of the dependencies are pretty standard, but some of the "exotic" ones are added with inline statements to make the notebook self-contained.

Project

This folder contains two main files:

  • my_flow.py is a Metaflow version of the text classification pipeline we explained in class: while not necessarily exhaustive, it contains many of the features that the final course project should display (e.g. comments, qualitative tests, etc.). The flow ends by explictely storing the artifacts from the model we just trained.
  • my_app.py shows how to build a minimal Flask app serving predictions from the trained model. Note that the app relies on a small HTML page, while our lecture described an endpoint as a purely machine-to-machine communication (that is, outputting a JSON): both are fine for the final project, as long as you understand what the app is doing.

You can run both (my_flow.py first) by creating a separate environment with the provided requirements.txt (make sure your Metaflow setup is correct, of course).

Slides

The folder contains slides discussed during the course: while they provide a guide and a general overview of the concepts, the discussions we have during lectures are very important to put the material in the right context After the first intro part, the NLP and MLSys "curricula" relatively independent. Note that, with time, links and references may become obsolete despite my best intentions!

Playground

This folder contains simple throw-away scripts useful to test specific tools, like for example logging experiments in a remote dashboard, connecting to the cloud, etc. Script-specific info are below.

Comet playground

The file comet_playground.py is a simple adaptation of Comet onboarding script for sklearn: if run correctly, the Comet dashboard should start displaying experiments under the chosen project name.

Make sure to set COMET_API_KEY and MY_PROJECT_NAME as env variables before running the script.

Acknowledgments

Thanks to all outstanding people quoted and linked in the slides: this course is possible only because we truly stand on the shoulders of giants. Thanks also to:

  • Meninder Purewal, for being such a great, patient, witty co-teacher;
  • Patrick John Chia, for debugging sci-kit on Sagemaker and building the related flow;
  • Ciro Greco, for helping with the NLP slides and greatly improving the scholarly references;
  • Federico Bianchi and Tal Linzen, for sharing their wisdom in teaching NLP.

Additional materials

The two main topics - MLSys and NLP - are huge, and we could obviously just scratch the surface. Since it is impossible to provide extensive references here, I just picked 3 great items to start:

Contacts

For questions, feedback, comments, please drop me a message at: jacopo dot tagliabue at nyu.edu.

Owner
Jacopo Tagliabue
I failed the Turing Test once, but that was many friends ago.
Jacopo Tagliabue
This is Assignment1 code for the Web Data Processing System.

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata).

3 Dec 04, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Simple Python library, distributed via binary wheels with few direct dependencies, for easily using wav2vec 2.0 models for speech recognition

Wav2Vec2 STT Python Beta Software Simple Python library, distributed via binary wheels with few direct dependencies, for easily using wav2vec 2.0 mode

David Zurow 22 Dec 29, 2022
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
Repositório da disciplina no semestre 2021-2

Avisos! Nenhum aviso! Compiladores 1 Este é o Git da disciplina Compiladores 1. Aqui ficará o material produzido em sala de aula assim como tarefas, w

6 May 13, 2022
Automatically search Stack Overflow for the command you want to run

stackshell Automatically search Stack Overflow (and other Stack Exchange sites) for the command you want to ru Use the up and down arrows to change be

circuit10 22 Oct 27, 2021
SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。

SimpleChinese2 SimpleChinese2 集成了许多基本的中文NLP功能,使基于 Python 的中文文字处理和信息提取变得简单方便。 声明 本项目是为方便个人工作所创建的,仅有部分代码原创。

Ming 30 Dec 02, 2022
Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

Junghyun 40 Dec 20, 2022
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation This repository contains the implementation of the following paper: Live Speech

OldSix 575 Dec 31, 2022
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021

AAGCN-ACSA EMNLP 2021 Introduction This repository was used in our paper: Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment An

Akuchi 36 Dec 18, 2022
Translates basic English sentences into the Huna language (hoo-NAH)

huna-translator The Huna Language Translates basic English sentences into the Huna language (hoo-NAH). The Huna constructed language was developed in

Miles Smith 0 Jan 20, 2022
A framework for implementing federated learning

This is partly the reproduction of the paper of [Privacy-Preserving Federated Learning in Fog Computing](DOI: 10.1109/JIOT.2020.2987958. 2020)

DavidChen 46 Sep 23, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022
Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision Training Efficiency We show the training efficiency of our DSLP model b

Chenyang Huang 37 Jan 04, 2023
Simple NLP based project without any use of AI

Simple NLP based project without any use of AI

Shripad Rao 1 Apr 26, 2022
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022