DAT4 - General Assembly's Data Science course in Washington, DC

Related tags

Deep LearningDAT4
Overview

DAT4 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (12/15/14 - 3/16/15).

Instructors: Sinan Ozdemir and Kevin Markham (Data School blog, email newsletter, YouTube channel)

Teaching Assistant: Brandon Burroughs

Office hours: 1-3pm on Saturday and Sunday (Starbucks at 15th & K), 5:15-6:30pm on Monday (GA)

Course Project information

Monday Wednesday
12/15: Introduction 12/17: Python
12/22: Getting Data 12/24: No Class
12/29: No Class 12/31: No Class
1/5: Git and GitHub 1/7: Pandas
Milestone: Question and Data Set
1/12: Numpy, Machine Learning, KNN 1/14: scikit-learn, Model Evaluation Procedures
1/19: No Class 1/21: Linear Regression
1/26: Logistic Regression,
Preview of Other Models
1/28: Model Evaluation Metrics
Milestone: Data Exploration and Analysis Plan
2/2: Working a Data Problem 2/4: Clustering and Visualization
Milestone: Deadline for Topic Changes
2/9: Naive Bayes 2/11: Natural Language Processing
2/16: No Class 2/18: Decision Trees
Milestone: First Draft
2/23: Ensembling 2/25: Databases and MapReduce
3/2: Recommenders 3/4: Advanced scikit-learn
Milestone: Second Draft (Optional)
3/9: Course Review 3/11: Project Presentations
3/16: Project Presentations

Installation and Setup

  • Install the Anaconda distribution of Python 2.7x.
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "DAT4 team" and add your photo!

Class 1: Introduction

  • Introduction to General Assembly
  • Course overview: our philosophy and expectations (slides)
  • Data science overview (slides)
  • Tools: check for proper setup of Anaconda, overview of Slack

Homework:

  • Resolve any installation issues before next class.

Optional:

Class 2: Python

Homework:

Optional:

Resources:

Class 3: Getting Data

Homework:

  • Think about your project question, and start looking for data that will help you to answer your question.
  • Prepare for our next class on Git and GitHub:
    • You'll need to know some command line basics, so please work through GA's excellent command line tutorial and then take this brief quiz.
    • Check for proper setup of Git by running git clone https://github.com/justmarkham/DAT-project-examples.git. If that doesn't work, you probably need to install Git.
    • Create a GitHub account. (You don't need to download anything from GitHub.)

Optional:

  • If you aren't feeling comfortable with the Python we've done so far, keep practicing using the resources above!

Resources:

Class 4: Git and GitHub

  • Special guest: Nick DePrey presenting his class project from DAT2
  • Git and GitHub (slides)

Homework:

  • Project milestone: Submit your question and data set to your folder in DAT4-students before class on Wednesday! (This is a great opportunity to practice writing Markdown and creating a pull request.)

Optional:

  • Clone this repo (DAT4) for easy access to the course files.

Resources:

Class 5: Pandas

Homework:

Optional:

Resources:

  • For more on Pandas plotting, read the visualization page from the official Pandas documentation.
  • To learn how to customize your plots further, browse through this notebook on matplotlib.
  • To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and Columbia's Data Mining class has an excellent slide deck.

Class 6: Numpy, Machine Learning, KNN

  • Numpy (code)
  • "Human learning" with iris data (code, solution)
  • Machine Learning and K-Nearest Neighbors (slides)

Homework:

  • Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Wednesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
    • In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
    • In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
    • In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
    • How does the choice of K affect model bias? How about variance?
    • As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
    • Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
    • Does a high value for K cause over-fitting or under-fitting?

Resources:

Class 7: scikit-learn, Model Evaluation Procedures

Homework:

Optional:

  • Practice what we learned in class today!
    • If you have gathered your project data already: Try using KNN for classification, and then evaluate your model. Don't worry about using all of your features, just focus on getting the end-to-end process working in scikit-learn. (Even if your project is regression instead of classification, you can easily convert a regression problem into a classification problem by converting numerical ranges into categories.)
    • If you don't yet have your project data: Pick a suitable dataset from the UCI Machine Learning Repository, try using KNN for classification, and evaluate your model. The Glass Identification Data Set is a good one to start with.
    • Either way, you can submit your commented code to DAT4-students, and we'll give you feedback.

Resources:

Class 8: Linear Regression

Homework:

Optional:

  • Similar to last class, your optional exercise is to practice what we have been learning in class, either on your project data or on another dataset.

Resources:

Class 9: Logistic Regression, Preview of Other Models

Resources:

Class 10: Model Evaluation Metrics

  • Finishing model evaluation procedures (slides, code)
    • Review of test set approach
    • Cross-validation
  • Model evaluation metrics (slides)
    • Regression:
      • Root Mean Squared Error (code)
    • Classification:

Homework:

Optional:

Resources:

Class 11: Working a Data Problem

  • Today we will work on a real world data problem! Our data is stock data over 7 months of a fictional company ZYX including twitter sentiment, volume and stock price. Our goal is to create a predictive model that predicts forward returns.

  • Project overview (slides)

    • Be sure to read documentation thoroughly and ask questions! We may not have included all of the information you need...

Class 12: Clustering and Visualization

  • The slides today will focus on our first look at unsupervised learning, K-Means Clustering!
  • The code for today focuses on two main examples:
    • We will investigate simple clustering using the iris data set.
    • We will take a look at a harder example, using Pandora songs as data. See data.

Homework:

  • Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Monday. Here are some questions to think about while you read:
    • Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
    • Before he tried the "statistical approach" to spam filtering, what was his approach?
    • How exactly does his statistical filtering system work?
    • What did Paul say were some of the benefits of the statistical approach?
    • How good was his prediction of the "spam of the future"?
  • Below are the foundational topics upon which Monday's class will depend. Please review these materials before class:
    • Confusion matrix: Kevin's guide roughly mirrors the lecture from class 10.
    • Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
    • Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
  • You should definitely be working on your project! Your rough draft is due in two weeks!

Resources:

Class 13: Naive Bayes

Resources:

Homework:

  • Download all of the NLTK collections.
    • In Python, use the following commands to bring up the download menu.
    • import nltk
    • nltk.download()
    • Choose "all".
    • Alternatively, just type nltk.download('all')
  • Install two new packages: textblob and lda.
    • Open a terminal or command prompt.
    • Type pip install textblob and pip install lda.

Class 14: Natural Language Processing

  • Overview of Natural Language Processing (slides)
  • Real World Examples
  • Natural Language Processing (code)
  • NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, LDA, document summarization
  • Alternative: TextBlob

Resources:

Class 15: Decision Trees

Homework:

  • By next Wednesday (before class), review the project drafts of your two assigned peers according to these guidelines. You should upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT4-students. If your last name is Smith and you are reviewing Jones, you should name your file smith_reviews_jones.md.

Resources:

Installing Graphviz (optional):

  • Mac:
  • Windows:
    • Download and install MSI file
    • Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin

Class 16: Ensembling

Resources:

Class 17: Databases and MapReduce

Resources:

Class 18: Recommenders

  • Recommendation Engines slides
  • Recommendation Engine Example code

Resources:

Class 19: Advanced scikit-learn

Homework:

Resources:

Class 20: Course Review

Resources:

Class 21: Project Presentations

Class 22: Project Presentations

Owner
Kevin Markham
Founder of Data School
Kevin Markham
Python script to download the celebA-HQ dataset from google drive

download-celebA-HQ Python script to download and create the celebA-HQ dataset. WARNING from the author. I believe this script is broken since a few mo

133 Dec 21, 2022
VQGAN+CLIP Colab Notebook with user-friendly interface.

VQGAN+CLIP and other image generation system VQGAN+CLIP Colab Notebook with user-friendly interface. Latest Notebook: Mse regulized zquantize Notebook

Justin John 227 Jan 05, 2023
[CVPR 2021] Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

TorchSemiSeg [CVPR 2021] Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision by Xiaokang Chen1, Yuhui Yuan2, Gang Zeng1, Jingdong Wang

Chen XiaoKang 387 Jan 08, 2023
SIEM Logstash parsing for more than hundred technologies

LogIndexer Pipeline Logstash Parsing Configurations for Elastisearch SIEM and OpenDistro for Elasticsearch SIEM Why this project exists The overhead o

146 Dec 29, 2022
Pytorch implementation for reproducing StackGAN_v2 results in the paper StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

StackGAN-v2 StackGAN-v1: Tensorflow implementation StackGAN-v1: Pytorch implementation Inception score evaluation Pytorch implementation for reproduci

Han Zhang 809 Dec 16, 2022
Official project repository for 'Normality-Calibrated Autoencoder for Unsupervised Anomaly Detection on Data Contamination'

NCAE_UAD Official project repository of 'Normality-Calibrated Autoencoder for Unsupervised Anomaly Detection on Data Contamination' Abstract In this p

Jongmin Andrew Yu 2 Feb 10, 2022
3D dataset of humans Manipulating Objects in-the-Wild (MOW)

MOW dataset [Website] This repository maintains our 3D dataset of humans Manipulating Objects in-the-Wild (MOW). The dataset contains 512 images in th

Zhe Cao 28 Nov 06, 2022
Official code for "EagerMOT: 3D Multi-Object Tracking via Sensor Fusion" [ICRA 2021]

EagerMOT: 3D Multi-Object Tracking via Sensor Fusion Read our ICRA 2021 paper here. Check out the 3 minute video for the quick intro or the full prese

Aleksandr Kim 276 Dec 30, 2022
This is an implementation of PIFuhd based on Pytorch

Open-PIFuhd This is a unofficial implementation of PIFuhd PIFuHD: Multi-Level Pixel-Aligned Implicit Function forHigh-Resolution 3D Human Digitization

Lingteng Qiu 235 Dec 19, 2022
Object detection GUI based on PaddleDetection

PP-Tracking GUI界面测试版 本项目是基于飞桨开源的实时跟踪系统PP-Tracking开发的可视化界面 在PaddlePaddle中加入pyqt进行GUI页面研发,可使得整个训练过程可视化,并通过GUI界面进行调参,模型预测,视频输出等,通过多种类型的识别,简化整体预测流程。 GUI界面

杨毓栋 68 Jan 02, 2023
공공장소에서 눈만 돌리면 CCTV가 보인다는 말이 과언이 아닐 정도로 CCTV가 우리 생활에 깊숙이 자리 잡았습니다.

ObsCare_Main 소개 공공장소에서 눈만 돌리면 CCTV가 보인다는 말이 과언이 아닐 정도로 CCTV가 우리 생활에 깊숙이 자리 잡았습니다. CCTV의 대수가 급격히 늘어나면서 관리와 효율성 문제와 더불어, 곳곳에 설치된 CCTV를 개별 관제하는 것으로는 응급 상

5 Jul 07, 2022
The undersampled DWI image using Slice-Interleaved Diffusion Encoding (SIDE) method can be reconstructed by the UNet network.

UNet-SIDE The undersampled DWI image using Slice-Interleaved Diffusion Encoding (SIDE) method can be reconstructed by the UNet network. For Super Reso

TIANTIAN XU 1 Jan 13, 2022
🕺Full body detection and tracking

Pose-Detection 🤔 Overview Human pose estimation from video plays a critical role in various applications such as quantifying physical exercises, sign

Abbas Ataei 20 Nov 21, 2022
Ranking Models in Unlabeled New Environments (iccv21)

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

14 Dec 17, 2021
Learning Compatible Embeddings, ICCV 2021

LCE Learning Compatible Embeddings, ICCV 2021 by Qiang Meng, Chixiang Zhang, Xiaoqiang Xu and Feng Zhou Paper: Arxiv We cannot release source codes pu

Qiang Meng 25 Dec 17, 2022
PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning

PClean: A Domain-Specific Probabilistic Programming Language for Bayesian Data Cleaning Warning: This is a rapidly evolving research prototype.

MIT Probabilistic Computing Project 190 Dec 27, 2022
Stochastic gradient descent with model building

Stochastic Model Building (SMB) This repository includes a new fast and robust stochastic optimization algorithm for training deep learning models. Th

S. Ilker Birbil 22 Jan 19, 2022
Efficient Multi Collection Style Transfer Using GAN

Proposed a new model that can make style transfer from single style image, and allow to transfer into multiple different styles in a single model.

Zhaozheng Shen 2 Jan 15, 2022
Towhee is a flexible machine learning framework currently focused on computing deep learning embeddings over unstructured data.

Towhee is a flexible machine learning framework currently focused on computing deep learning embeddings over unstructured data.

1.7k Jan 08, 2023
[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".

SMYRF: Efficient attention using asymmetric clustering Get started: Abstract We propose a novel type of balanced clustering algorithm to approximate a

Giannis Daras 46 Dec 22, 2022