Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

Overview

WeRateDogs Twitter Data from 2015 to 2017

Udacity - Data Analyst Nanodegree - Project 4 - Wrangle and Analyze Data

Table of Contents

  1. Introduction
  2. Project Overview
  3. Requirements
  4. Project Movitivation
  5. Key Files
  6. Results
  7. Licensing, Authors, and Acknowledgements

1. Introduction

Real-world data rarely comes clean. Using Python and its libraries, I gathered data from a variety of sources and in a variety of formats, assessed its quality and tidiness, then cleaned it. This is called data wrangling. I documented my wrangling efforts in a Jupyter Notebook, then showcased them through analyses and visualizations using Python and its libraries.

The dataset that I wrangled (and analyzing and visualizing) was the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent." WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

WRD_twitter_banner

2. Project Overview

Tasks in this project were as follows:

  • Step 1: Gathering data
  • Step 2: Assessing data
  • Step 3: Cleaning data
  • Step 4: Storing data
  • Step 5: Analyzing, and visualizing data
  • Step 6: Reporting
    • My data wrangling efforts
    • My data analyses and visualizations

3. Requirements

This project was created in a Jupyter Notebook made available via Anaconda and written in python.\ The following versions of languages and libraries were used in creating this project:

  • python==2.7.18
  • ipython==7.31.0
  • matplotlib==3.5.1
  • numpy==1.22.0
  • pandas==1.3.5
  • requests==2.27.1
  • scipy==1.7.3
  • seaborn==0.11.2
  • tweepy==4.4.0

4. Project Motivation

The goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

The overall purpose of this Udacity project was to refine our data wrangling skills with secondary importance on delivering multiple polished visualzations and tell a story or solve a problem. In other words, the journey was more important than the destination.

5. Key Files

  • twitter_archive_enhanced.csv
    The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which Udacity used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those tweets with ratings were filtered. The data was extracted programmatically by Udacity, but the data was left messy on purpose. The ratings aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. I had to assess and clean these columns to use them for analysis and visualization.

  • tweet_json.txt
    Resulting data queried using Twitter's API. It was necessary to gather the retweet count and favorite count which were omitted from the basic twitter_archive_enhanced.csv.

  • image-predictions.tsv
    Udacity ran every image in the WeRateDogs Twitter archive was through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

  • wrangle_act.ipynb
    This contains the bulk of the project. This notebook contains all code for gathering, assessing, cleaning, analyzing, and visualizing data.

  • wrangle_report.pdf
    This was a report for documenting the data wrangling process: gather, assess, and clean.

  • act_report.pdf
    Documentation of analysis and insights

  • twitter_archive_master.csv
    Cleaned and merged dataset containing data from the 3 source data sets

6. Results

As said in the project motivation, the data wrangling process itself was more relevant than uncovering insights. At any rate, I was able to answer the following 4 questions:

  1. What is the most retweeted tweet?
    From the data I had from 2015 to 2017, this gem was the most retweeted tweet.
  2. What is the most common rating?
    12/10
  3. What are the most common breeds found by the neural network?
    The top 5, from less to most common, were Pug, Chihuahua, Welsh Corgi, Labrador Retriever, then finally Golden Retriever.
  4. What is the average retweet count for each rating?
    Screen Shot 2022-01-11 at 21 22 39
    I saw a general positive correlation between dog rating and retweet count (i.e. popularity). 13/10 and 14/10 tweets had the most retweets on average. Further details of the results can be seen in the act_report.pdf file.

7. Licensing, Authors, and Acknowledgements

All data provided and sourced by Udacity.

Owner
Keenan Cooper
Keenan Cooper
Hydrogen (or other pure gas phase species) depressurization calculations

HydDown Hydrogen (or other pure gas phase species) depressurization calculations This code is published under an MIT license. Install as simple as: pi

Anders Andreasen 13 Nov 26, 2022
A Python module for clustering creators of social media content into networks

sm_content_clustering A Python module for clustering creators of social media content into networks. Currently supports identifying potential networks

72 Dec 30, 2022
Pizza Orders Data Pipeline Usecase Solved by SQL, Sqoop, HDFS, Hive, Airflow.

PizzaOrders_DataPipeline There is a Tony who is owning a New Pizza shop. He knew that pizza alone was not going to help him get seed funding to expand

Melwin Varghese P 4 Jun 05, 2022
Fit models to your data in Python with Sherpa.

Table of Contents Sherpa License How To Install Sherpa Using Anaconda Using pip Building from source History Release History Sherpa Sherpa is a modeli

134 Jan 07, 2023
Weather Image Recognition - Python weather application using series of data

Weather Image Recognition - Python weather application using series of data

Kushal Shingote 1 Feb 04, 2022
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021
An experimental project I'm undertaking for the sole purpose of increasing my Python knowledge

5ePy is an experimental project I'm undertaking for the sole purpose of increasing my Python knowledge. #Goals Goal: Create a working, albeit lightwei

Hayden Covington 1 Nov 24, 2021
Automatic earthquake catalog building workflow: EQTransformer + Siamese EQTransformer + PickNet + REAL + HypoInverse

Automatic regional-scale earthquake catalog building workflow: EQTransformer + Siamese EQTransforme

Xiao Zhuowei 9 Nov 27, 2022
A Python and R autograding solution

Otter-Grader Otter Grader is a light-weight, modular open-source autograder developed by the Data Science Education Program at UC Berkeley. It is desi

Infrastructure Team 93 Jan 03, 2023
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
Flexible HDF5 saving/loading and other data science tools from the University of Chicago

deepdish Flexible HDF5 saving/loading and other data science tools from the University of Chicago. This repository also host a Deep Learning blog: htt

UChicago - Department of Computer Science 255 Dec 10, 2022
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 03, 2023
Display the behaviour of a realtime program with a scope or logic analyser.

1. A monitor for realtime MicroPython code This library provides a means of examining the behaviour of a running system. It was initially designed to

Peter Hinch 17 Dec 05, 2022
Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

PremiershipPlayerAnalysis Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. No

5 Sep 06, 2021
Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Cedric Zhuang 1.1k Dec 28, 2022
A notebook to analyze Amazon Recommendation Review Dataset.

Amazon Recommendation Review Dataset Analyzer A notebook to analyze Amazon Recommendation Review Dataset. Features Calculates distinct user count, dis

isleki 3 Aug 22, 2022
A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow

ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and

Tsinghua Machine Learning Group 2.2k Dec 28, 2022
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

HoloViz 2.9k Jan 06, 2023
Python package for processing UC module spectral data.

UC Module Python Package How To Install clone repo. cd UC-module pip install . How to Use uc.module.UC(measurment=str, dark=str, reference=str, heade

Nicolai Haaber Junge 1 Oct 20, 2021