bigdata_analyse: Big Data Analysis Project

Overview

wish

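This project analyzes datasets from different industries with a variety of technology stacks, with the following goals:

  • Understand the business analysis metrics used in different domains
  • Strengthen data processing, data analysis, and data visualization skills
  • Gain hands-on experience with big data batch processing and stream processing
  • Gain hands-on experience with data mining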

tip

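  • The main programming languages used in the project are python, sql, and hql
  • .ipynb files can be opened with jupyter notebook; for installation instructions, see jupyter notebook

jupyter notebook is a web-based interactive python editor that can be installed directly with pip. It also supports markdown, which makes it well suited to data analysis and visualization, writing articles, and writing sample code.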

list

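Topic | Processing mode | Tech stack | Dataset download
Analysis of 100 million Taobao user behavior records | Offline processing | cleaning: hive + analysis: hive + visualization: echarts | Aliyun or Baidu Netdisk (extraction code: 5ipq)
Real-time analysis of 10 million Taobao user behavior records | Real-time processing | data source: kafka + real-time analysis: flink + visualization: (es + kibana) | Baidu Netdisk (extraction code: m4mc)
Analysis of 3 million player records from 《野蛮时代》 (Brutal Age) | Offline processing | cleaning: pandas + analysis: mysql + visualization: pyecharts | Baidu Netdisk (extraction code: paq4)
Analysis of 1.3 million Shenzhen Tong card swipe records | Offline processing | cleaning: pandas + analysis: impala + visualization: dbeaver | Baidu Netdisk (extraction code: t561)
Analysis of 100,000 Xiamen job posting records | Offline processing | cleaning: pandas + analysis: hive + visualization: (hue + pyecharts) + prediction: sklearn | Baidu Netdisk (extraction code: 9wx0)
Analysis of 7,000 rental listing records | Offline processing | cleaning: pandas + analysis: sqlite + visualization: matplotlib | Baidu Netdisk (extraction code: 9en3)
Analysis of 6,000 records of failed companies | Offline processing | cleaning: pandas + analysis: pandas + visualization: (jupyter notebook + pyecharts) | Baidu Netdisk (extraction code: xvgm)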
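
Most of the offline entries above follow the same pattern: clean the raw records, load them into a SQL engine, then aggregate with a query. The sketch below illustrates that pattern for a rental-style dataset. It is not code from the project: the column names and sample rows are hypothetical, and plain Python stands in for the pandas cleaning step so the example needs only the standard library.

```python
import sqlite3

# Hypothetical raw rows: (district, monthly_rent, area_m2). Not from the real dataset.
raw_rows = [
    ("A", "3000", "50"),
    ("A", "3000", "50"),   # exact duplicate
    ("B", "", "40"),       # missing rent
    ("B", "4500", "60"),
    ("A", "5000", "100"),
]


def clean(rows):
    """Drop exact duplicates and rows with missing fields, then cast to numbers."""
    seen = set()
    cleaned = []
    for row in rows:
        if row in seen or any(field == "" for field in row):
            continue
        seen.add(row)
        district, rent, area = row
        cleaned.append((district, float(rent), float(area)))
    return cleaned


def average_rent_per_m2(rows):
    """Load cleaned rows into an in-memory SQLite table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE rentals (district TEXT, rent REAL, area REAL)")
        conn.executemany("INSERT INTO rentals VALUES (?, ?, ?)", rows)
        query = (
            "SELECT district, ROUND(SUM(rent) / SUM(area), 2) "
            "FROM rentals GROUP BY district ORDER BY district"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


cleaned_rows = clean(raw_rows)
result = average_rent_per_m2(cleaned_rows)
print(result)
```

Keeping the cleaning step separate from the SQL step mirrors the "cleaning + analysis" split in the tech stack column, so either half could be swapped (for example, pandas for cleaning or hive for analysis) without touching the other.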

refer

  1. https://tianchi.aliyun.com/dataset/
  2. https://opendata.sz.gov.cn/data/api/toApiDetails/29200_00403601
  3. https://www.kesci.com/home/dataset

This is an example of how to automate Ridit Analysis for a dataset with a large number of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab at MIT. Website: https://sdv.dev Documentation: https://sdv.dev/SDV

The Synthetic Data Vault Project 1.2k Jan 07, 2023
Python package for analyzing sensor-collected human motion data

Simon Ho 71 Nov 05, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
PyNHD is a part of HyRiver software stack that is designed to aid in watershed analysis through web services.

A part of HyRiver software stack that provides access to NHD+ V2 data through NLDI and WaterData web services

Taher Chegini 23 Dec 14, 2022
In this tutorial, raster models of soil depth and soil water holding capacity for the United States will be sampled at random geographic coordinates within the state of Colorado.

Raster_Sampling_Demo (Resulting graph of this demo) Background Sampling values of a raster at specific geographic coordinates can be done with a numbe

2 Dec 13, 2022
Data-sets from the survey and analysis

bachelor-thesis "Umfragewerte.xlsx" contains the original survey results. "umfrage_alle.csv" contains the survey results but one participant is cancele

1 Jan 26, 2022
Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

Art Tucker 1 Sep 05, 2021
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delayed by one day; the data comes from the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
PyChemia, Python Framework for Materials Discovery and Design

PyChemia, Python Framework for Materials Discovery and Design PyChemia is an open-source Python Library for materials structural search. The purpose o

Materials Discovery Group 61 Oct 02, 2022
Employee Turnover Analysis

Employee Turnover Analysis Submission to the DataCamp competition "Can you help reduce employee turnover?"

Jannik Wiedenhaupt 1 Feb 13, 2022
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
PyIOmica (pyiomica) is a Python package for omics analyses.

PyIOmica (pyiomica) This repository contains PyIOmica, a Python package that provides bioinformatics utilities for analyzing (dynamic) omics datasets.

G. Mias Lab 13 Jun 29, 2022
Python ELT Studio, an application for building ELT (and ETL) data flows.

The Python Extract, Load, Transform Studio is an application for performing ELT (and ETL) tasks. Under the hood the application consists of two parts.

Schlerp 55 Nov 18, 2022
Pipetools enables function composition similar to using Unix pipes.

Pipetools Complete documentation pipetools enables function composition similar to using Unix pipes. It allows forward-composition and piping of arbit

186 Dec 29, 2022
This repo contains a simple but effective tool, written in Python, for statistical quality control.

This repo contains a powerful tool made using python which is used to visualize, analyse and finally assess the quality of the product depending upon the given observations

SasiVatsal 8 Oct 18, 2022
PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)

PandaPy "I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to

Derek Snow 527 Jan 02, 2023
DaCe is a parallel programming framework that takes code in Python/NumPy and other programming languages

DaCe - Data-Centric Parallel Programming Decoupling domain science from performance optimization. DaCe is a parallel programming framework that takes c

SPCL 330 Dec 30, 2022