bigdata_analyse: Big Data Analysis Project

Last update: Dec 30, 2022


wish

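By analyzing datasets from different industries with different technology stacks, this project aims to:

Understand the business analysis metrics used in different domains
Strengthen data processing, data analysis, and data visualization skills
Gain hands-on experience with big data batch processing and stream processing
Gain hands-on experience with data mining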

tip

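The main programming languages used in this project are Python, SQL, and HQL.
.ipynb files can be opened with jupyter notebook; for installation instructions, see the jupyter notebook documentation.

jupyter notebook is a web-based interactive Python editor. It can be installed directly with pip and also supports markdown, which makes it well suited to data analysis and visualization, as well as writing articles and example code.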
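Because an .ipynb file is a JSON document, its contents can also be inspected without starting jupyter notebook. The following is a minimal sketch using only the Python standard library; the helper name `summarize_notebook` and the tiny in-memory notebook are hypothetical and are not part of this project.

```python
import json


def summarize_notebook(nb: dict) -> dict:
    """Count the cells of a parsed .ipynb document by cell type."""
    counts = {}
    for cell in nb.get("cells", []):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# A hand-written notebook used only for illustration (hypothetical content).
example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Round-trip through JSON text to mimic reading a file from disk.
notebook_text = json.dumps(example_notebook)
summary = summarize_notebook(json.loads(notebook_text))
print(summary)
```

For a real file, the same helper could be applied to the result of `json.load` on the opened .ipynb file.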

list

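Topic	Processing mode	Tech stack	Dataset download
Analysis of 100 million Taobao user behavior records	Offline (batch)	Cleaning: hive + analysis: hive + visualization: echarts	Alibaba Cloud or Baidu Netdisk, extraction code: 5ipq
Real-time analysis of 10 million Taobao user behavior records	Real-time (streaming)	Data source: kafka + real-time analysis: flink + visualization: (es + kibana)	Baidu Netdisk, extraction code: m4mc
Analysis of 3 million player records from 《野蛮时代》 (Brutal Age)	Offline (batch)	Cleaning: pandas + analysis: mysql + visualization: pyecharts	Baidu Netdisk, extraction code: paq4
Analysis of 1.3 million Shenzhen Tong card swipe records	Offline (batch)	Cleaning: pandas + analysis: impala + visualization: dbeaver	Baidu Netdisk, extraction code: t561
Analysis of 100,000 Xiamen job posting records	Offline (batch)	Cleaning: pandas + analysis: hive + visualization: (hue + pyecharts) + prediction: sklearn	Baidu Netdisk, extraction code: 9wx0
Analysis of 7,000 rental listing records	Offline (batch)	Cleaning: pandas + analysis: sqlite + visualization: matplotlib	Baidu Netdisk, extraction code: 9en3
Analysis of 6,000 defunct company records	Offline (batch)	Cleaning: pandas + analysis: pandas + visualization: (jupyter notebook + pyecharts)	Baidu Netdisk, extraction code: xvgm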
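Most of the offline projects above follow the same shape: clean the raw records, load them into a SQL engine, then aggregate for visualization. The following is a minimal sketch of that flow for a sqlite-based stack. It is not the project's own code: the rows, column names, and the `clean` helper are hypothetical, and the cleaning step uses plain Python in place of pandas so that the example runs with the standard library alone.

```python
import sqlite3

# Hypothetical raw listings: (district, monthly_rent, area_m2).
raw_rows = [
    ("East", 6500, 60),
    ("West", 7200, 55),
    ("East", None, 45),  # missing rent, dropped during cleaning
    ("West", 6800, 50),
    ("West", 6800, 50),  # exact duplicate, dropped during cleaning
]


def clean(rows):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row) or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    return cleaned


# Load the cleaned rows into an in-memory sqlite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (district TEXT, rent INTEGER, area INTEGER)")
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", clean(raw_rows))

# Analysis step: average rent per district.
avg_rent = dict(
    conn.execute("SELECT district, AVG(rent) FROM listings GROUP BY district")
)
conn.close()
print(avg_rent)
```

The resulting dictionary is the kind of small aggregate that would then be passed to a plotting library such as matplotlib.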

refer

https://tianchi.aliyun.com/dataset/

https://opendata.sz.gov.cn/data/api/toApiDetails/29200_00403601

https://www.kesci.com/home/dataset

