pyspark🍒🥭 is delicious,just eat it!😋😋

Overview

如何用10天吃掉pyspark? 🔥 🔥

《10天吃掉那只pyspark》

《20天吃掉那只Pytorch》

《30天吃掉那只TensorFlow2》

一,pyspark 🍎 or spark-scala 🔥 ?

pyspark强于分析,spark-scala强于工程。

如果应用场景有非常高的性能需求,应该选择spark-scala.

如果应用场景有非常多的可视化和机器学习算法需求,推荐使用pyspark,可以更好地和python中的相关库配合使用。

此外spark-scala支持spark graphx图计算模块,而pyspark是不支持的。


pyspark学习曲线平缓,spark-scala学习曲线陡峭。

从学习成本来说,spark-scala学习曲线陡峭,不仅因为scala是一门困难的语言,更加因为在前方的道路上会有无尽的环境配置痛苦等待着读者。

而pyspark学习成本相对较低,环境配置相对容易。从学习成本来说,如果说pyspark的学习成本是3,那么spark-scala的学习成本大概是9。

如果读者有较强的学习能力和充分的学习时间,建议选择spark-scala,能够解锁spark的全部技能,并获得最优性能,这也是工业界最普遍使用spark的方式。

如果读者学习时间有限,并对Python情有独钟,建议选择pyspark。pyspark在工业界的使用目前也越来越普遍。


二,本书 📚 面向读者 🤗

本书假定读者具有基础的的Python编码能力,熟悉Python中numpy, pandas库的基本用法。

并且假定读者具有一定的SQL使用经验,熟悉select,join,group by等sql语法。

对于Python基础不是非常扎实的读者,可以参考《3小时Python入门》文章。

《3小时Python入门》

对于numpy和Pandas不甚了解的读者,可以参考 《3小时入门numpy,pandas,matplotlib》文章。

《3小时入门numpy,pandas,matplotlib》


三,本书写作风格 🍉

本书是一本对人类用户极其友善的pyspark入门工具书,Don't let me think是本书的最高追求。

本书主要是在参考spark官方文档,并结合作者学习使用经验基础上整理总结写成的。

不同于Spark官方文档的繁冗断码,本书在篇章结构和范例选取上做了大量的优化,在用户友好度方面更胜一筹。

本书按照内容难易程度、读者检索习惯和spark自身的层次结构设计内容,循序渐进,层次清晰,方便按照功能查找相应范例。

本书在范例设计上尽可能简约化和结构化,增强范例易读性和通用性,大部分代码片段在实践中可即取即用。

如果说通过学习spark官方文档掌握pyspark的难度大概是5,那么通过本书学习掌握pyspark的难度应该大概是2.

仅以下图对比spark官方文档与本书《10天吃掉那只pyspark》的差异。


四,本书学习方案

1,学习计划

本书是作者利用工作之余大概1个月写成的,大部分读者应该在10天可以完全学会。

预计每天花费的学习时间在30分钟到2个小时之间。

当然,本书也非常适合作为pyspark的工具手册在工程落地时作为范例库参考。

点击学习内容蓝色标题即可进入该章节。

日期 学习内容 内容难度 预计学习时间 更新状态
  一、基础篇      
day1 1-1,快速搭建你的Spark开发环境 ⭐️ ⭐️ 1hour
day2 1-2,1小时看懂Spark的基本原理 ⭐️ ⭐️ ⭐️ 1hour
  二、核心篇      
day3 2-1,2小时入门Spark之RDD编程 ⭐️ ⭐️ ⭐️ 2hour
day4 2-2,7道RDD编程练习题 ⭐️ ⭐️ ⭐️ 1hour
day5 2-3,2小时入门SparkSQL编程 ⭐️ ⭐️ ⭐️ 2hour
day6 2-4,7道SparkSQL编程练习题 ⭐️ ⭐️ ⭐️ 1hour
  三、进阶篇      
day7 3-1,Spark性能调优方法 ⭐️ ⭐️ ⭐️ ⭐️ ⭐️ 2hour
day8 3-2,RDD和SparkSQL综合应用 ⭐️ ⭐️ ⭐️ ⭐️ ⭐️ 2hour
  四、拓展篇      
day9 4-1,探索MLlib机器学习 ⭐️ ⭐️ ⭐️ ⭐️ 2hour
day10 4-2,初识StructuredStreaming ⭐️ ⭐️ ⭐️ ⭐️ 2hour

2,学习环境

本书全部源码在jupyter中编写测试通过,建议通过git克隆到本地,并在jupyter中交互式运行学习。

为了直接能够在jupyter中打开markdown文件,建议安装jupytext,将markdown转换成ipynb文件。

为简单起见,本书按照如下2个步骤配置单机版spark3.0.1环境进行练习。

step1: 安装java8

jdk下载地址:https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

java安装教程:https://www.runoob.com/java/java-environment-setup.html

step2: 安装pyspark,findspark

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark

pip install findspark

此外,也可以在kesci云端notebook中直接运行pyspark

https://www.kesci.com/home/project

import findspark

#指定spark_home,指定python路径
spark_home = "/Users/liangyun/anaconda3/lib/python3.7/site-packages/pyspark"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home,python_path)

import pyspark 
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

print("spark version:",pyspark.__version__)
rdd = sc.parallelize(["hello","spark"])
print(rdd.reduce(lambda x,y:x+' '+y))
spark version: 3.0.1
hello spark

除了以上方法外,也可以参考1-1节中介绍的其它方法。

1-1,快速搭建你的Spark开发环境


五,鼓励和联系作者

如果本书对你有所帮助,想鼓励一下作者,记得给本项目加一颗星星star ⭐️ ,并分享给你的朋友们喔 😊 !

如果对本书内容理解上有需要进一步和作者交流的地方,欢迎在公众号"算法美食屋"下留言。作者时间和精力有限,会酌情予以回复。

也可以在公众号后台回复关键字:spark加群,加入spark和大数据读者交流群和大家讨论。

image.png


Owner
lyhue1991
dream-->design-->deliever😋😋
lyhue1991
An architecture that makes any doodle realistic, in any specified style, using VQGAN, CLIP and some basic embedding arithmetics.

Sketch Simulator An architecture that makes any doodle realistic, in any specified style, using VQGAN, CLIP and some basic embedding arithmetics. See

12 Dec 18, 2022
This repo provides the official code for TransBTS: Multimodal Brain Tumor Segmentation Using Transformer (https://arxiv.org/pdf/2103.04430.pdf).

TransBTS: Multimodal Brain Tumor Segmentation Using Transformer This repo is the official implementation for TransBTS: Multimodal Brain Tumor Segmenta

Raymond 247 Dec 28, 2022
[IJCAI-2021] A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation"

DataFree A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation" Authors: Gongfa

ZJU-VIPA 47 Jan 09, 2023
Data & Code for ACCENTOR Adding Chit-Chat to Enhance Task-Oriented Dialogues

ACCENTOR: Adding Chit-Chat to Enhance Task-Oriented Dialogues Overview ACCENTOR consists of the human-annotated chit-chat additions to the 23.8K dialo

Facebook Research 69 Dec 29, 2022
NNR conformation conditional and global probabilities estimation and analysis in peptides or proteins fragments

NNR and global probabilities estimation and analysis in peptides or protein fragments This module calculates global and NNR conformation dependent pro

0 Jul 15, 2021
The Official Repository for "Generalized OOD Detection: A Survey"

Generalized Out-of-Distribution Detection: A Survey 1. Overview This repository is with our survey paper: Title: Generalized Out-of-Distribution Detec

Jingkang Yang 338 Jan 03, 2023
The code for paper "Learning Implicit Fields for Generative Shape Modeling".

implicit-decoder The tensorflow code for paper "Learning Implicit Fields for Generative Shape Modeling", Zhiqin Chen, Hao (Richard) Zhang. Project pag

Zhiqin Chen 353 Dec 30, 2022
FlowTorch is a PyTorch library for learning and sampling from complex probability distributions using a class of methods called Normalizing Flows

FlowTorch is a PyTorch library for learning and sampling from complex probability distributions using a class of methods called Normalizing Flows.

Meta Incubator 272 Jan 02, 2023
USAD - UnSupervised Anomaly Detection on multivariate time series

USAD - UnSupervised Anomaly Detection on multivariate time series Scripts and utility programs for implementing the USAD architecture. Implementation

116 Jan 04, 2023
quantize aware training package for NCNN on pytorch

ncnnqat ncnnqat is a quantize aware training package for NCNN on pytorch. Table of Contents ncnnqat Table of Contents Installation Usage Code Examples

62 Nov 23, 2022
GMFlow: Learning Optical Flow via Global Matching

GMFlow GMFlow: Learning Optical Flow via Global Matching Authors: Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Dacheng Tao We streamline the

Haofei Xu 298 Jan 04, 2023
Implementation of ICCV2021(Oral) paper - VMNet: Voxel-Mesh Network for Geodesic-aware 3D Semantic Segmentation

VMNet: Voxel-Mesh Network for Geodesic-Aware 3D Semantic Segmentation Created by Zeyu HU Introduction This work is based on our paper VMNet: Voxel-Mes

HU Zeyu 82 Dec 27, 2022
Continuous Diffusion Graph Neural Network

We present Graph Neural Diffusion (GRAND) that approaches deep learning on graphs as a continuous diffusion process and treats Graph Neural Networks (GNNs) as discretisations of an underlying PDE.

Twitter Research 227 Jan 05, 2023
A crossplatform menu bar application using mpv as DLNA Media Renderer.

Macast Chinese README A menu bar application using mpv as DLNA Media Renderer. Install MacOS || Windows || Debian Download link: Macast release latest

4.4k Jan 01, 2023
Code release for NeurIPS 2020 paper "Co-Tuning for Transfer Learning"

CoTuning Official implementation for NeurIPS 2020 paper Co-Tuning for Transfer Learning. [News] 2021/01/13 The COCO 70 dataset used in the paper is av

THUML @ Tsinghua University 35 Sep 23, 2022
using yolox+deepsort for object-tracker

YOLOX_deepsort_tracker yolox+deepsort实现目标跟踪 最新的yolox尝尝鲜~~(yolox正处在频繁更新阶段,因此直接链接yolox仓库作为子模块) Install Clone the repository recursively: git clone --rec

245 Dec 26, 2022
A PyTorch Implementation of "Neural Arithmetic Logic Units"

Neural Arithmetic Logic Units [WIP] This is a PyTorch implementation of Neural Arithmetic Logic Units by Andrew Trask, Felix Hill, Scott Reed, Jack Ra

Kevin Zakka 181 Nov 18, 2022
The official repository for paper ''Domain Generalization for Vision-based Driving Trajectory Generation'' submitted to ICRA 2022

DG-TrajGen The official repository for paper ''Domain Generalization for Vision-based Driving Trajectory Generation'' submitted to ICRA 2022. Our Meth

Wang 25 Sep 26, 2022
Code for "Layered Neural Rendering for Retiming People in Video."

Layered Neural Rendering in PyTorch This repository contains training code for the examples in the SIGGRAPH Asia 2020 paper "Layered Neural Rendering

Google 154 Dec 16, 2022
[ICCV'2021] Image Inpainting via Conditional Texture and Structure Dual Generation

[ICCV'2021] Image Inpainting via Conditional Texture and Structure Dual Generation

Xiefan Guo 122 Dec 11, 2022