Beginner Article Study (1) - Beginner Tutorial

Abstract: Following the learning path laid out in the "Background Knowledge Reference" post, this series records my study of a few articles. These are my notes on the "Beginner Tutorial" article.

文章链接: https://www.datacamp.com/community/tutorials/apache-spark-python

1. 背景知识

1.1 Spark:

General engine for big data processing

Modules: Streaming, SQL, machine learning, graph processing

Pros: speed, ease of use, generality, and the ability to run virtually anywhere

1.2 Content

Programming language; Spark with Python; RDD vs DataFrame API vs Dataset API; Spark DataFrames vs pandas DataFrames; RDD actions and transformations; Cache/Persist RDD, broadcast variables; Intro to Spark practice with DataFrames and the Spark UI; Turn off the logging for PySpark.

1.3 Spark Performance: Scala or Python?

1.3.1 Scala is faster than Python and is recommended for streaming data, though Structured Streaming in Spark already seems to narrow the gap.

1.3.2 With the DataFrame API, the performance differences between Python and Scala are not significant.

- Favor built-in expressions when working with Python, because Python User Defined Functions (UDFs) are less efficient than their Scala equivalents (see the sketch after this list).

- Do not pass data between DataFrames and RDDs unnecessarily: the serialization (object -> bytes) and deserialization (bytes -> object) involved in the transfer are expensive.
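A minimal PySpark sketch of the first point, assuming pyspark is installed locally; the column name, values, and app name are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

# Python UDF: every row is serialized out to a Python worker and back -> slow.
add_one_udf = F.udf(lambda x: x + 1, IntegerType())
df.withColumn("plus_one", add_one_udf("value")).show()

# Built-in expression: stays inside the JVM and is optimized by Catalyst.
df.withColumn("plus_one", F.col("value") + 1).show()
```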

1.3.3 Scala

- Play framework -> clean + performant async code;

- Play is fully asynchronous -> many concurrent connections without dealing with threads -> easier parallel I/O calls to improve performance, plus the use of real-time, streaming, and server-push technologies;

1.3.4 Type Safety

Python: Good for smaller ad hoc experiments - Dynamically typed language

    Each variable name is bound only to an object unless it is null;

    Type checking happens at run time;

    No need to specify types every time;

    e.g., Ruby, Python

Scala: Better for bigger projects - Statically typed language

   Each variable name is bound both to a type and an object;

   Type checking at compile time;

   Easier and hassle-free when refactoring.
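A tiny (non-Spark) sketch of the difference, written in Python since that is the language used throughout these notes; the function and values are made up:

```python
def add(a, b):
    # No type annotations needed: each name is just bound to whatever object is passed in.
    return a + b

print(add(1, 2))      # fine: 3
print(add(1, "two"))  # TypeError, but only discovered when this line actually runs
```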

1.3.5 Advanced Features

Tools for machine learning and NLP - Spark MLlib

2. Spark Installation

The tutorial covers a local installation as well as a way to combine a notebook with a local Spark installation.

It also describes a Notebook + Spark kernel approach and a DockerHub approach; I did not fully understand either of them.

My own installation steps were:

1. Installed the local spark-shell. The prompt shows Scala, so it apparently uses the Scala language. Confusing.

2. Installed the PySpark and Py4J packages into the corresponding Anaconda environment, then ran the code in a Jupyter notebook with the matching kernel. I configured two environments; one of them failed and I could not find the cause, while the other one, with the condEnv prefix, works.
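For reference, a minimal sketch of what running PySpark inside a Jupyter notebook can look like once pyspark and py4j are installed in the active conda environment; the master setting and app name are my own choices, not from the tutorial:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session inside the notebook.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("notebook-test") \
    .getOrCreate()

print(spark.version)   # quick sanity check that the session is alive
spark.range(5).show()  # a tiny DataFrame with the numbers 0..4
```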

3. Spark APIs: RDDs, Datasets, and DataFrames

3.1 RDD

- The building blocks of Spark

- A set of Java or Scala objects representing data

- 3 main characteristics: compile-time type safe + lazy (computed only once, then cached and reused afterwards) + based on the Scala collections API

- Cons: transformation chains easily become inefficient and unreadable; slow with non-JVM languages such as Python and cannot be optimized by Spark
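A small sketch of working with RDDs directly through the SparkContext; the data is invented and a local session is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session

rdd = sc.parallelize(["spark", "python", "scala", "spark"])

# A transformation chain: nothing is computed until an action is called.
pairs = rdd.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

print(counts.collect())  # action, e.g. [('spark', 2), ('python', 1), ('scala', 1)]
```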

3.2 DataFrames

- Enable a higher-level abstraction that allows users to manipulate the data with a query language.

- "Higher-level abstraction": a logical plan that represents data and a schema. It is a data concept that wraps the RDD (Spark already implements this mechanism, so users can use it directly), and on this basis the processing of the RDD can be laid out and visualized.

- Remember! DataFrames are still built on top of RDDs!

- DataFrames can be optimized with

  * Custom memory management (Project Tungsten) - makes the Spark jobs run much faster given the constraints.

  * Optimized execution plans (Catalyst optimizer) - the logical plan of the DataFrame is the first part of this.

- Because Python is dynamically typed, only the untyped DataFrame API is available; the strongly-typed Dataset API is not.
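A minimal sketch of the "RDD wrapped with a schema" idea; the names and values are invented and a local session is assumed:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The same data, first as an RDD of Rows...
rdd = sc.parallelize([Row(name="Ada", age=36), Row(name="Linus", age=52)])

# ...then wrapped into a DataFrame, i.e. an RDD plus a schema and a logical plan.
df = spark.createDataFrame(rdd)
df.printSchema()               # the inferred schema
df.filter(df.age > 40).show()  # query-language-style manipulation
```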

3.3 Datasets

DataFrames lost compile-time type safety, so the code became more prone to errors.

Datasets were introduced to combine the type safety and lambda functions of RDDs with the optimizations offered by DataFrames.

- Dataset API

  * A strongly-typed API

  * An untyped API

  * A DataFrame is a synonym for Dataset[Row] in Scala; Row is a generic untyped JVM object.

    The Dataset is a collection of strongly-typed JVM objects.

- Dataset API: static typing and run-time type safety

3.4 Summary

The higher the level of abstraction over the data, the better the performance and optimization. Higher-level abstractions also push you toward more structured data and easier-to-use APIs.

3.5 When to use?

- DataFrames are advised when working with PySpark, because they are close to the DataFrame structure of the pandas library.

- Use the Dataset API when you want high-level expressions, SQL queries, columnar access, or lambda functions on semi-structured data (the untyped API).

- Use RDDs for low-level transformations and actions on unstructured data; when you don't care about imposing a schema while accessing attributes by name; when you don't need the optimization and performance benefits that DataFrames and Datasets offer for (semi-)structured data; and when you prefer functional programming constructs over domain-specific expressions.

4. Differences between Spark DataFrames and pandas DataFrames

DataFrames are comparable to tables in a relational database.

Spark DataFrames carry specific optimizations under the hood and can use distributed memory to handle big data.

Pandas DataFrames and R data frames can only run on one computer.

Spark DataFrames and pandas DataFrames integrate quite well: df.toPandas() converts a Spark DataFrame into a pandas DataFrame, so a wide range of external libraries and APIs can be used.
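A quick sketch of that handoff, assuming both pyspark and pandas are installed; the data is invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Collect the (small!) distributed DataFrame onto the driver as a pandas DataFrame,
# so the whole pandas/matplotlib/scikit-learn ecosystem becomes available.
pdf = sdf.toPandas()
print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>
print(pdf.head())
```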

5. RDD: Actions and Transformations

5.1 RDDs support two types of operations

- Transformations: create a new dataset from an existing one

e.g. map() - a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Lazy transformations: they just remember the transformations applied to some base dataset; the transformations are only computed when an action requires a result to be returned to the driver program.

- Actions: return a value to the driver program after the computation on the dataset

e.g. reduce() - an action that aggregates all the elements of the RDD and returns the final result to the driver program.

5.2 Advantages

Spark can run more efficiently: a dataset created through a map() operation will be used in a subsequent reduce() operation, and only the result of the last reduce is returned to the driver. That way, the smaller reduced dataset rather than the larger mapped dataset is returned to the user, which is clearly more efficient (see the sketch below).
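A hedged sketch of that map()-then-reduce() pattern; the numbers are invented and a local session is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])

squared = numbers.map(lambda x: x * x)      # transformation: lazy, nothing runs yet
total = squared.reduce(lambda a, b: a + b)  # action: triggers the whole computation

print(total)  # 55 -- only this single value travels back to the driver
```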

6. RDD: Cache/Persist & Variables: Persist/Broadcast

6.1 Cache: By default, each transformed RDD may be recomputed each time you run an action on it. But by persisting an RDD in memory, on disk, or across multiple nodes, Spark will keep the elements around on the cluster for much faster access the next time you query it.

A couple of use cases for caching or persisting RDDs are the use of iterative algorithms and fast interactive RDD use.
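A minimal sketch of caching an RDD that will be queried more than once; the data is invented, and the commented-out persist() call is just an alternative to cache():

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(1_000_000)).map(lambda x: x * 2)

data.cache()                                  # for RDDs this equals persist(StorageLevel.MEMORY_ONLY)
# data.persist(StorageLevel.MEMORY_AND_DISK)  # or pick an explicit storage level instead

print(data.count())  # first action: computes the partitions and caches them
print(data.sum())    # later actions reuse the cached partitions
```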

6.2 Persist or Broadcast Variable:

Entire RDD -> Partitions

When executing a Spark program, each partition gets sent to a worker. Each worker can cache the data if the RDD needs to be re-iterated: the partitions are stored in memory and reused in other actions.

Variables: when you pass a function to a Spark operation, the variables used inside the function are sent to each cluster node.

Broadcast variables: used when redistributing intermediate results of operations, such as trained models or a composed static lookup table. Broadcasting sends immutable state once to each worker instead of shipping a copy of the variable with every task: a cached read-only copy is kept on every machine and used whenever a local copy of the variable is needed.

You can create a broadcast variable with SparkContext.broadcast(variable). This returns a reference to the broadcast variable.
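A small sketch of SparkContext.broadcast() with an invented lookup table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A small, read-only lookup table that every worker needs.
country_codes = {"NL": "Netherlands", "BE": "Belgium", "DE": "Germany"}
bc_codes = sc.broadcast(country_codes)  # shipped once per worker, not once per task

rdd = sc.parallelize(["NL", "DE", "NL"])
full_names = rdd.map(lambda code: bc_codes.value.get(code, "unknown"))

print(full_names.collect())  # ['Netherlands', 'Germany', 'Netherlands']
```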

7. Best Practices in Spark

- Spark DataFrames are optimized and therefore faster than RDDs, especially for structured data.

- Better not to call collect() on large RDDs, as it drags all the data back from the nodes to the application: every RDD element is copied onto the single driver program, which can run out of memory and crash.

- Build efficient transformation chains: filter and reduce data before a join rather than after it.

- Avoid groupByKey() on large RDDs: a lot of unnecessary data is transferred over the network, and if more data is shuffled onto a single machine than fits in memory, it spills to disk, which heavily impacts the performance of your Spark job (see the sketch below).
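For the last point, a hedged sketch comparing groupByKey() with reduceByKey(), which combines values on each partition before anything is shuffled; the data is invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"]).map(lambda w: (w, 1))

# groupByKey() ships every (key, value) pair across the network before counting.
counts_slow = words.groupByKey().mapValues(lambda vals: sum(vals))

# reduceByKey() combines values on each partition first, so far less data is shuffled.
counts_fast = words.reduceByKey(lambda a, b: a + b)

print(sorted(counts_fast.collect()))  # [('a', 3), ('b', 2), ('c', 1)]
```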

I haven't digested all of it yet... running out of time, so I'll go look at the code and finish the assignment first.

Spark Cheat Sheet:

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

Stray thoughts: studying in a café, and the people around me seem to work in media. Sorry, but I just feel that this kind of industry is dispensable, so when they discuss work I can't help finding their child-playing-house earnestness a bit exasperating. Of course, that only shows my own shallowness. There are also some middle-aged guys playing Douyin out loud (facepalm). In any case, Keep Learning really is the key to staying flexible and full of vitality; I'll remind myself again to be like Lao Yu (and my teacher!) and be that kind of person.