This blog post collects the notes I took while learning and using Spark.

  • Spark

    • Spark core
    • Spark SQL
    • Spark Streaming
    • MLlib (machine learning)
    • GraphX (graph processing)
  • Submitting a Spark job:

mvn clean && mvn compile && mvn package
$SPARK_HOME/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.java.WordCount \
  ./target/learning-spark-mini-example-0.0.1.jar \
  ./README.md ./wordcounts
  • RDD:
    Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

As we’ve discussed, RDDs support two types of operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, such as map() and filter(). Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first(). Spark treats transformations and actions very differently, so understanding which type of operation you are performing will be important. If you are ever confused whether a given function is a transformation or an action, you can look at its return type: transformations return RDDs, whereas actions return some other data type.

It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.” To avoid this inefficiency, users can persist intermediate results, as we will cover in “Persistence (Caching)”.
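A minimal sketch of the distinction, assuming a SparkContext sc is in scope and a local README.md exists:

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("README.md")           // transformation: returns an RDD, nothing runs yet
val errors = lines.filter(_.contains("ERROR"))  // transformation: returns a new RDD lazily
errors.persist(StorageLevel.MEMORY_ONLY)        // mark the RDD for caching after its first computation
println(errors.count())                         // action: triggers the computation and caches errors
println(errors.first())                         // action: reuses the cached RDD instead of recomputing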

  • aggregate
    This method can return a result of a different type than the RDD's element type. The following code computes an average.
val result = input.aggregate((0, 0))(
  // seqOp: accumulates results within a partition (similar to fold())
  (acc, value) => (acc._1 + value, acc._2 + 1),
  // combOp: an associative operator that combines results from different partitions
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
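For example, with val input = sc.parallelize(List(1, 2, 3, 4)), result evaluates to (10, 4) and avg to 2.5.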
  • cogroup
    Groups data from two pair RDDs that share the same key type; for each key, the result contains one Iterable of values per source RDD.
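A minimal sketch, assuming a SparkContext sc is in scope (the sample data is made up for illustration):

val scores  = sc.parallelize(List(("alice", 90), ("bob", 80), ("alice", 95)))
val classes = sc.parallelize(List(("alice", "math"), ("carol", "physics")))

// cogroup returns RDD[(String, (Iterable[Int], Iterable[String]))]: every key that
// appears in either RDD shows up once, with the values from each side grouped.
scores.cogroup(classes).collect().foreach(println)
// (alice,(CompactBuffer(90, 95),CompactBuffer(math)))   <- output order may vary
// (bob,(CompactBuffer(80),CompactBuffer()))
// (carol,(CompactBuffer(),CompactBuffer(physics)))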

  • partitionBy

import org.apache.spark.HashPartitioner

val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100))   // Create 100 partitions
                 .persist()

Note that partitionBy() is a transformation, so it always returns a new RDD; it does not change the original RDD in place. RDDs can never be modified once created. It is therefore important to persist and save as userData the result of partitionBy(), not the original sequenceFile(). Also, the 100 passed to partitionBy() represents the number of partitions, which controls how many parallel tasks perform further operations on the RDD (e.g., joins); in general, make this at least as large as the number of cores in your cluster. Without persist(), subsequent RDD actions would evaluate the entire lineage of userData, causing the pairs to be hash-partitioned over and over.
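A sketch of the payoff, following the book's log-processing scenario (UserID, UserInfo, and LinkInfo are placeholder types, and processNewLogs is a hypothetical helper):

// Because userData is hash-partitioned and persisted, Spark knows where each
// UserID lives: joining a small log file against it only shuffles the events
// RDD, while userData stays put on its nodes.
def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)   // RDD[(UserID, (UserInfo, LinkInfo))]
  println("Number of joined records: " + joined.count())
}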

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> pairs.partitioner
res0: Option[spark.Partitioner] = None

scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))
partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14

scala> partitioned.partitioner
res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@5147788d)
  • shared variables: aggregation and broadcasts

By default, functions passed to Spark (such as a map() function) run on remote nodes with their own copies of the driver's variables, and updates to those copies are never propagated back to the driver. Spark's shared variables, accumulators and broadcast variables, relax this restriction for two common types of communication patterns: aggregation of results and broadcasts.
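A minimal sketch of both, assuming a SparkContext sc is in scope (longAccumulator requires Spark 2.x or later):

// Accumulator: tasks can only add to it; the driver reads the total after an action.
val blankLines = sc.longAccumulator("blankLines")
// Broadcast variable: a read-only value shipped to each executor once, not with every task.
val stopWords = sc.broadcast(Set("the", "a", "an"))

val words = sc.textFile("README.md").flatMap { line =>
  if (line.isEmpty) blankLines.add(1)
  line.split(" ")
}
val kept = words.filter(w => !stopWords.value.contains(w.toLowerCase))
println(kept.count())                        // action: actually runs the job
println("Blank lines: " + blankLines.value)  // reliable only after an action has run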

  • mapPartitions and map

Q1. What's the difference between an RDD's map and mapPartitions?
map() applies the supplied function once per element, while mapPartitions() applies it once per partition, passing the function an iterator over that partition's elements.
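A small sketch of the difference, assuming a SparkContext sc is in scope:

val nums = sc.parallelize(1 to 8, 2)   // two partitions: 1-4 and 5-8

// map: the function runs once per element.
val squared = nums.map(x => x * x)

// mapPartitions: the function runs once per partition and receives an iterator,
// so per-partition setup (e.g., opening a database connection) happens once.
val sums = nums.mapPartitions(iter => Iterator(iter.sum))

println(squared.collect().mkString(", "))  // 1, 4, 9, 16, 25, 36, 49, 64
println(sums.collect().mkString(", "))     // 10, 26 (one sum per partition)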

Tip
Spark’s documentation consistently uses the terms driver and executor when describing the processes that execute each Spark application. The terms master and worker are used to describe the centralized and distributed portions of the cluster manager. It’s easy to confuse these terms, so pay close attention. For instance, Hadoop YARN runs a master daemon (called the Resource Manager) and several worker daemons called Node Managers. Spark can run both drivers and executors on the YARN worker nodes.

  • spark configuration precedence order
    The highest priority is given to configurations declared explicitly in the user’s code using the set() function on a SparkConf object. Next are flags passed to spark-submit, then values in the properties file, and finally default values.
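A small sketch of the highest-priority level (the property names are standard Spark settings):

import org.apache.spark.{SparkConf, SparkContext}

// Values set here win over --conf flags passed to spark-submit,
// over conf/spark-defaults.conf, and over Spark's built-in defaults.
val conf = new SparkConf()
conf.set("spark.app.name", "MyApp")
conf.set("spark.master", "local[4]")
val sc = new SparkContext(conf)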

  • coalesce() operator
    Shrinks the number of partitions in an RDD; unlike repartition(), it can avoid a full shuffle when the partition count only decreases.
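A minimal sketch, assuming a SparkContext sc is in scope:

val data = sc.parallelize(1 to 1000, 100)   // 100 partitions
val filtered = data.filter(_ % 100 == 0)    // leaves mostly-empty partitions

// coalesce() merges partitions locally where possible; pass shuffle = true
// (or use repartition()) if you need to increase the partition count instead.
val compact = filtered.coalesce(10)
println(compact.partitions.length)          // 10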

problem and solution

problem 1

a Spark error: Invalid signature file digest for Manifest main attributes

Cause:

When using spark-submit to run a jar, you may encounter this error:
Invalid signature file digest for Manifest main attributes
It occurs when one of the libraries bundled into the jar was signed, and its signature files under META-INF no longer match the repackaged contents.

solution 1-1

One of the jars you extracted into the target jar was signed. The easiest fix is to select "copy to the output directory and link via manifest" in IntelliJ IDEA's "Create JAR from Modules" dialog.

solution 1-2
Solve the error with this command, which deletes the offending signature files from the bundled jars:

zip -d <jar file name>.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
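If the fat jar is built with the maven-shade-plugin, the same files can be excluded at build time instead. A sketch of the relevant filter, with the surrounding pom.xml and plugin version omitted:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <filter>
        <!-- strip signature files from every bundled dependency -->
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>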

Spark operators:

The following transformations act on a pair RDD, here rdd = {(1, 2), (3, 4), (3, 6)}:

Function: reduceByKey(func)
Purpose: Combine values with the same key.
Example: rdd.reduceByKey((x, y) => x + y)
Result: {(1, 2), (3, 10)}

Function: groupByKey()
Purpose: Group values with the same key.
Example: rdd.groupByKey()
Result: {(1, [2]), (3, [4, 6])}

Function: combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
Purpose: Combine values with the same key using a different result type.

Function: mapValues(func)
Purpose: Apply a function to each value of a pair RDD without changing the key.
Example: rdd.mapValues(x => x + 1)
Result: {(1, 3), (3, 5), (3, 7)}

Function: flatMapValues(func)
Purpose: Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key.
Example: rdd.flatMapValues(x => x to 5)
Result: {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)}
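combineByKey is easiest to see with the classic per-key average, as in Learning Spark's example; a sketch assuming a SparkContext sc is in scope:

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))

val sumCount = rdd.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge combiners across partitions

val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
averages.collect().foreach(println)   // (1,2.0) and (3,5.0)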

Related reading:

  • mllib-statistics
  • google-math
  • programming-guide
  • tuning-spark
