This blog post collects the notes I took while learning and using Spark.

  • Spark

    • Spark core
    • Spark SQL
    • Spark Streaming
    • MLlib (machine learning)
    • GraphX (graph processing)
  • Submitting a Spark job:

mvn clean && mvn compile && mvn package
$SPARK_HOME/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.java.WordCount \
  ./target/learning-spark-mini-example-0.0.1.jar \
  ./README.md ./wordcounts
  • RDD:
    Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

As we’ve discussed, RDDs support two types of operations: transformations and actions. Transformations are operations on RDDs that return a new RDD, such as map() and filter(). Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first(). Spark treats transformations and actions very differently, so understanding which type of operation you are performing will be important. If you are ever confused whether a given function is a transformation or an action, you can look at its return type: transformations return RDDs, whereas actions return some other data type.

It is important to note that each time we call a new action, the entire RDD must be computed “from scratch.” To avoid this inefficiency, users can persist intermediate results, as we will cover in “Persistence (Caching)”.
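A minimal sketch of the distinction, assuming a SparkContext sc is in scope and a local README.md exists:

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("README.md")           // transformation: returns an RDD, nothing runs yet
val errors = lines.filter(_.contains("ERROR"))  // transformation: returns a new RDD lazily
errors.persist(StorageLevel.MEMORY_ONLY)        // mark the RDD for caching after its first computation
println(errors.count())                         // action: triggers the computation and caches errors
println(errors.first())                         // action: reuses the cached RDD instead of recomputing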

  • aggregate
    This method can return a result of a different type than the RDD's element type. The following code computes an average.
val result = input.aggregate((0, 0))(
  // seqOp: accumulates results within a partition (similar to fold())
  (acc, value) => (acc._1 + value, acc._2 + 1),
  // combOp: an associative operator that combines results from different partitions
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
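For example, with val input = sc.parallelize(List(1, 2, 3, 4)), result evaluates to (10, 4) and avg to 2.5.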
  • cogroup
    Groups data from two pair RDDs that share the same key type; for each key, the result contains one Iterable of values per source RDD.
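A minimal sketch, assuming a SparkContext sc is in scope (the sample data is made up for illustration):

val scores  = sc.parallelize(List(("alice", 90), ("bob", 80), ("alice", 95)))
val classes = sc.parallelize(List(("alice", "math"), ("carol", "physics")))

// cogroup returns RDD[(String, (Iterable[Int], Iterable[String]))]: every key that
// appears in either RDD shows up once, with the values from each side grouped.
scores.cogroup(classes).collect().foreach(println)
// (alice,(CompactBuffer(90, 95),CompactBuffer(math)))   <- output order may vary
// (bob,(CompactBuffer(80),CompactBuffer()))
// (carol,(CompactBuffer(),CompactBuffer(physics)))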

  • partitionBy

import org.apache.spark.HashPartitioner

val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
                 .partitionBy(new HashPartitioner(100))   // Create 100 partitions
                 .persist()

Note that partitionBy() is a transformation, so it always returns a new RDD; it does not change the original RDD in place. RDDs can never be modified once created. It is therefore important to persist and save as userData the result of partitionBy(), not the original sequenceFile(). Also, the 100 passed to partitionBy() represents the number of partitions, which controls how many parallel tasks perform further operations on the RDD (e.g., joins); in general, make this at least as large as the number of cores in your cluster. Without persist(), subsequent RDD actions would evaluate the entire lineage of userData, causing the pairs to be hash-partitioned over and over.
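A sketch of the payoff, following the book's log-processing scenario (UserID, UserInfo, and LinkInfo are placeholder types, and processNewLogs is a hypothetical helper):

// Because userData is hash-partitioned and persisted, Spark knows where each
// UserID lives: joining a small log file against it only shuffles the events
// RDD, while userData stays put on its nodes.
def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)   // RDD[(UserID, (UserInfo, LinkInfo))]
  println("Number of joined records: " + joined.count())
}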

scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
pairs: spark.RDD[(Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> pairs.partitioner
res0: Option[spark.Partitioner] = None

scala> val partitioned = pairs.partitionBy(new spark.HashPartitioner(2))
partitioned: spark.RDD[(Int, Int)] = ShuffledRDD[1] at partitionBy at <console>:14

scala> partitioned.partitioner
res1: Option[spark.Partitioner] = Some(spark.HashPartitioner@5147788d)
  • shared variables: aggregation and broadcasts

By default, functions passed to Spark (such as a map() function) run on remote nodes with their own copies of the driver's variables, and updates to those copies are never propagated back to the driver. Spark's shared variables, accumulators and broadcast variables, relax this restriction for two common types of communication patterns: aggregation of results and broadcasts.
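A minimal sketch of both, assuming a SparkContext sc is in scope (longAccumulator requires Spark 2.x or later):

// Accumulator: tasks can only add to it; the driver reads the total after an action.
val blankLines = sc.longAccumulator("blankLines")
// Broadcast variable: a read-only value shipped to each executor once, not with every task.
val stopWords = sc.broadcast(Set("the", "a", "an"))

val words = sc.textFile("README.md").flatMap { line =>
  if (line.isEmpty) blankLines.add(1)
  line.split(" ")
}
val kept = words.filter(w => !stopWords.value.contains(w.toLowerCase))
println(kept.count())                        // action: actually runs the job
println("Blank lines: " + blankLines.value)  // reliable only after an action has run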

  • mapPartitions and map

Q1. What's the difference between an RDD's map and mapPartitions?
map() applies the supplied function once per element, while mapPartitions() applies it once per partition, passing the function an iterator over that partition's elements.
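A small sketch of the difference, assuming a SparkContext sc is in scope:

val nums = sc.parallelize(1 to 8, 2)   // two partitions: 1-4 and 5-8

// map: the function runs once per element.
val squared = nums.map(x => x * x)

// mapPartitions: the function runs once per partition and receives an iterator,
// so per-partition setup (e.g., opening a database connection) happens once.
val sums = nums.mapPartitions(iter => Iterator(iter.sum))

println(squared.collect().mkString(", "))  // 1, 4, 9, 16, 25, 36, 49, 64
println(sums.collect().mkString(", "))     // 10, 26 (one sum per partition)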

Tip
Spark’s documentation consistently uses the terms driver and executor when describing the processes that execute each Spark application. The terms master and worker are used to describe the centralized and distributed portions of the cluster manager. It’s easy to confuse these terms, so pay close attention. For instance, Hadoop YARN runs a master daemon (called the Resource Manager) and several worker daemons called Node Managers. Spark can run both drivers and executors on the YARN worker nodes.

  • spark configuration precedence order
    The highest priority is given to configurations declared explicitly in the user’s code using the set() function on a SparkConf object. Next are flags passed to spark-submit, then values in the properties file, and finally default values.
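A small sketch of the highest-priority level (the property names are standard Spark settings):

import org.apache.spark.{SparkConf, SparkContext}

// Values set here win over --conf flags passed to spark-submit,
// over conf/spark-defaults.conf, and over Spark's built-in defaults.
val conf = new SparkConf()
conf.set("spark.app.name", "MyApp")
conf.set("spark.master", "local[4]")
val sc = new SparkContext(conf)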

  • coalesce() operator
    Shrinks the number of partitions in an RDD; unlike repartition(), it can avoid a full shuffle when the partition count only decreases.
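A minimal sketch, assuming a SparkContext sc is in scope:

val data = sc.parallelize(1 to 1000, 100)   // 100 partitions
val filtered = data.filter(_ % 100 == 0)    // leaves mostly-empty partitions

// coalesce() merges partitions locally where possible; pass shuffle = true
// (or use repartition()) if you need to increase the partition count instead.
val compact = filtered.coalesce(10)
println(compact.partitions.length)          // 10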

problem and solution

problem 1

a Spark error: Invalid signature file digest for Manifest main attributes

Cause:

When using spark-submit to run a jar, you may encounter this error:
Invalid signature file digest for Manifest main attributes
It occurs when one of the libraries bundled into the jar was signed, and its signature files under META-INF no longer match the repackaged contents.

solution 1-1

One of the jars you extracted into the target jar was signed. The easiest fix is to select "copy to the output directory and link via manifest" in IntelliJ IDEA's "Create JAR from Modules" dialog.

solution 1-2
Solve the error with this command, which deletes the offending signature files from the bundled jars:

zip -d <jar file name>.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
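If the fat jar is built with the maven-shade-plugin, the same files can be excluded at build time instead. A sketch of the relevant filter, with the surrounding pom.xml and plugin version omitted:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <filter>
        <!-- strip signature files from every bundled dependency -->
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>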

Spark operators:

The following transformations act on a pair RDD, here rdd = {(1, 2), (3, 4), (3, 6)}:

Function: reduceByKey(func)
Purpose: Combine values with the same key.
Example: rdd.reduceByKey((x, y) => x + y)
Result: {(1, 2), (3, 10)}

Function: groupByKey()
Purpose: Group values with the same key.
Example: rdd.groupByKey()
Result: {(1, [2]), (3, [4, 6])}

Function: combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)
Purpose: Combine values with the same key using a different result type.

Function: mapValues(func)
Purpose: Apply a function to each value of a pair RDD without changing the key.
Example: rdd.mapValues(x => x + 1)
Result: {(1, 3), (3, 5), (3, 7)}

Function: flatMapValues(func)
Purpose: Apply a function that returns an iterator to each value of a pair RDD, and for each element returned, produce a key/value entry with the old key.
Example: rdd.flatMapValues(x => x to 5)
Result: {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)}
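combineByKey is easiest to see with the classic per-key average, as in Learning Spark's example; a sketch assuming a SparkContext sc is in scope:

val rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))

val sumCount = rdd.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key in a partition
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value into the combiner
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners: merge combiners across partitions

val averages = sumCount.mapValues { case (sum, count) => sum.toDouble / count }
averages.collect().foreach(println)   // (1,2.0) and (3,5.0)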

Related reading:

  • mllib-statistics
  • google-math
  • programming-guide
  • tuning-spark
