Spark Case Study (Based on Real Requirements from an E-commerce Site)

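Data description
Requirement 1: Top 10 popular categories
- Requirement description
- Implementation 1
  - Requirement analysis
  - Implementation
- Implementation 2
  - Requirement analysis
  - Implementation
- Implementation 3
  - Requirement analysis
  - Implementation
Requirement 2: Top 10 active sessions for each of the top 10 popular categories
- Requirement description
  - Requirement analysis
  - Implementation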

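Data description

Earlier posts covered the basics of Spark programming. Next, let's see how to use these APIs to implement concrete requirements in real work. The requirements come from a real e-commerce site, so prepare the data before implementing anything.
Data file link
Extraction code: xzc6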

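The figure above shows an excerpt from the data file. It records user behavior on the e-commerce site and covers four kinds of action: search, click, order, and payment. The data follows these rules:
(1) The fields in each line of the data file are separated by underscores
(2) Each line represents one user action, which can only be one of the four kinds
(3) If the search keyword is null, the record is not a search record
(4) If the clicked category ID and product ID are -1, the record is not a click record
(5) A single order can contain several products, so an order record may carry multiple category IDs and product IDs, separated by commas; if the record is not an order action, these fields are null
(6) Payment actions follow the same convention as order actions
Detailed field descriptions: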

Based on the fields above, define a case class:

// User visit action table
case class UserVisitAction(
    date: String,               // date of the user's click action
    user_id: Long,              // user ID
    session_id: String,         // session ID
    page_id: Long,              // ID of a page
    action_time: String,        // time of the action
    search_keyword: String,     // keyword the user searched for
    click_category_id: Long,    // ID of a product category
    click_product_id: Long,     // ID of a product
    order_category_ids: String, // IDs of all categories in one order
    order_product_ids: String,  // IDs of all products in one order
    pay_category_ids: String,   // IDs of all categories in one payment
    pay_product_ids: String,    // IDs of all products in one payment
    city_id: Long               // city ID
)
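The data rules and the case class can be tied together with a small parser. The following is a minimal sketch in plain Scala (no Spark needed), assuming every line has exactly 13 underscore-separated fields in the order of the case class. `ActionParser`, `parse`, `actionType`, and the sample line are hypothetical names and data introduced for illustration; they are not part of the original project.

```scala
// Same fields as the UserVisitAction case class in the text, repeated so the sketch is self-contained.
case class UserVisitAction(
    date: String, user_id: Long, session_id: String, page_id: Long,
    action_time: String, search_keyword: String,
    click_category_id: Long, click_product_id: Long,
    order_category_ids: String, order_product_ids: String,
    pay_category_ids: String, pay_product_ids: String, city_id: Long)

object ActionParser {
  // Assumption: 13 fields per line, separated by "_" (rule 1).
  def parse(line: String): UserVisitAction = {
    val f = line.split("_")
    UserVisitAction(f(0), f(1).toLong, f(2), f(3).toLong, f(4), f(5),
      f(6).toLong, f(7).toLong, f(8), f(9), f(10), f(11), f(12).toLong)
  }

  // Classify a record using rules 3 to 6: "null" and -1 mark fields that do not apply.
  def actionType(a: UserVisitAction): String =
    if (a.search_keyword != "null") "search"
    else if (a.click_category_id != -1L) "click"
    else if (a.order_category_ids != "null") "order"
    else if (a.pay_category_ids != "null") "pay"
    else "unknown"

  def main(args: Array[String]): Unit = {
    // Made-up sample line, not taken from the real data file.
    val line = "2019-07-17_95_session-a_37_2019-07-17 00:00:02_null_16_98_null_null_null_null_19"
    println(actionType(parse(line))) // prints "click"
  }
}
```

The implementations below skip the case class and index the split array directly (`datas(6)`, `datas(8)`, `datas(10)`), which are the click, order, and payment category fields.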

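Requirement 1: Top 10 popular categories

Requirement description

A category is a classification of products. Large e-commerce sites organize categories in several levels; this project uses a single level. Companies may define "popular" differently. Here we rank categories by their click, order, and payment counts.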

Shoes       click count    order count    payment count
Clothes     click count    order count    payment count
Computers   click count    order count    payment count

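For example: composite rank = click count × 20% + order count × 30% + payment count × 50%

This project refines the requirement as follows: rank by click count first, with higher counts ranking higher; if click counts are equal, compare order counts; if order counts are also equal, compare payment counts.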
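The two ranking rules can give different results. The sketch below contrasts them on made-up counts, using plain Scala collections. `RankingDemo`, its sample data, and the integer weights (the 20% / 30% / 50% weights multiplied by 10) are assumptions for illustration only.

```scala
object RankingDemo {
  type Counts = (Int, Int, Int) // (clicks, orders, payments)

  // Made-up sample counts, for illustration only.
  val sample: List[(String, Counts)] = List(
    ("shoes", (10, 3, 1)),
    ("clothes", (10, 5, 0)),
    ("computers", (7, 9, 9))
  )

  // Weighted score, scaled by 10 so it stays in integers: 20% / 30% / 50%.
  def weightedScore(c: Counts): Int = c._1 * 2 + c._2 * 3 + c._3 * 5

  // Rank by weighted score, highest first.
  def byWeight(data: List[(String, Counts)]): List[String] =
    data.sortBy { case (_, c) => -weightedScore(c) }.map(_._1)

  // Rank by clicks, then orders, then payments: the built-in tuple ordering
  // compares element by element, so reversing it gives the required order.
  def byTuple(data: List[(String, Counts)]): List[String] =
    data.sortBy(_._2)(Ordering[Counts].reverse).map(_._1)

  def main(args: Array[String]): Unit = {
    println(byWeight(sample)) // List(computers, clothes, shoes)
    println(byTuple(sample))  // List(clothes, shoes, computers)
  }
}
```

The Spark implementations in this post rely on the same tuple ordering when they call `sortBy(_._2, false)` on `(category ID, (clicks, orders, payments))` pairs.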

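Implementation 1

Requirement analysis

Count the clicks, orders, and payments of each category separately:
(category, total clicks) (category, total orders) (category, total payments)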
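Once the three count sets exist, they have to be combined per category without losing categories that appear in only one or two of them. An inner join would drop such categories, which is why the implementation uses cogroup. The sketch below mimics that combination step with local Scala maps instead of RDDs; `CombineCounts` and `combine` are hypothetical names, and the counts are made up.

```scala
object CombineCounts {
  // Combine three (category -> count) maps into (category -> (clicks, orders, payments)).
  // A category missing from a map contributes 0, as with the empty Iterable from cogroup.
  def combine(click: Map[String, Int],
              order: Map[String, Int],
              pay: Map[String, Int]): Map[String, (Int, Int, Int)] = {
    val allIds = click.keySet ++ order.keySet ++ pay.keySet
    allIds.map { id =>
      (id, (click.getOrElse(id, 0), order.getOrElse(id, 0), pay.getOrElse(id, 0)))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Made-up counts: category "3" has clicks only, category "9" has payments only.
    val merged = combine(Map("1" -> 5, "3" -> 2), Map("1" -> 1), Map("1" -> 1, "9" -> 4))
    merged.toList.sortBy(_._1).foreach(println)
  }
}
```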

Implementation

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Req1_HotCateforyTop10Analysis {

  def main(args: Array[String]): Unit = {
    // TODO: Top 10 popular categories
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("HotCateforyTop10Analysis")
    val sc = new SparkContext(sparkConf)

    // 1. Read the raw log data
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")

    // 2. Count clicks per category: (category ID, click count)
    val clickActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(6) != "-1"
    })
    val clickCountRDD: RDD[(String, Int)] = clickActionRDD.map(action => {
      val datas: Array[String] = action.split("_")
      (datas(6), 1)
    }).reduceByKey(_ + _)

    // 3. Count orders per category: (category ID, order count)
    val orderActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(8) != "null"
    })
    val orderCountRDD: RDD[(String, Int)] = orderActionRDD.flatMap(action => {
      val datas: Array[String] = action.split("_")
      val cid = datas(8)
      val cids = cid.split(",")
      // Flatten: one (id, 1) pair per category in the order
      cids.map(id => (id, 1))
    }).reduceByKey(_ + _)

    // 4. Count payments per category: (category ID, payment count)
    val payActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(10) != "null"
    })
    val payCountRDD: RDD[(String, Int)] = payActionRDD.flatMap(action => {
      val datas: Array[String] = action.split("_")
      val cid = datas(10)
      val cids = cid.split(",")
      // Flatten: one (id, 1) pair per category in the payment
      cids.map(id => (id, 1))
    }).reduceByKey(_ + _)

    // 5. Sort the categories and take the top 10
    // Sort by click count, then order count, then payment count
    // Tuple ordering: compare the first element, then the second, and so on
    // (category ID, (click count, order count, payment count))
    // To connect the data: join (no), zip (no), leftOuterJoin (no), cogroup (yes)
    val cogroupRDD: RDD[(String, (Iterable[Int], Iterable[Int], Iterable[Int]))] =
      clickCountRDD.cogroup(orderCountRDD, payCountRDD)
    val analysisRDD = cogroupRDD.mapValues {
      case (clickIter, orderIter, payIter) => {
        var clickCnt = 0
        val iter1 = clickIter.iterator
        if (iter1.hasNext) {
          clickCnt = iter1.next()
        }
        var orderCnt = 0
        val iter2 = orderIter.iterator
        if (iter2.hasNext) {
          orderCnt = iter2.next()
        }
        var payCnt = 0
        val iter3 = payIter.iterator
        if (iter3.hasNext) {
          payCnt = iter3.next()
        }
        (clickCnt, orderCnt, payCnt)
      }
    }
    val resultRDD: Array[(String, (Int, Int, Int))] = analysisRDD.sortBy(_._2, false).take(10)

    // 6. Print the result to the console
    resultRDD.foreach(println)

    sc.stop()
  }
}

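Implementation 2

Requirement analysis

Count the clicks, orders, and payments of each category in a single aggregation:
(category, (total clicks, total orders, total payments))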
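The idea is to pad every count into a three-slot tuple, with zeros in the slots that do not apply, so that all records share one shape and can be summed slot by slot. The sketch below shows that aggregation on a local Scala collection; `PadAndReduce`, `add`, and `aggregate` are hypothetical names, and in Spark the same `add` function would be passed to `reduceByKey`.

```scala
object PadAndReduce {
  type Counts = (Int, Int, Int) // (clicks, orders, payments)

  // Slot-by-slot addition of two padded tuples.
  def add(a: Counts, b: Counts): Counts = (a._1 + b._1, a._2 + b._2, a._3 + b._3)

  // Local stand-in for reduceByKey: group by category ID, then sum the tuples.
  def aggregate(records: Seq[(String, Counts)]): Map[String, Counts] =
    records.groupBy(_._1).map { case (cid, rows) => (cid, rows.map(_._2).reduce(add)) }

  def main(args: Array[String]): Unit = {
    // Made-up records: two clicks and one order for category "1", one payment for "2".
    val records = Seq(("1", (1, 0, 0)), ("1", (0, 1, 0)), ("2", (0, 0, 1)), ("1", (1, 0, 0)))
    aggregate(records).toList.sortBy(_._1).foreach(println)
  }
}
```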

Implementation

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark02_Req1_HotCateforyTop10Analysis1 {

  def main(args: Array[String]): Unit = {
    // TODO: Top 10 popular categories
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark02_Req1_HotCateforyTop10Analysis1")
    val sc = new SparkContext(sparkConf)

    // Q: actionRDD is reused several times
    // Q: cogroup may perform poorly

    // 1. Read the raw log data
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")
    actionRDD.cache() // addresses the repeated use of actionRDD

    // 2. Count clicks per category: (category ID, click count)
    val clickActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(6) != "-1"
    })
    val clickCountRDD: RDD[(String, Int)] = clickActionRDD.map(action => {
      val datas: Array[String] = action.split("_")
      (datas(6), 1)
    }).reduceByKey(_ + _)

    // 3. Count orders per category: (category ID, order count)
    val orderActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(8) != "null"
    })
    val orderCountRDD: RDD[(String, Int)] = orderActionRDD.flatMap(action => {
      val datas: Array[String] = action.split("_")
      val cid = datas(8)
      val cids = cid.split(",")
      // Flatten: one (id, 1) pair per category in the order
      cids.map(id => (id, 1))
    }).reduceByKey(_ + _)

    // 4. Count payments per category: (category ID, payment count)
    val payActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas: Array[String] = action.split("_")
      datas(10) != "null"
    })
    val payCountRDD: RDD[(String, Int)] = payActionRDD.flatMap(action => {
      val datas: Array[String] = action.split("_")
      val cid = datas(10)
      val cids = cid.split(",")
      // Flatten: one (id, 1) pair per category in the payment
      cids.map(id => (id, 1))
    }).reduceByKey(_ + _)

    // 5. Sort the categories and take the top 10
    // Sort by click count, then order count, then payment count
    // Tuple ordering: compare the first element, then the second, and so on
    // (category ID, (click count, order count, payment count))
    // To connect the data: join (no), zip (no), leftOuterJoin (no), cogroup (yes)
    // cogroup may involve a shuffle
    // Alternative approach:
    //   (category ID, click count)   => (category ID, (click count, 0, 0))
    //   (category ID, order count)   => (category ID, (0, order count, 0))
    //   (category ID, payment count) => (category ID, (0, 0, payment count))
    // then aggregate the tuples pairwise
    val rdd1 = clickCountRDD.map {
      case (cid, cnt) => {
        (cid, (cnt, 0, 0))
      }
    }
    val rdd2 = orderCountRDD.map {
      case (cid, cnt) => {
        (cid, (0, cnt, 0))
      }
    }
    val rdd3 = payCountRDD.map {
      case (cid, cnt) => {
        (cid, (0, 0, cnt))
      }
    }

    // Merge the three data sources and aggregate them together
    val sourceRDD: RDD[(String, (Int, Int, Int))] = rdd1.union(rdd2).union(rdd3)
    val analysisRDD: RDD[(String, (Int, Int, Int))] = sourceRDD.reduceByKey((t1, t2) => {
      (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
    })
    val resultRDD: Array[(String, (Int, Int, Int))] = analysisRDD.sortBy(_._2, false).take(10)

    // 6. Print the result to the console
    resultRDD.foreach(println)

    sc.stop()
  }
}

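Implementation 3

Requirement analysis

Aggregate the data with an accumulator, which avoids the shuffle caused by reduceByKey.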
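An accumulator works in two phases: each task updates its own copy through `add`, and the driver then combines the copies through `merge`. The sketch below imitates those two phases in plain Scala, without Spark, to show the counting and merging logic only. `CategoryCounter` and `AccumulatorSketch` are hypothetical names; the implementation in this post extends Spark's `AccumulatorV2` instead.

```scala
import scala.collection.mutable

object AccumulatorSketch {
  // Local stand-in for a custom accumulator: category ID -> (clicks, orders, payments).
  final class CategoryCounter {
    val counts: mutable.Map[String, (Int, Int, Int)] = mutable.Map()

    // Record one action of the given type for one category.
    def add(cid: String, actionType: String): Unit = {
      val (c, o, p) = counts.getOrElse(cid, (0, 0, 0))
      counts(cid) = actionType match {
        case "click" => (c + 1, o, p)
        case "order" => (c, o + 1, p)
        case "pay"   => (c, o, p + 1)
        case _       => (c, o, p) // unknown action types are ignored
      }
    }

    // Fold another counter's totals into this one, slot by slot.
    def merge(other: CategoryCounter): Unit =
      other.counts.foreach { case (cid, (c, o, p)) =>
        val (c0, o0, p0) = counts.getOrElse(cid, (0, 0, 0))
        counts(cid) = (c0 + c, o0 + o, p0 + p)
      }
  }

  def main(args: Array[String]): Unit = {
    // Two counters play the role of two tasks; the data is made up.
    val task1 = new CategoryCounter
    val task2 = new CategoryCounter
    task1.add("1", "click"); task1.add("1", "order")
    task2.add("1", "click"); task2.add("2", "pay")
    task1.merge(task2)
    task1.counts.toList.sortBy(_._1).foreach(println)
  }
}
```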

Implementation

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.util.AccumulatorV2
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable

object Spark04_Req1_HotCateforyTop10Analysis3 {

  def main(args: Array[String]): Unit = {
    // TODO: Top 10 popular categories
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark03_Req1_HotCateforyTop10Analysis2")
    val sc = new SparkContext(sparkConf)

    // Q: the previous version still has a shuffle (reduceByKey)
    // Use an accumulator instead

    // 1. Read the raw log data
    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")

    val acc = new HotCategoryAccumulator
    sc.register(acc, "HotCategory")

    // 2. Convert the data and feed it to the accumulator
    actionRDD.foreach(action => {
      val datas: Array[String] = action.split("_")
      if (datas(6) != "-1") {
        // Click
        acc.add((datas(6), "click"))
      } else if (datas(8) != "null") {
        // Order
        val ids: Array[String] = datas(8).split(",")
        ids.foreach(id => {
          acc.add((id, "order"))
        })
      } else if (datas(10) != "null") {
        // Payment (must be tagged "pay", otherwise it is counted as an order)
        val ids = datas(10).split(",")
        ids.foreach(id => {
          acc.add((id, "pay"))
        })
      }
    })

    val accVal: mutable.Map[String, HotCategory] = acc.value
    val categories: mutable.Iterable[HotCategory] = accVal.map(_._2)

    val sort: List[HotCategory] = categories.toList.sortWith((left, right) => {
      if (left.clickCnt > right.clickCnt) {
        true
      } else if (left.clickCnt == right.clickCnt) {
        if (left.orderCnt > right.orderCnt) {
          true
        } else if (left.orderCnt == right.orderCnt) {
          // Strict comparison, so that equal elements do not compare as "before" each other
          left.parCnt > right.parCnt
        } else {
          false
        }
      } else {
        false
      }
    })

    // 5. Print the result to the console
    sort.take(10).foreach(println)

    sc.stop()
  }

  case class HotCategory(cid: String, var clickCnt: Int, var orderCnt: Int, var parCnt: Int)

  /**
   * Custom accumulator
   * 1. Extend AccumulatorV2 and define the type parameters
   *    IN:  (category ID, action type)
   *    OUT: mutable.Map[String, HotCategory]
   *
   * 2. Override the methods (6 of them)
   */
  class HotCategoryAccumulator extends AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] {

    private val hcMap = mutable.Map[String, HotCategory]()

    override def isZero: Boolean = {
      hcMap.isEmpty
    }

    override def copy(): AccumulatorV2[(String, String), mutable.Map[String, HotCategory]] = {
      new HotCategoryAccumulator
    }

    override def reset(): Unit = {
      hcMap.clear()
    }

    override def add(v: (String, String)): Unit = {
      val cid: String = v._1
      val actionType: String = v._2
      val category: HotCategory = hcMap.getOrElse(cid, HotCategory(cid, 0, 0, 0))
      if (actionType == "click") {
        category.clickCnt += 1
      } else if (actionType == "order") {
        category.orderCnt += 1
      } else if (actionType == "pay") {
        category.parCnt += 1
      }
      hcMap.update(cid, category)
    }

    override def merge(other: AccumulatorV2[(String, String), mutable.Map[String, HotCategory]]): Unit = {
      val map1 = this.hcMap
      val map2 = other.value
      map2.foreach {
        case (cid, hc) => {
          val category: HotCategory = map1.getOrElse(cid, HotCategory(cid, 0, 0, 0))
          category.clickCnt += hc.clickCnt
          category.orderCnt += hc.orderCnt
          category.parCnt += hc.parCnt
          map1.update(cid, category)
        }
      }
    }

    override def value: mutable.Map[String, HotCategory] = hcMap
  }
}

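Requirement 2: Top 10 active sessions for each of the top 10 popular categories

Requirement description

Building on Requirement 1, add per-session click statistics for each category.

Requirement analysis

Count clicks by category ID and session ID, then restructure the result:
((category ID, session ID), sum) => (category ID, (session ID, sum))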
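The restructuring step moves the session ID out of the key so that the data can be grouped by category alone and each group sorted by click count. The sketch below walks through the same steps on a local Scala collection; `SessionTopN` and `topSessions` are hypothetical names and the click data is made up.

```scala
object SessionTopN {
  // clicks: one (category ID, session ID) pair per click record.
  // Returns, for each category, its n sessions with the most clicks.
  def topSessions(clicks: Seq[(String, String)], n: Int): Map[String, List[(String, Int)]] = {
    // ((category ID, session ID), sum); converted to a List before re-keying so that
    // sessions of the same category are not collapsed into one map entry
    val perSession: List[((String, String), Int)] =
      clicks.groupBy(identity).toList.map { case (key, hits) => (key, hits.size) }

    perSession
      .map { case ((cid, sid), sum) => (cid, (sid, sum)) } // (category ID, (session ID, sum))
      .groupBy(_._1)                                        // group by category ID
      .map { case (cid, rows) => (cid, rows.map(_._2).sortBy(-_._2).take(n)) }
  }

  def main(args: Array[String]): Unit = {
    val clicks = Seq(("1", "a"), ("1", "a"), ("1", "b"), ("2", "c"))
    topSessions(clicks, 1).toList.sortBy(_._1).foreach(println)
  }
}
```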

Implementation

package com.atguigu.bigdata.spark.core.rdd.req

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Spark05_Req2_HotCateforyTop10SessionAnalysis2 {

  def main(args: Array[String]): Unit = {
    // TODO: Top 10 active sessions for the top 10 popular categories
    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("Spark05_Req2_HotCateforyTop10SessionAnalysis2")
    val sc = new SparkContext(sparkConf)

    val actionRDD: RDD[String] = sc.textFile("datas/user_visit_action.txt")
    actionRDD.cache()
    val top10Ids: Array[String] = top10Category(actionRDD)

    // 1. Filter the raw data: keep click records whose category is in the top 10
    val filterActionRDD: RDD[String] = actionRDD.filter(action => {
      val datas = action.split("_")
      if (datas(6) != "-1") {
        top10Ids.contains(datas(6))
      } else {
        false
      }
    })

    // 2. Count clicks by category ID and session ID
    val reduceRDD: RDD[((String, String), Int)] = filterActionRDD.map(action => {
      val datas = action.split("_")
      ((datas(6), datas(2)), 1)
    }).reduceByKey(_ + _)

    // 3. Restructure the result
    // ((category ID, session ID), sum) => (category ID, (session ID, sum))
    val mapRDD = reduceRDD.map {
      case ((cid, sid), sum) => {
        (cid, (sid, sum))
      }
    }

    // 4. Group by category
    val groupRDD: RDD[(String, Iterable[(String, Int)])] = mapRDD.groupByKey()

    // 5. Sort each group by click count in descending order and take the top 10
    val resultRDD: RDD[(String, List[(String, Int)])] = groupRDD.mapValues(iter => {
      iter.toList.sortBy(_._2)(Ordering.Int.reverse).take(10)
    })

    resultRDD.collect().foreach(println)

    sc.stop()
  }

  def top10Category(actionRDD: RDD[String]) = {
    val flatRDD: RDD[(String, (Int, Int, Int))] = actionRDD.flatMap(action => {
      val datas: Array[String] = action.split("_")
      if (datas(6) != "-1") {
        // Click
        List((datas(6), (1, 0, 0)))
      } else if (datas(8) != "null") {
        // Order
        val ids: Array[String] = datas(8).split(",")
        ids.map(id => (id, (0, 1, 0)))
      } else if (datas(10) != "null") {
        // Payment
        val ids = datas(10).split(",")
        ids.map(id => (id, (0, 0, 1)))
      } else {
        Nil
      }
    })

    val analysisRDD: RDD[(String, (Int, Int, Int))] = flatRDD.reduceByKey((t1, t2) => {
      (t1._1 + t2._1, t1._2 + t2._2, t1._3 + t2._3)
    })

    analysisRDD.sortBy(_._2, false).take(10).map(_._1)
  }
}