当join两个大表的时候,对于其中较大的一个表存在少量倾斜很严重的key的时候,可以将这部分key先提取出来(distinct (key))和另外一个表join作为后续map join的小表来用。和下面的思想类似,分而治之。

Optimizing Skewed Joins

The Problem

A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer.

e.g., Suppose we have table A with a key column, "id" which has values 1, 2, 3 and 4, and table B with a similar column, which has values 1, 2 and 3.
We want to do a join corresponding to the following query

  • select A.id from A join B on A.id = B.id

A set of Mappers read the tables and gives them to Reducers based on the keys. e.g., rows with key 1 go to Reducer R1, rows with key 2 go to Reducer R2 and so on. These Reducers do a cross product of the values from A and B, and write the output. The Reducer R4 gets rows from A, but will not produce any results.

Now let's assume that A was highly skewed in favor of id = 1. Reducers R2 and R3 will complete quickly but R1 will continue for a long time, thus becoming the bottleneck. If the user has information about the skew, the bottleneck can be avoided manually as follows:

Do two separate queries

  • select A.id from A join B on A.id = B.id where A.id <> 1;
  • select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;

The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.

  • Advantages

    • If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks.
  • Disadvantages

    • The tables A and B have to be read and processed twice.
    • Because of the partial results, the results also have to be read and written twice.
    • The user needs to be aware of the skew in the data and manually do the above process.

We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:

  • If it has key 1, then use the hashed version of B to compute the result.
  • For all other keys, send it to a reducer which does the join. This reducer will get rows of B also from a mapper.

This way, we end up reading only B twice. The skewed keys in A are only read and processed by the Mapper, and not sent to the reducer. The rest of the keys in A go through only a single Map/Reduce.

The assumption is that B has few rows with keys which are skewed in A. So these rows can be loaded into the memory.

Skewed Join Optimization相关推荐

  1. spark触发adaptive skewed join的例子code

    1. 启动spark-shell,参数如下: spark-shell --conf spark.driver.allowMultipleContexts=true --conf spark.sql.a ...

  2. MySQL查询优化之七-左Join 和右Join 优化(Left Join and Right Join Optimization)

    MySQL查询优化之七-左Join 和右Join 优化(Left Join and Right Join Optimization) 如需转载请标明出处:http://blog.csdn.net/it ...

  3. Spark Skew Join Optimization

    数据倾斜在分布式计算中是一个很常见的问题,Spark提供了一种比较便捷的方法来处理一些简单的数据倾斜场景. Spark中定位数据倾斜 1.找到耗时长的stage并确定为shuffle stage. 2 ...

  4. Hadoop运维记录系列(三)

    Hive 0.10发布了,修正了一些bug,搞了一些新特性,对提高工作效率很有帮助,于是尝试升级了一下,然后遇到了一些问题,记录一下. 主要是看上了下面几个feature,打算换上看看. 1. All ...

  5. Spark3-AQE-数据倾斜Join优化

    Adaptive Query Exection(自适应查询计划)简称AQE,在最早在spark 1.6版本就已经有了AQE;到了spark 2.x版本,intel大数据团队进行了相应的原型开发和实践: ...

  6. Hive入门详解操作

    Hive 第一章 Hive简介 1.1. Hive的简介 1.1.1 hive出现的原因 FaceBook网站每天产生海量的结构化日志数据,为了对这些数据进行管理,并且因为机器学习的需求,产生了hiv ...

  7. Adaptive Query Execution: Speeding Up Spark SQL at Runtime

    This is a joint engineering effort between the Databricks Apache Spark engineering team - Wenchen Fa ...

  8. 全方位揭秘!大数据从0到1的完美落地之Hive企业级调优

    Hive企业级调优 调优原则已经在MR优化阶段已经有核心描述,优化Hive可以按照MR的优化思路来执行 优化的主要考虑方面: 环境方面:服务器的配置.容器的配置.环境搭建 具体软件配置参数: 代码级别 ...

  9. 看懂mysql执行计划--官方文档

    原文地址:https://dev.mysql.com/doc/refman/5.7/en/explain-output.html 9.8.2 EXPLAIN Output Format The EXP ...

最新文章

  1. jquery中not方法失效的解决方案
  2. 学习笔记-Redis设计与实现-链表
  3. 8086汇编学习小记-王爽汇编语言实验12
  4. 【采用】机器学习在金融大数据风险建模中的应用
  5. Python之pandas:pandas中缺失值与空值处理的简介及常用函数(drop()、dropna()、isna()、isnull()、fillna())函数详解之详细攻略
  6. linux ssh无需密码,linux下 ssh 实现无需密码的远程登陆
  7. 分布与并行计算—用任务管理器画CPU正弦曲线(Java)
  8. 从软件交付看软件验收管理
  9. UVA 12904 Load Balancing 暴力
  10. 解决Ajax中IE浏览器缓存问题
  11. [译]为什么Vue不支持templateURL
  12. Extjs GRID表格组件使用小结
  13. 写了个算分压电阻阻值的MATLAB小程序
  14. python爬虫贴吧_Python爬虫——抓取贴吧帖子
  15. Word怎么压缩变小?压缩word文档不妨试试这个方法
  16. Codeforces Beta Round #94 (Div. 1 Only)A. Statues
  17. Windows下Pidgin介绍/安装配置图文攻略
  18. Javaer换坑指南之Linux
  19. 10、wpf显示图片方式一: Image控件
  20. 抓包工具之httpwatch的使用

热门文章

  1. 数字经济的大航海时代
  2. Unity 3D : 解富士 RAF 檔案
  3. 10月25日 c语言 输入y=(sinx-cosx)/tanx
  4. PAT乙级-1048 数字加密
  5. 【嵌入式基础小知识】Nand Flash VS Nor Flash
  6. COB与COG各自优势
  7. 成都首秀,体验身边的AI——2019京东人工智能大会
  8. Oracle函数篇 - pivot行转列函数
  9. 字体圆润属性的使用-webkit-font-smoothing: antialiased
  10. Segment Anything使用手册(交互式数据标柱|自动数据标柱)