Skewed Join Optimization
当join两个大表的时候,对于其中较大的一个表存在少量倾斜很严重的key的时候,可以将这部分key先提取出来(distinct (key))和另外一个表join作为后续map join的小表来用。和下面的思想类似,分而治之。
Optimizing Skewed Joins
The Problem
A join of 2 large data tables is done by a set of MapReduce jobs which first sorts the tables based on the join key and then joins them. The Mapper gives all rows with a particular key to the same Reducer.
e.g., Suppose we have table A with a key column, "id" which has values 1, 2, 3 and 4, and table B with a similar column, which has values 1, 2 and 3.
We want to do a join corresponding to the following query
- select A.id from A join B on A.id = B.id
A set of Mappers read the tables and gives them to Reducers based on the keys. e.g., rows with key 1 go to Reducer R1, rows with key 2 go to Reducer R2 and so on. These Reducers do a cross product of the values from A and B, and write the output. The Reducer R4 gets rows from A, but will not produce any results.
Now let's assume that A was highly skewed in favor of id = 1. Reducers R2 and R3 will complete quickly but R1 will continue for a long time, thus becoming the bottleneck. If the user has information about the skew, the bottleneck can be avoided manually as follows:
Do two separate queries
- select A.id from A join B on A.id = B.id where A.id <> 1;
- select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;
The first query will not have any skew, so all the Reducers will finish at roughly the same time. If we assume that B has only few rows with B.id = 1, then it will fit into memory. So the join can be done efficiently by storing the B values in an in-memory hash table. This way, the join can be done by the Mapper itself and the data do not have to go to a Reducer. The partial results of the two queries can then be merged to get the final results.
- Advantages
- If a small number of skewed keys make up for a significant percentage of the data, they will not become bottlenecks.
- Disadvantages
- The tables A and B have to be read and processed twice.
- Because of the partial results, the results also have to be read and written twice.
- The user needs to be aware of the skew in the data and manually do the above process.
We can improve this further by trying to reduce the processing of skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Now run a set of mappers to read A and do the following:
- If it has key 1, then use the hashed version of B to compute the result.
- For all other keys, send it to a reducer which does the join. This reducer will get rows of B also from a mapper.
This way, we end up reading only B twice. The skewed keys in A are only read and processed by the Mapper, and not sent to the reducer. The rest of the keys in A go through only a single Map/Reduce.
The assumption is that B has few rows with keys which are skewed in A. So these rows can be loaded into the memory.
Skewed Join Optimization相关推荐
- spark触发adaptive skewed join的例子code
1. 启动spark-shell,参数如下: spark-shell --conf spark.driver.allowMultipleContexts=true --conf spark.sql.a ...
- MySQL查询优化之七-左Join 和右Join 优化(Left Join and Right Join Optimization)
MySQL查询优化之七-左Join 和右Join 优化(Left Join and Right Join Optimization) 如需转载请标明出处:http://blog.csdn.net/it ...
- Spark Skew Join Optimization
数据倾斜在分布式计算中是一个很常见的问题,Spark提供了一种比较便捷的方法来处理一些简单的数据倾斜场景. Spark中定位数据倾斜 1.找到耗时长的stage并确定为shuffle stage. 2 ...
- Hadoop运维记录系列(三)
Hive 0.10发布了,修正了一些bug,搞了一些新特性,对提高工作效率很有帮助,于是尝试升级了一下,然后遇到了一些问题,记录一下. 主要是看上了下面几个feature,打算换上看看. 1. All ...
- Spark3-AQE-数据倾斜Join优化
Adaptive Query Exection(自适应查询计划)简称AQE,在最早在spark 1.6版本就已经有了AQE;到了spark 2.x版本,intel大数据团队进行了相应的原型开发和实践: ...
- Hive入门详解操作
Hive 第一章 Hive简介 1.1. Hive的简介 1.1.1 hive出现的原因 FaceBook网站每天产生海量的结构化日志数据,为了对这些数据进行管理,并且因为机器学习的需求,产生了hiv ...
- Adaptive Query Execution: Speeding Up Spark SQL at Runtime
This is a joint engineering effort between the Databricks Apache Spark engineering team - Wenchen Fa ...
- 全方位揭秘!大数据从0到1的完美落地之Hive企业级调优
Hive企业级调优 调优原则已经在MR优化阶段已经有核心描述,优化Hive可以按照MR的优化思路来执行 优化的主要考虑方面: 环境方面:服务器的配置.容器的配置.环境搭建 具体软件配置参数: 代码级别 ...
- 看懂mysql执行计划--官方文档
原文地址:https://dev.mysql.com/doc/refman/5.7/en/explain-output.html 9.8.2 EXPLAIN Output Format The EXP ...
最新文章
- jquery中not方法失效的解决方案
- 学习笔记-Redis设计与实现-链表
- 8086汇编学习小记-王爽汇编语言实验12
- 【采用】机器学习在金融大数据风险建模中的应用
- Python之pandas:pandas中缺失值与空值处理的简介及常用函数(drop()、dropna()、isna()、isnull()、fillna())函数详解之详细攻略
- linux ssh无需密码,linux下 ssh 实现无需密码的远程登陆
- 分布与并行计算—用任务管理器画CPU正弦曲线(Java)
- 从软件交付看软件验收管理
- UVA 12904 Load Balancing 暴力
- 解决Ajax中IE浏览器缓存问题
- [译]为什么Vue不支持templateURL
- Extjs GRID表格组件使用小结
- 写了个算分压电阻阻值的MATLAB小程序
- python爬虫贴吧_Python爬虫——抓取贴吧帖子
- Word怎么压缩变小?压缩word文档不妨试试这个方法
- Codeforces Beta Round #94 (Div. 1 Only)A. Statues
- Windows下Pidgin介绍/安装配置图文攻略
- Javaer换坑指南之Linux
- 10、wpf显示图片方式一: Image控件
- 抓包工具之httpwatch的使用