
Scala 2.9 introduced parallel collections, which mirror most of the existing collections with a parallel version. Collections that have been parallelized this way receive a new method called par, which magically parallelizes certain operations on the collection.

For example, here is a sequential version:

scala> (1 to 5) foreach println
1
2
3
4
5

And the parallel version (note the extra par call):

scala> (1 to 5).par foreach println
1
4
3
2
5

Obviously, the ordering will change each time you run the parallel version.

This piqued my curiosity and I decided to dig a bit further, starting with investigating what exactly is happening behind the scenes.

First of all, the parallel collections are based on a paper called “A Generic Parallel Collection Framework”, by Martin Odersky, Aleksandar Prokopec, et al., which I highly recommend. It’s a very interesting analysis of how to decompose the concepts of parallelism, concurrency, collections, ranges and iterators and assemble them in a generic manner.

Sadly, that paper ended up being the only highlight of my research in this area, because the more I dug into the Scala parallel collections, the more disappointed I became. By now, I am struggling to find a good use case for parallel collections, and I’m hoping that this post will generate some positive responses about their use.

Here are some of the problems that I found with the parallel collections, starting with the one I think is the most important one.

Lack of control

My first reaction when I saw the output above was to try to verify that threads were indeed being spawned, and then to find out how many of them there were, how I could control the size of the thread pool, and so on.

I came up pretty much empty on all counts, and if I have missed the piece of documentation that explains this, I would love to see it, but browsing the sources of ParSeq and other classes produced no useful results.

This is a big deal, and probably the worst problem with this framework. The loop above generated a parallel range of five entries; did it spawn five threads? What happens if I try with 1,000? 100,000? The answer: it works for all these values, which makes me think that the loop is not allocating one thread per value, so it must be using a thread pool. But again: what size? Is that size configurable? How about the other characteristics of that thread pool?
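
One crude way to probe this (a sketch of my own, not anything the documentation suggests) is to record the name of every worker thread that ends up running the closure and count the distinct names afterwards:

import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

// Crude probe: collect the names of the threads that execute the body.
val seen = Collections.newSetFromMap(new ConcurrentHashMap[String, java.lang.Boolean]())
(1 to 100000).par foreach { _ => seen.add(Thread.currentThread.getName) }
println(seen.size)  // typically on the order of the number of available cores

On a typical machine this prints a number close to the number of cores, which suggests a fixed-size pool, but none of that is spelled out anywhere, let alone exposed as configuration.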

Digging deeper, what are the saturation and rejection policies? If the pool contains ten threads, what happens when it receives an eleventh value? It probably blocks, but can this be configured? Can the dispatch strategy be configured? Maybe I’m feeding it operations of widely varying durations and I want to make sure that the expensive operations don’t starve the fast ones; how can I do this?
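
To be concrete about the kind of knobs I mean, here is a rough sketch using java.util.concurrent directly, where the pool size, queue capacity and saturation policy are all explicit choices (the numbers are arbitrary):

import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Sketch only: every policy decision is visible and configurable here,
// which is exactly what par does not let you do.
val pool = new ThreadPoolExecutor(
  4, 4,                                     // core and maximum pool size
  0L, TimeUnit.MILLISECONDS,                // keep-alive for idle threads
  new ArrayBlockingQueue[Runnable](100),    // bounded work queue
  new ThreadPoolExecutor.CallerRunsPolicy() // saturation: run in the caller's thread
)

The CallerRunsPolicy above, for instance, gives you back-pressure when the queue fills up; other policies abort or silently discard work, and which one you want depends entirely on the workload.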

This absence of configuration is a big blow to the parallel framework, and it relegates its usage to the simplest cases, where it will most likely not bring much speed gain compared to sequential execution.

Silver bullet illusion

Over the past few months, I have seen quite a few users pop into the #scala channel and complain that parallel collections are not working. Taking a closer look at their code, it usually quickly becomes obvious that their algorithm is not parallelizable, and either 1) they didn’t realize it, or 2) they were aware of that fact but got the impression that par would magically take care of it.

Here is a quick example:

scala> Set(1,2,3,4,5) mkString(" ")
res149: String = 5 1 2 3 4

scala> Set(1,2,3,4,5).par mkString(" ")
res149: String = 5 1 2 3 4

You can run the par version over and over; the result will remain the same. This is confusing. Note that I used a Set this time, which indicates that I don’t care about the ordering of my collection. Calling mkString on the sequential version of my set reflects this. With this in mind, I would hope that calling mkString on the parallel version of my set would randomize its output, but that’s not what’s happening: I’m getting the same result as the sequential version, over and over.

It should be obvious that not all operations on collections can be parallelized (e.g. folds), but it looks like creating a string out of a set should be, and it’s not. I’m not going to dig too deep here because the explanation is a mix of implementation details and theoretical considerations (the catamorphic nature of folds, sets, the Scala inheritance hierarchy and the mkString specification), but the key point is that the parallelization of collections can lead to non-intuitive results.
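
To make the fold caveat concrete, here is a small sketch of my own: reduce assumes an associative operator, so a non-associative one such as subtraction behaves predictably on the sequential collection but not on the parallel one.

// Subtraction is not associative, so the parallel result depends on how
// the range happens to be split across workers.
val sequential = (1 to 100).reduce(_ - _)      // always -5048
val parallel   = (1 to 100).par.reduce(_ - _)  // can vary from run to run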

Bloat

I think the decision to retrofit the existing collections with the par operation was a mistake. Parallel operations come with a set of constraints that are not widely applicable to sequential collections, which means that not all the collections support par (e.g. there is no ParTraversable) and, more importantly, it imposes a burden on everyone, including people who don’t care about this functionality.

In doing this, Scala violates what I consider a fairly important rule for programming languages and APIs in general: you shouldn’t pay for what you don’t use. Not only do the parallel collections add a few megabytes to a jar file that’s already fairly big, but they probably introduce a great deal of complexity that is going to impact the maintainers of the collections (both sequential and parallel). It looks like anyone who wants to make modifications to the sequential collections will have to make sure their code doesn’t break the parallel collections, and vice versa.

Unproven gains

Scala 2.9 is still very recent, so it’s no surprise that we don’t really have any quantitative feedback on real-world gains, but I’ll make a prediction today: the courageous developers who decide to embrace the parallel collections wholeheartedly across their code base will see very little gain. In my experience, inner loops are hardly ever the bottleneck in large code bases, and I’d even go further and suspect that spawning threads for the elements of a loop could have adverse effects (context switching, memory thrashing, cache misses) for loops that iterate over very few elements or that already execute very fast operations. I’m mostly speculating here; I haven’t run any measurements, so I could be completely wrong.

Remedies

Because of all these problems, I am a bit underwhelmed by the usefulness of the parallel collection framework overall; maybe someone who has more extensive experience with it can chime in and share the benefits they have reaped from it.

I have a couple of suggestions that I think might be a better path for this kind of initiative:

  • Split up the parallel and sequential collections, remove par and make sure that both hierarchies can be evolved independently of each other.
  • Provide a nice Scala wrapper around the Executor framework. Executors have everything that anyone interested in low-level parallelism could dream of: configurable thread pool sizes, and even thread pools themselves, thread factories, saturation and rejection policies, lifecycle hooks, etc… You could write a Scala wrapper around this framework in a few hundred lines, and it would be much more useful than what is currently possible with par (a minimal sketch of what such a wrapper might look like follows this list).
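
To give an idea, here is a minimal sketch of what such a wrapper could look like; the object and method names (ParExec, parForeach) are just illustrative, and a real wrapper would obviously expose the pool configuration rather than hard-code it:

import java.util.concurrent.{Executors, TimeUnit}

// Minimal sketch of a Scala-friendly wrapper over the Executor framework.
// The caller chooses the pool size; everything else about the pool could be
// exposed the same way (thread factory, queue, rejection policy, ...).
object ParExec {
  def parForeach[A](xs: Seq[A], threads: Int)(f: A => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = xs.map(x => pool.submit(new Runnable { def run() { f(x) } }))
      futures.foreach(_.get())          // wait for all tasks, surfacing failures
    } finally {
      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }
}

// Usage:
// ParExec.parForeach(1 to 5, threads = 2)(println)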

Reposted from: https://my.oschina.net/aiguozhe/blog/39400
