
Scala 2.9 introduced parallel collections, which mirror most of the existing collections with a parallel version. Collections that have been parallelized this way receive a new method called par, which magically parallelizes certain operations on the collection.

For example, here is a sequential version:

scala> (1 to 5) foreach println
1
2
3
4
5

And the parallel version (note the extra par call):

scala> (1 to 5).par foreach println
1
4
3
2
5

Obviously, the ordering will change each time you run the parallel version.

This piqued my curiosity and I decided to dig a bit further, starting with investigating what exactly is happening behind the scenes.

First of all, the parallel collections are based on a paper called “A Generic Parallel Collection Framework”, by Martin Odersky, Aleksandar Prokopec, et al., which I highly recommend. It’s a very interesting analysis of how to decompose the concepts of parallelism, concurrency, collections, ranges and iterators and assemble them in a generic manner.

Sadly, that paper ended up being the only highlight of my research in this area, because the more I dug into the Scala parallel collections, the more disappointed I became. By now, I am struggling to find a good use case for parallel collections, and I’m hoping that this post will generate some positive responses about their use.

Here are some of the problems that I found with the parallel collections, starting with the one I think is the most important one.

Lack of control

My first reaction when I saw the output above was to try to verify that threads were indeed being spawned, and then to find out how many of them there were, how I could control the size of the thread pool, and so on.

I came up pretty much empty on all counts, and if I have missed the piece of documentation that explains this, I would love to see it, but browsing the sources of ParSeq and other classes produced no useful results.

This is a big deal, and probably the worst problem with this framework. The loop above generated a parallel range of five entries; did it spawn five threads? What happens if I try with 1,000? 100,000? The answer: it works for all these values, which makes me think that the loop is not allocating one thread per value, so it must be using a thread pool. But again: what size? Is that size configurable? How about the other characteristics of that thread pool?
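
One crude way to probe this (a sketch of my own, not anything the documentation suggests) is to record the name of every worker thread that ends up running the closure and count the distinct names afterwards:

import java.util.Collections
import java.util.concurrent.ConcurrentHashMap

// Crude probe: collect the names of the threads that execute the body.
val seen = Collections.newSetFromMap(new ConcurrentHashMap[String, java.lang.Boolean]())
(1 to 100000).par foreach { _ => seen.add(Thread.currentThread.getName) }
println(seen.size)  // typically on the order of the number of available cores

On a typical machine this prints a number close to the number of cores, which suggests a fixed-size pool, but none of that is spelled out anywhere, let alone exposed as configuration.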

Digging deeper, what are the saturation and rejection policies? If the pool contains ten threads, what happens when it receives an eleventh value? It probably blocks, but can this be configured? Can the dispatch strategy be configured? Maybe I’m feeding it operations of widely varying durations and I want to make sure that the expensive operations don’t starve the fast ones; how can I do this?
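
To be concrete about the kind of knobs I mean, here is a rough sketch using java.util.concurrent directly, where the pool size, queue capacity and saturation policy are all explicit choices (the numbers are arbitrary):

import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Sketch only: every policy decision is visible and configurable here,
// which is exactly what par does not let you do.
val pool = new ThreadPoolExecutor(
  4, 4,                                     // core and maximum pool size
  0L, TimeUnit.MILLISECONDS,                // keep-alive for idle threads
  new ArrayBlockingQueue[Runnable](100),    // bounded work queue
  new ThreadPoolExecutor.CallerRunsPolicy() // saturation: run in the caller's thread
)

The CallerRunsPolicy above, for instance, gives you back-pressure when the queue fills up; other policies abort or silently discard work, and which one you want depends entirely on the workload.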

This absence of configuration is a big blow to the parallel framework, and it relegates its usage to the simplest cases, where it will most likely not bring much speed gain compared to sequential execution.

Silver bullet illusion

Over the past few months, I have seen quite a few users pop into the #scala channel and complain that parallel collections are not working. Taking a closer look at their code, it usually quickly becomes obvious that their algorithm is not parallelizable, and either 1) they didn’t realize it, or 2) they were aware of that fact but got the impression that par would magically take care of it.

Here is a quick example:

scala> Set(1,2,3,4,5) mkString(" ")
res149: String = 5 1 2 3 4

scala> Set(1,2,3,4,5).par mkString(" ")
res149: String = 5 1 2 3 4

You can run the par version over and over; the result will remain the same. This is confusing. Note that I used a Set this time, which indicates that I don’t care about the ordering of my collection. Calling mkString on the sequential version of my set reflects this. With this in mind, I would hope that calling mkString on the parallel version of my set would randomize its output, but that’s not what’s happening: I’m getting the same result as the sequential version, over and over.

It should be obvious that not all operations on collections can be parallelized (e.g. folds), but it looks like creating a string out of a set should be, and it’s not. I’m not going to dig too deep here because the explanation is a mix of implementation details and theoretical considerations (the catamorphic nature of folds, sets, the Scala inheritance hierarchy and the mkString specification), but the key point is that the parallelization of collections can lead to non-intuitive results.
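
To make the fold caveat concrete, here is a small sketch of my own: reduce assumes an associative operator, so a non-associative one such as subtraction behaves predictably on the sequential collection but not on the parallel one.

// Subtraction is not associative, so the parallel result depends on how
// the range happens to be split across workers.
val sequential = (1 to 100).reduce(_ - _)      // always -5048
val parallel   = (1 to 100).par.reduce(_ - _)  // can vary from run to run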

Bloat

I think the decision to retrofit the existing collections with the par operation was a mistake. Parallel operations come with a set of constraints that are not widely applicable to sequential collections, which means that not all the collections support par (e.g. there is no ParTraversable) and, more importantly, it imposes a burden on everyone, including people who don’t care about this functionality.

In doing this, Scala violates what I consider a fairly important rule for programming languages and APIs in general: you shouldn’t pay for what you don’t use. Not only do the parallel collections add a few megabytes to a jar file that’s already fairly big, but they probably introduce a great deal of complexity that is going to impact the maintainers of the collections (both sequential and parallel). It looks like anyone who wants to make modifications to the sequential collections will have to make sure their code doesn’t break the parallel collections, and vice versa.

Unproven gains

Scala 2.9 is still very recent, so it’s no surprise that we don’t really have any quantitative feedback on real-world gains, but I’ll make a prediction today: the courageous developers who decide to embrace the parallel collections wholeheartedly across their code base will see very little gain. In my experience, inner loops are hardly ever the bottleneck in large code bases, and I’d even go further and suspect that spawning threads for the elements of a loop could have adverse effects (context switching, memory thrashing, cache misses) for loops that iterate over very few elements or that already execute very fast operations. I’m mostly speculating here; I haven’t run any measurements, so I could be completely wrong.

Remedies

Because of all these problems, I am a bit underwhelmed by the usefulness of the parallel collection framework overall; maybe someone who has more extensive experience with it can chime in and share the benefits they have reaped from it.

I have a couple of suggestions that I think might be a better path for this kind of initiative:

  • Split up the parallel and sequential collections, remove par and make sure that both hierarchies can be evolved independently of each other.
  • Provide a nice Scala wrapper around the Executor framework. Executors have everything that anyone interested in low-level parallelism could dream of: configurable thread pool sizes, and even thread pools themselves, thread factories, saturation and rejection policies, lifecycle hooks, etc… You could write a Scala wrapper around this framework in a few hundred lines, and it would be much more useful than what is currently possible with par (a minimal sketch of what such a wrapper might look like follows this list).
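
To give an idea, here is a minimal sketch of what such a wrapper could look like; the object and method names (ParExec, parForeach) are just illustrative, and a real wrapper would obviously expose the pool configuration rather than hard-code it:

import java.util.concurrent.{Executors, TimeUnit}

// Minimal sketch of a Scala-friendly wrapper over the Executor framework.
// The caller chooses the pool size; everything else about the pool could be
// exposed the same way (thread factory, queue, rejection policy, ...).
object ParExec {
  def parForeach[A](xs: Seq[A], threads: Int)(f: A => Unit): Unit = {
    val pool = Executors.newFixedThreadPool(threads)
    try {
      val futures = xs.map(x => pool.submit(new Runnable { def run() { f(x) } }))
      futures.foreach(_.get())          // wait for all tasks, surfacing failures
    } finally {
      pool.shutdown()
      pool.awaitTermination(1, TimeUnit.MINUTES)
    }
  }
}

// Usage:
// ParExec.parForeach(1 to 5, threads = 2)(println)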

Reposted from: https://my.oschina.net/aiguozhe/blog/39400
