A recent email from Reynold Xin to the Spark developers suggests that the Spark community may well skip the Spark 1.7 release and move directly to Spark 2.x.

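  If Spark 2.x is released, it will:
  (1) Build Spark with Scala 2.11 by default, while still supporting Scala 2.10.
  (2) Remove support for Hadoop 1.x. Support for versions below Hadoop 2.2 may also be removed, since Hadoop 2.0 and 2.1 were alpha and beta releases respectively; versions below Hadoop 2.6 might even be dropped altogether.
  (3) Remove the interfaces, configs, and modules (e.g. Bagel) marked as deprecated in Spark 1.x.
  (4) Remove the Akka dependency from streaming.
  (5) Remove the Guava dependency from Spark's public API.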

For details, see the email:

I’m starting a new thread since the other one got intermixed with feature requests. Please refrain from making feature requests in this thread. Not that we shouldn’t be adding features, but we can always add features in 1.7, 2.1, 2.2, ...

First - I want to propose a premise for how to think about Spark 2.0 and major releases in Spark, based on discussion with several members of the community: a major release should be low overhead and minimally disruptive to the Spark community. A major release should not be very different from a minor release and should not be gated based on new features. The main purpose of a major release is an opportunity to fix things that are broken in the current API and remove certain deprecated APIs (examples follow).

For this reason, I would *not* propose doing major releases to break substantial APIs or perform large re-architecting that prevents users from upgrading. Spark has always had a culture of evolving architecture incrementally and making changes - and I don't think we want to change this model. In fact, we’ve released many architectural changes on the 1.X line.

If the community likes the above model, then to me it seems reasonable to do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of major releases every 2 years seems doable within the above model.

Under this model, here is a list of example things I would propose doing in Spark 2.0, separated into APIs and Operation/Deployment:

APIs

1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 1.x.

2. Remove Akka from Spark’s API dependency (in streaming), so user applications can use Akka (SPARK-5293). We have gotten a lot of complaints about user applications being unable to use Akka due to Spark’s dependency on Akka.

3. Remove Guava from Spark’s public API (JavaRDD Optional).

4. Better class package structure for low level developer API’s. In particular, we have some DeveloperApi (mostly various listener-related classes) added over the years. Some packages include only one or two public classes but a lot of private classes. A better structure is to have public classes isolated to a few public packages, and these public packages should have minimal private classes for low level developer APIs.

5. Consolidate task metric and accumulator API. Although they have some subtle differences, these two are very similar but have completely different code paths.

6. Possibly making Catalyst, Dataset, and DataFrame more general by moving them to other package(s). They are already used beyond SQL, e.g. in ML pipelines, and will be used by streaming also.

Operation/Deployment

1. Scala 2.11 as the default build. We should still support Scala 2.10, but it has reached end-of-life.

2. Remove Hadoop 1 support.

3. Assembly-free distribution of Spark: don’t require building an enormous assembly jar in order to run Spark.
