RDD, DataFrame or Dataset

@(SPARK)[spark]

The main content of this article comes from:
https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
http://www.agildata.com/apache-spark-rdd-vs-dataframe-vs-dataset/

Summary:
1. An RDD is a collection of Java objects. Its advantage is a more object-oriented style, which makes the code easier to understand. However, whenever data has to be shipped around the cluster, both the values and the structure information must be kept for every object, which is redundant and also produces a lot of GC pressure.
2. DataFrame was introduced in Spark 1.3. It consists of two parts, the data and a schema, where the data holds the raw values rather than Java objects. It is harder to reason about, its Java support is weak, and it is not strongly typed, so some errors are only discovered at runtime. The upside is that the data does not have to be materialized as Java objects, which reduces GC and greatly improves the efficiency of shipping data across the cluster and of local serialization.
3. Dataset was introduced as a preview in 1.6 and only became stable in 2.0. It tries to combine the strengths of RDD and DataFrame. In Spark 2.0 the positioning of Dataset is: (1) DataFrame is just a type alias; the real implementation is Dataset. (2) For languages without compile-time type safety, such as Python and R, DataFrame remains the primary programming interface.

  • Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. map, filter, groupByKey) and the untyped methods (e.g. select, groupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming. Since compile-time type-safety in Python and R is not a language feature, the concept of Dataset does not apply to these languages’ APIs. Instead, DataFrame remains the primary programming abstraction, which is analogous to the single-node data frame notion in these languages. (A minimal Scala sketch follows this list.)

  • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API. (A rough pipeline sketch also follows below.)
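
To make the first point concrete, here is a minimal sketch of the unified API in Spark 2.0. The Person case class, the people.json path and the SparkSession setup are illustrative assumptions, not taken from the original posts.
Scala:

import org.apache.spark.sql.SparkSession

case class Person(last: String, age: Long)

val spark = SparkSession.builder().appName("unified-api").getOrCreate()
import spark.implicits._

// In Spark 2.0, DataFrame is just a type alias for Dataset[Row].
val df = spark.read.json("people.json")   // untyped: DataFrame == Dataset[Row]
val ds = df.as[Person]                    // typed:   Dataset[Person]

ds.filter(_.age > 21)                     // typed method: lambda over Person objects
df.filter("age > 21")                     // untyped method: expression over column names
ds.groupByKey(_.last).count()             // typed aggregation
df.groupBy("last").count()                // untyped aggregation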
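
To illustrate the second point, a rough sketch of the DataFrame-based spark.ml pipeline style; the stages are the standard Tokenizer/HashingTF/LogisticRegression combination, and the training data here is invented for illustration (it reuses the spark session from the sketch above).
Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A tiny, made-up training DataFrame with "text" and "label" columns.
val training = spark.createDataFrame(Seq(
  (0L, "spark rdd dataframe dataset", 1.0),
  (1L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Each stage transforms a DataFrame into another DataFrame; the final
// estimator produces a fitted model.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)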

There Are Now 3 Apache Spark APIs. Here’s How to Choose the Right One
Apache Spark is evolving at a rapid pace, including changes and additions to core APIs. One of the most disruptive areas of change is around the representation of data sets. Spark 1.0 used the RDD API, but in the past twelve months two new, alternative and incompatible APIs have been introduced. Spark 1.3 introduced the radically different DataFrame API, and the recently released Spark 1.6 introduces a preview of the new Dataset API.
Many existing Spark developers will be wondering whether to jump from RDDs directly to the Dataset API, or whether to first move to the DataFrame API. Newcomers to Spark will have to choose which API to start learning with.
This article provides an overview of each of these APIs, and outlines the strengths and weaknesses of each one. A companion GitHub repository provides working examples that are a good starting point for experimentation with the approaches outlined in this article.

RDD API
The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. This interface and its Java equivalent, JavaRDD, will be familiar to any developers who have worked through the standard Spark tutorials. From a developer’s perspective, an RDD is simply a set of Java or Scala objects representing data.
The RDD API provides many transformation methods, such as map(), filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

Example of RDD transformations and actions
Scala:

rdd.filter(_.age > 21)              // transformation
   .map(_.last)                     // transformation
   .saveAsObjectFile("under21.bin") // action

Java:

rdd.filter(p -> p.getAge() < 21)     // transformation
   .map(p -> p.getLast())            // transformation
   .saveAsObjectFile("under21.bin"); // action

The main advantage of RDDs is that they are simple and well understood because they deal with concrete classes, providing a familiar object-oriented programming style with compile-time type-safety. For example, given an RDD containing instances of Person we can filter by age by referencing the age attribute of each Person object:

Example: Filter by attribute with RDD
Scala:

rdd.filter(_.age > 21)

Java:

rdd.filter(person -> person.getAge() > 21)

The main disadvantage to RDDs is that they don’t perform particularly well. Whenever Spark needs to distribute the data within the cluster, or write the data to disk, it does so using Java serialization by default (although it is possible to use Kryo as a faster alternative in most cases). The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes (each serialized object contains the class structure as well as the values). There is also the overhead of garbage collection that results from creating and destroying individual objects.
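
When RDD serialization cost matters, switching the serializer to Kryo is the usual first step. A minimal sketch, using the same hypothetical Person class as the examples in this article:
Scala:

import org.apache.spark.{SparkConf, SparkContext}

case class Person(last: String, age: Int)

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Replace the default Java serialization with Kryo for shuffled and cached data.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small numeric ID instead of the full
  // class name with every serialized object.
  .registerKryoClasses(Array(classOf[Person]))

val sc = new SparkContext(conf)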

DataFrame API
Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. There are also advantages when performing computations in a single process as Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs associated with constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data.
The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans, but not natural for the majority of developers. The query plan can be built from SQL expressions in strings or from a more functional approach using a fluent-style API.

Example: Filter by attribute with DataFrame
Note that these examples have the same syntax in both Java and Scala.
SQL style:

df.filter("age > 21");

Expression builder style:

df.filter(df.col("age").gt(21));

Because the code refers to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect then the error will only be detected at runtime, when the query plan is created.
Another downside of the DataFrame API is that it is very Scala-centric and while it does support Java, the support is limited. For example, when creating a DataFrame from an existing RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out of the box because they implement this interface.
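
For example, a Scala case class (which implements scala.Product) gets schema inference for free. A short sketch against the Spark 1.x API used in this article, with invented Person data:
Scala:

import org.apache.spark.sql.SQLContext

case class Person(last: String, age: Int)

val sqlContext = new SQLContext(sc)   // sc is an existing SparkContext
import sqlContext.implicits._

// The schema (last: string, age: int) is inferred from the case class fields.
val peopleDF = sc.parallelize(Seq(Person("Smith", 35), Person("Jones", 18))).toDF()
peopleDF.printSchema()
peopleDF.filter(peopleDF.col("age").gt(21)).show()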

Dataset API
The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.
When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.
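
A small sketch of how the built-in encoders are obtained in Scala (Encoders.product and Encoders.scalaInt are part of the public API; the Person case class is the illustrative one used throughout this article):
Scala:

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(last: String, age: Int)

// Built-in encoder for a case class (a Product type). The generated code reads
// individual fields such as age straight from Spark's binary format, without
// materializing a whole Person object.
val personEncoder: Encoder[Person] = Encoders.product[Person]

// Encoders for primitive types are also built in.
val intEncoder: Encoder[Int] = Encoders.scalaInt

// In Scala code these are normally supplied implicitly:
//   import sqlContext.implicits._
//   sqlContext.createDataset(sampleData)
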
Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant. In writing the examples to accompany this article, we ran into errors when trying to create a Dataset in Java from a list of Java objects that were not fully bean-compliant.

Example: Creating Dataset from a list of objects
Scala

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sampleData: Seq[ScalaPerson] = ScalaData.sampleData()
val dataset = sqlContext.createDataset(sampleData)

Java

JavaSparkContext sc = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sc);
List<JavaPerson> data = JavaData.sampleData();
Dataset<JavaPerson> dataset = sqlContext.createDataset(data, Encoders.bean(JavaPerson.class));

Transformations with the Dataset API look very much like the RDD API and deal with the Person class rather than an abstraction of a row.

Example: Filter by attribute with Dataset
Scala

dataset.filter(_.age < 21)

Java

dataset.filter(person -> person.getAge() < 21);

Despite the similarity with RDD code, this code is building a query plan, rather than dealing with individual objects, and if age is the only attribute accessed, then the rest of the object’s data will not be read from off-heap storage.
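
One way to observe this is to print the plan that Catalyst builds instead of executing it. A brief sketch, assuming the dataset of Person objects created above and the unified Dataset API of Spark 2.0:
Scala:

// Prints the parsed, analyzed, optimized and physical plans for the typed
// filter; nothing is executed until an action is called.
dataset.filter(_.age < 21).explain(true)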

Conclusions
If you are developing primarily in Java then it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs. Although there is an effort to support Java, Spark is written in Scala and the code often makes assumptions that make it hard (but not impossible) to deal with Java objects.
If you are developing in Scala and need your code to go into production with Spark 1.6.0 then the DataFrame API is clearly the most stable option available and currently offers the best performance.
However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark it is likely that this API will mature very quickly through 2016 and become the de-facto API for developing new applications.
