Spark - SizeEstimator.estimate: Where Does All the Time Go in Byte Estimation

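I. Introduction

The org.apache.spark.util.SizeEstimator class provides an estimate method that estimates the number of bytes a given object occupies on the JVM heap. The estimate includes the space taken by the objects it references, the objects those reference, and so on. In Spark it is mainly used to work out how much memory data such as a broadcast variable will occupy. As the name says, the result is only an estimate of the object's footprint on the JVM heap, not an exact measurement.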

II. Usage

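    import org.apache.spark.util.SizeEstimator

    // "className" is a placeholder; this assumes the class has a no-arg constructor.
    // Pass an instance, not the Class object: estimate skips Class[_] and ClassLoader objects.
    val yourObject = Class.forName("className").getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
    val size = SizeEstimator.estimate(yourObject)
    println(s"Estimated size: $size")

yourObject is the object to estimate. It can be an instance of a built-in Scala / Java data class or of any class you define yourself.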

estimate computes a ClassInfo for the class of the object being estimated. ClassInfo has two members, shellSize and pointerFields, which hold the size information cached for that class:

    private class ClassInfo(
      val shellSize: Long,
      val pointerFields: List[Field]) {}

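A. shellSize:

The size of all non-static fields plus the size of java.lang.Object itself. You can think of it as the space taken by the variables declared inside the class, where a reference field counts only as a pointer.

B. pointerFields:

The fields that hold references to other objects, for example a HashMap field inside the class, which in turn references its keys and values.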
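To make the split between the two members more concrete, here is a minimal sketch that uses plain Java reflection to separate a class's non-static fields into primitive fields (which would contribute to the shell) and reference fields (candidates for pointerFields). It is not Spark's implementation; FieldSplitSketch, Sample and splitFields are hypothetical names introduced only for this illustration.

```scala
import java.lang.reflect.Modifier

object FieldSplitSketch {
  // Hypothetical class used only for this illustration.
  class Sample {
    val count: Int = 1                                   // primitive field
    val ratio: Double = 0.5                              // primitive field
    val name: String = "x"                               // reference field
    val items = new java.util.HashMap[String, AnyRef]()  // reference field
  }

  // Returns (names of primitive fields, names of reference fields),
  // skipping static fields and walking up the superclass chain.
  def splitFields(cls: Class[_]): (List[String], List[String]) = {
    var primitives = List.empty[String]
    var references = List.empty[String]
    var current: Class[_] = cls
    while (current != null) {
      for (field <- current.getDeclaredFields if !Modifier.isStatic(field.getModifiers)) {
        if (field.getType.isPrimitive) primitives ::= field.getName
        else references ::= field.getName
      }
      current = current.getSuperclass
    }
    (primitives.sorted, references.sorted)
  }
}
```

Calling FieldSplitSketch.splitFields(classOf[FieldSplitSketch.Sample]) puts count and ratio in the first list and name and items in the second. Depending on how the class is compiled, the second list may also contain compiler-generated reference fields.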

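III. Caveats

1. Byte estimation

SizeEstimator.estimate relies mainly on the private estimate method and on visitSingleObject to size an object. Internally it works off a queue: enqueue puts an object on the queue, then a while loop calls dequeue and adds up the sizes until the queue is empty.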

A. Main method: estimate

It initializes a SearchState, which records the traversal state, and then calls visitSingleObject to size each object in turn.

    private def estimate(obj: AnyRef, visited: IdentityHashMap[AnyRef, AnyRef]): Long = {
      val state = new SearchState(visited)
      state.enqueue(obj)
      while (!state.isFinished) {
        visitSingleObject(state.dequeue(), state)
      }
      state.size
    }
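The loop above can be sketched on a toy object graph without Spark. The following is a simplified illustration, not Spark's code: Node, State and WorklistSketch are hypothetical names, each node carries a made-up shell size, and the sketch assumes the visited IdentityHashMap is used to skip objects that were already enqueued.

```scala
import java.util.IdentityHashMap
import scala.collection.mutable.ArrayBuffer

object WorklistSketch {
  // Hypothetical toy node: a fixed "shell" size plus references to other nodes.
  final class Node(val shell: Long, var refs: List[Node] = Nil)

  // Holds the pending nodes, the identity-based visited set, and the running total.
  final class State(val visited: IdentityHashMap[AnyRef, AnyRef]) {
    private val pending = new ArrayBuffer[Node]
    var size = 0L

    // Only nodes that have not been seen before are added to the work list.
    def enqueue(node: Node): Unit =
      if (node != null && !visited.containsKey(node)) {
        visited.put(node, null)
        pending += node
      }

    def isFinished: Boolean = pending.isEmpty

    // Removes and returns the most recently added pending node.
    def dequeue(): Node = pending.remove(pending.length - 1)
  }

  // Adds up the shell sizes of every node reachable from root, visiting each once.
  def estimate(root: Node): Long = {
    val state = new State(new IdentityHashMap[AnyRef, AnyRef])
    state.enqueue(root)
    while (!state.isFinished) {
      val node = state.dequeue()
      state.size += node.shell
      node.refs.foreach(state.enqueue)
    }
    state.size
  }
}
```

Because the visited set compares by identity, the traversal terminates on cyclic graphs, and the cost grows with the number of reachable objects.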

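B. Helper method: visitSingleObject

This method sizes one object taken from the queue. It ignores objects from scala.reflect as well as ClassLoader and Class objects. The former reference global reflection objects, which in turn reference many other large global objects; the latter reference the whole REPL. Either kind of reference would distort the estimate for the object being measured.

The case _ branch shows the actual accumulation: it first adds the class's full shellSize (the size of its own fields), then pushes every object referenced through a pointer field onto the queue so that it gets sized as well, until the queue is drained.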

    private def visitSingleObject(obj: AnyRef, state: SearchState): Unit = {
      val cls = obj.getClass
      if (cls.isArray) {
        visitArray(obj, cls, state)
      } else if (cls.getName.startsWith("scala.reflect")) {
        // Many objects in the scala.reflect package reference global reflection objects which, in
        // turn, reference many other large global objects. Do nothing in this case.
      } else if (obj.isInstanceOf[ClassLoader] || obj.isInstanceOf[Class[_]]) {
        // Hadoop JobConfs created in the interpreter have a ClassLoader, which greatly confuses
        // the size estimator since it references the whole REPL. Do nothing in this case. In
        // general all ClassLoaders and Classes will be shared between objects anyway.
      } else {
        obj match {
          case s: KnownSizeEstimation =>
            state.size += s.estimatedSize
          case _ =>
            val classInfo = getClassInfo(cls)
            state.size += alignSize(classInfo.shellSize)
            for (field <- classInfo.pointerFields) {
              state.enqueue(field.get(obj))
            }
        }
      }
    }

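Tips:

Referenced objects are pushed onto the queue through enqueue, and the SearchState carries the visited IdentityHashMap that is passed to estimate. An object that has already been seen is not enqueued again, so when several fields or containers inside one object graph point to the same instance, a single estimate call counts that instance once. Double counting appears when objects that share references are estimated in separate calls and the results are added together.

    import scala.collection.mutable

    class testEstimate {
      val mapA = new mutable.HashMap[String, Object]()
      val mapB = new mutable.HashMap[Int, Object]()
      val mapC = new mutable.HashMap[Double, Object]()
      val testObject = new Object()
      // All three maps reference the same instance.
      mapA("string") = testObject
      mapB(1) = testObject
      mapC(1D) = testObject
    }

For example, mapA, mapB and mapC in testEstimate all point to the same testObject, which occupies Size(testObject) x 1 in memory. Estimating a testEstimate instance in one call counts testObject once, but estimating mapA, mapB and mapC separately and summing the three results includes Size(testObject) x 3. The more shared references the data contains, the less accurate such summed estimates become.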

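2. Performance

estimate walks every field of the object and, through the queue, every object it references together with their fields and references. For a complex object graph the call can therefore take a long time. If your code is performance sensitive, pay attention to where this method is called.

A. Estimating an array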

    import org.apache.spark.util.SizeEstimator

    def estimator(k: Int): Unit = {
      val testArray = (0 to k).toArray
      val st = System.currentTimeMillis()
      val size = SizeEstimator.estimate(testArray)
      println(s"Estimated size: $size elapsed: ${System.currentTimeMillis() - st} ms")
    }

    estimator(10000)

Estimated size: 40024 elapsed: 324 ms
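The reported 40024 bytes can be reproduced by hand. The sketch below is a back-of-the-envelope calculation under assumptions that are not stated in the original: a 64-bit JVM with compressed oops, a 16-byte array header (12-byte object header plus a 4-byte length), 4 bytes per Int, and sizes rounded up to a multiple of 8. ArraySizeSketch and its methods are hypothetical names, not Spark APIs.

```scala
object ArraySizeSketch {
  // Assumption: sizes are rounded up to the next multiple of 8 bytes.
  def alignSize(size: Long): Long = (size + 7L) & ~7L

  // Assumptions: 16-byte array header and 4 bytes per Int element.
  def intArraySize(length: Int): Long =
    alignSize(16L) + alignSize(length.toLong * 4L)
}
```

(0 to 10000).toArray holds 10001 elements, so the payload is 10001 x 4 = 40004 bytes, rounded up to 40008; adding the 16-byte header gives 40024, which matches the output above.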

B. Estimating a DataBase object

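    // dataBase is the large object being measured; its construction is not shown here.
    val start = System.currentTimeMillis()
    val dataBaseSize = SizeEstimator.estimate(dataBase)
    val end = System.currentTimeMillis()
    println(s"DB size: $dataBaseSize elapsed: ${end - start} ms")

DB size: 2605368184 elapsed: 44033 ms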
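When measuring calls like these, a small helper keeps the start and end bookkeeping in one place. This is a minimal sketch with hypothetical names (TimingSketch, timed, timedAfterWarmup). The warm-up variant rests on the assumption that the first call may include one-time costs such as class loading, which the measurements above do not separate out.

```scala
object TimingSketch {
  // Runs body once and returns its result together with the elapsed milliseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  // Runs body `warmups` times without timing it, then times one more run.
  def timedAfterWarmup[A](warmups: Int)(body: => A): (A, Long) = {
    var i = 0
    while (i < warmups) {
      body
      i += 1
    }
    timed(body)
  }
}
```

With spark-core on the classpath, the database measurement could then be written as val (dataBaseSize, ms) = TimingSketch.timed(SizeEstimator.estimate(dataBase)).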

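As you can see, estimating a large object takes a considerable amount of time (about 44 seconds for roughly 2.6 GB here), so use this method with care. I spent a long time chasing a slowdown before finally tracing it to this call o(╥﹏╥)o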