java spark persist, hadoop – Where is my sparkDF.persist(DISK_ONLY) data stored?

For the short answer, we can look at the documentation for spark.local.dir:

Directory to use for “scratch” space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks. NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

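For a deeper understanding we can look at the code: a DataFrame (which is just a Dataset[Row]) is based on RDDs and uses the same persistence mechanism. The RDD delegates this to SparkContext, which marks it for persistence. The work is then actually handled by several classes in the org.apache.spark.storage package. First, BlockManager only manages the blocks of data to be persisted and the policy for doing so. It delegates the actual persistence to a DiskStore (when writing to disk, of course), which represents a high-level interface for writing and which in turn has a DiskBlockManager for the lower-level operations.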

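Hopefully you now know where to look, so we can move on and see where the data actually ends up and how we can configure it: DiskBlockManager calls the helper Utils.getConfiguredLocalDirs, which I will copy here for convenience (taken from version 2.2.1, the latest release at the time of writing):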

def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
  val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
  if (isRunningInYarnContainer(conf)) {
    // If we are in yarn mode, systems can have different disk layouts so we must set it
    // to what Yarn on this system said was available. Note this assumes that Yarn has
    // created the directories already, and that they are secured so that only the
    // user has access to them.
    getYarnLocalDirs(conf).split(",")
  } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
    conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
  } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
    conf.getenv("SPARK_LOCAL_DIRS").split(",")
  } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
    // Mesos already creates a directory per Mesos task. Spark should use that directory
    // instead so all temporary files are automatically cleaned up when the Mesos task ends.
    // Note that we don't want this if the shuffle service is enabled because we want to
    // continue to serve shuffle files after the executors that wrote them have already exited.
    Array(conf.getenv("MESOS_DIRECTORY"))
  } else {
    if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
      logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
        "spark.shuffle.service.enabled is enabled.")
    }
    // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
    // configuration to point to a secure directory. So create a subdirectory with restricted
    // permissions under each listed directory.
    conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
  }
}

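I believe the code is fairly self-explanatory and well commented (and it matches the documentation exactly): when running on Yarn, a specific policy relies on the storage of the Yarn containers; otherwise the SPARK_EXECUTOR_DIRS and SPARK_LOCAL_DIRS environment variables are honored if set; on Mesos, the Mesos sandbox is used (unless the shuffle service is enabled); and in all other cases the data goes to the location set under spark.local.dir, or to java.io.tmpdir (probably /tmp/) if that is unset.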
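
To make the order of precedence easier to experiment with, here is a minimal stand-alone Java sketch that mirrors the branches of the Scala helper quoted above. It is not Spark code: the class name LocalDirResolver, the method resolve, and the parameters inYarn and yarnLocalDirs (stand-ins for isRunningInYarnContainer and getYarnLocalDirs) are hypothetical and exist only for illustration.

```java
import java.io.File;
import java.util.Map;

/** Hypothetical model of the branch order in Utils.getConfiguredLocalDirs (not Spark code). */
final class LocalDirResolver {

    private LocalDirResolver() {}

    /**
     * @param env           environment variables visible to the process
     * @param conf          configuration entries such as spark.local.dir
     * @param inYarn        assumed stand-in for isRunningInYarnContainer(conf)
     * @param yarnLocalDirs assumed stand-in for getYarnLocalDirs(conf)
     */
    static String[] resolve(Map<String, String> env, Map<String, String> conf,
                            boolean inYarn, String yarnLocalDirs) {
        boolean shuffleServiceEnabled = Boolean.parseBoolean(
                conf.getOrDefault("spark.shuffle.service.enabled", "false"));
        if (inYarn) {
            // Yarn decides which directories are available.
            return yarnLocalDirs.split(",");
        } else if (env.get("SPARK_EXECUTOR_DIRS") != null) {
            return env.get("SPARK_EXECUTOR_DIRS").split(File.pathSeparator);
        } else if (env.get("SPARK_LOCAL_DIRS") != null) {
            return env.get("SPARK_LOCAL_DIRS").split(",");
        } else if (env.get("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
            // The Mesos sandbox is used only when the shuffle service is off.
            return new String[] { env.get("MESOS_DIRECTORY") };
        } else {
            // Last resort: spark.local.dir, falling back to the JVM temp directory.
            String fallback = System.getProperty("java.io.tmpdir");
            return conf.getOrDefault("spark.local.dir", fallback).split(",");
        }
    }

    public static void main(String[] args) {
        // Resolve against the real process environment and an empty configuration.
        String[] dirs = resolve(System.getenv(), Map.of(), false, "");
        System.out.println(String.join(", ", dirs));
    }
}
```

Calling resolve with SPARK_LOCAL_DIRS present in env and spark.local.dir present in conf returns the directories from the environment variable, which is the override behavior the documentation note describes.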

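So, if you are just playing around, the data is most likely stored under /tmp/; otherwise, it depends heavily on your environment and configuration.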
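
If you want to check a candidate directory yourself, the following stand-alone Java sketch may help. It rests on two assumptions that come from my reading of DiskBlockManager and are not stated above: that Spark creates a subdirectory whose name starts with "blockmgr-" under each local directory, and that persisted RDD blocks are files whose names start with "rdd_". The class name PersistedBlockFinder and its method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that looks for persisted block files under a local directory. */
final class PersistedBlockFinder {

    private PersistedBlockFinder() {}

    /** Returns files named rdd_* found under any blockmgr-* directory directly below root. */
    static List<Path> findRddBlocks(Path root) throws IOException {
        List<Path> blockManagerDirs;
        try (Stream<Path> children = Files.list(root)) {
            // Assumption: block manager directories are named "blockmgr-<id>".
            blockManagerDirs = children
                    .filter(Files::isDirectory)
                    .filter(p -> p.getFileName().toString().startsWith("blockmgr-"))
                    .collect(Collectors.toList());
        }
        List<Path> blocks = new ArrayList<>();
        for (Path dir : blockManagerDirs) {
            try (Stream<Path> files = Files.walk(dir)) {
                // Assumption: persisted RDD partitions are stored as "rdd_<rddId>_<partition>".
                files.filter(Files::isRegularFile)
                     .filter(p -> p.getFileName().toString().startsWith("rdd_"))
                     .forEach(blocks::add);
            }
        }
        Collections.sort(blocks);
        return blocks;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the JVM temp directory, the last fallback in the helper quoted earlier.
        Path root = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        findRddBlocks(root).forEach(System.out::println);
    }
}
```

Run it with the directory to inspect as the first argument while the Spark application is still alive. Directories owned by other users may not be readable, in which case the walk fails with an exception rather than skipping them.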
