1. Introduction

HDF is a common data format in remote sensing applications. Because the format is highly structured, I was stuck for quite a while on the question of how to process HDF files with Hadoop. I googled for solutions but never found one I was really satisfied with. I also consulted a post published by the HDF Group (link here), which outlines approaches for processing large, medium, and small HDF files with Hadoop. Following that recipe would no doubt solve the problem, but the approach felt rather involved and requires a fairly deep understanding of the HDF format, so it is not easy to implement. I therefore kept looking and eventually found another way, which is described in detail below.

2. The MapReduce Program

The key piece here is the netcdf (netCDF-Java) library, which is used to deserialize the HDF byte stream (netcdf library download address). Unlike the official HDF Java library, netcdf reads and writes HDF files in pure Java, and it supports several scientific data formats, including HDF4 and HDF5. The official HDF Java library, by contrast, still calls into C code under the hood to operate on HDF files. The Mapper class of the MapReduce job is shown below:

package example;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URI;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import ucar.ma2.ArrayShort;
import ucar.nc2.Dimension;
import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

public class ReadMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {

    @Override
    public void map(Text key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // The key delivered by WholeFileInputFormat is the file name.
        String fileName = key.toString();
        // Deserialize the HDF file from the raw bytes; copyBytes() returns exactly
        // getLength() bytes, whereas the deprecated get() may include buffer padding.
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", value.copyBytes());
        // Navigate to the Data_Fields group inside MOD_Grid_monthly_1km_VI.
        Group vegGroup = file.findGroup("MOD_Grid_monthly_1km_VI");
        Group dataGroup = (vegGroup == null) ? null : vegGroup.findGroup("Data_Fields");
        short[][] data = new short[1200][1200];
        if (dataGroup != null) {
            // Locate the 1_km_monthly_red_reflectance variable.
            Variable redVar = dataGroup.findVariable("1_km_monthly_red_reflectance");
            // Read the raster data held by redVar.
            ArrayShort.D2 dataArray = (ArrayShort.D2) redVar.read();
            List<Dimension> dimList = file.getDimensions();
            // Number of pixels in the y direction.
            Dimension ydim = dimList.get(0);
            // Number of pixels in the x direction.
            Dimension xdim = dimList.get(1);
            // Walk the whole image and copy out the pixel values.
            for (int i = 0; i < xdim.getLength(); i++) {
                for (int j = 0; j < ydim.getLength(); j++) {
                    data[i][j] = dataArray.get(i, j);
                }
            }
        }
        System.out.print(file.getDetailInfo());
        file.close();
    }
}

Note the NetcdfFile.openInMemory method used in the mapper: this static method builds an HDF file object directly from a byte[], which is what makes the deserialization of the HDF file possible. A sample driver program is shown below:

package example;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadMain {

    public boolean runJob(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // conf.set("mapred.job.tracker", Utils.JOBTRACKER);

        // Note the trailing slash, so the configuration file names concatenate
        // into valid paths.
        String rootPath = "/opt/hadoop-2.3.0/etc/hadoop/";
        conf.addResource(new Path(rootPath + "yarn-site.xml"));
        conf.addResource(new Path(rootPath + "core-site.xml"));
        conf.addResource(new Path(rootPath + "hdfs-site.xml"));
        conf.addResource(new Path(rootPath + "mapred-site.xml"));

        Job job = new Job(conf);
        job.setJobName("Job name:" + args[0]);
        job.setJarByClass(ReadMain.class);

        job.setMapperClass(ReadMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        // Each HDF file is read as a single, unsplit record.
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        boolean flag = job.waitForCompletion(true);
        return flag;
    }

    public static void main(String[] args) throws ClassNotFoundException,
            IOException, InterruptedException {
        String[] inputPaths = new String[] { "normalizeJob",
                "hdfs://192.168.168.101:9000/user/hduser/hdf/MOD13A3.A2005274.h00v10.005.2008079143041.hdf",
                "hdfs://192.168.168.101:9000/user/hduser/test/" };
        ReadMain test = new ReadMain();
        test.runJob(inputPaths);
    }
}
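Since NetcdfFile.openInMemory only needs a byte[], the deserialization step can be tried out locally before the job is ever submitted to the cluster. The short sketch below is an illustration only and is not part of the original post; the class name OpenInMemoryDemo and the local file path are placeholders:

package example;

import java.nio.file.Files;
import java.nio.file.Paths;

import ucar.nc2.NetcdfFile;

// Standalone sanity check: deserialize a local HDF4 granule from a byte[]
// exactly the way the mapper does with its BytesWritable value.
public class OpenInMemoryDemo {

    public static void main(String[] args) throws Exception {
        // Placeholder path: any local MOD13A3 HDF4 granule will do.
        byte[] raw = Files.readAllBytes(
                Paths.get("MOD13A3.A2005274.h00v10.005.2008079143041.hdf"));

        // The first argument is only a display name used in diagnostics.
        NetcdfFile file = NetcdfFile.openInMemory("hdf4", raw);
        try {
            // Print the same structure listing the mapper prints on the cluster.
            System.out.println(file.getDetailInfo());
        } finally {
            file.close();
        }
    }
}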

A few points about the MapReduce program are worth noting:

1. The job's input format is WholeFileInputFormat.class, which means the input data is not split: each HDF file is passed to a mapper as a whole. For details on that format, see my other post, "How to submit a Yarn job from a Java program"; it is not repeated here, but a minimal sketch is given after this list.

2. I used Yarn (Hadoop 2.3.0) to run the job. If you use an older Hadoop release, such as 1.2.0, simply delete the conf.addResource lines from the driver above.

3. The MapReduce program above only uses a map function; no reduce function is set.

4. The data used here is in HDF4 format; in principle, HDF5 data should be supported as well.
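For readers who do not want to follow the link in point 1, here is a minimal sketch of what a whole-file input format matching the mapper's Text/BytesWritable signature typically looks like. It follows the common "read the whole file as one record" pattern; the actual WholeFileInputFormat used in the referenced post may differ in details:

package example;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a whole-file input format: each HDF file becomes exactly one record,
// key = file name, value = the raw bytes of the file.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        // Never split an HDF file; it must be deserialized as a whole.
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    private static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file into memory and emit it as one record.
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            byte[] contents = new byte[(int) split.getLength()];
            FSDataInputStream in = null;
            try {
                in = fs.open(path);
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.getName());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() {
            // Nothing to clean up; the stream is closed in nextKeyValue().
        }
    }
}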

3. The Structure of the HDF Data

Because HDF data is highly structured, the netcdf library accesses specific datasets inside an HDF file through "label"-like names. Below is the structure information that netcdf reports for this HDF file (i.e. the output printed by file.getDetailInfo()):

Note that strings appearing in ReadMapper such as "MOD_Grid_monthly_1km_VI" and "Data_Fields" are taken directly from the structure information below.

netcdf D:/2005-274/MOD13A3.A2005274.h00v08.005.2008079142757.hdf {
 variables:
   char StructMetadata.0(32000);
   char CoreMetadata.0(40874);
   char ArchiveMetadata.0(6530);

 group: MOD_Grid_monthly_1km_VI {
  variables:
    short _HDFEOS_CRS;
      :Projection = "GCTP_SNSOID";
      :UpperLeftPointMtrs = -2.0015109354E7, 1111950.519667; // double
      :LowerRightMtrs = -1.8903158834333E7, -0.0; // double
      :ProjParams = 6371007.181, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0; // double
      :SphereCode = "-1";

  group: Data_Fields {
   dimensions:
     YDim = 1200;
     XDim = 1200;
   variables:
     short 1_km_monthly_NDVI(YDim=1200, XDim=1200);
       :long_name = "1 km monthly NDVI";
       :units = "NDVI";
       :valid_range = -2000S, 10000S; // short
       :_FillValue = -3000S; // short
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_EVI(YDim=1200, XDim=1200);
       :long_name = "1 km monthly EVI";
       :units = "EVI";
       :valid_range = -2000S, 10000S; // short
       :_FillValue = -3000S; // short
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_VI_Quality(YDim=1200, XDim=1200);
       :_Unsigned = "true";
       :long_name = "1 km monthly VI Quality";
       :units = "bit field";
       :valid_range = 0S, -2S; // short
       :_FillValue = -1S; // short
       :Legend = "\n\t Bit Fields Description (Right to Left): \n\t[0-1] : MODLAND_QA [2 bit range]\n\t\t 00: VI produced, good quality \n\t\t 01: VI produced, but check other QA \n\t\t 10: Pixel produced, but most probably cloudy \n\t\t 11: Pixel not produced due to other reasons than clouds \n\t[2-5] : VI usefulness [4 bit range]  \n\t\t 0000: Highest quality  \n\t\t 0001: Lower quality  \n\t\t 0010..1010: Decreasing quality  \n\t\t 1100: Lowest quality  \n\t\t 1101: Quality so low that it is not useful \n\t\t 1110: L1B data faulty \n\t\t 1111: Not useful for any other reason/not processed \n\t[6-7] : Aerosol quantity [2 bit range] \n\t\t 00: Climatology \n\t\t 01: Low \n\t\t 10: Average \n\t\t 11: High (11) \n\t[8] : Adjacent cloud detected; [1 bit range] \n\t\t 1: Yes \n\t\t 0: No \n\t[9] : Atmosphere BRDF correction performed [1 bit range] \n\t\t 1: Yes \n\t\t 0: No \n\t[10] : Mixed clouds  [1 bit range] \n\t\t 1: Yes \n\t\t 0: No \n\t[11-13] : Land/Water Flag [3 bit range]   \n\t\t 000: Shallow ocean \n\t\t 001: Land (Nothing else but land) \n\t\t 010: Ocean coastlines and lake shorelines \n\t\t 011: Shallow inland water \n\t\t 100: Ephemeral water \n\t\t 101: Deep inland water \n\t\t 110: Moderate or continental ocean \n\t\t 111: Deep ocean \n\t[14] : Possible snow/ice [1 bit range] \n\t\t 1: Yes \n\t\t 0: No \n\t[15] : Possible shadow [1 bit range] \n\t\t 1: Yes \n\t\t 0: No \n";
     short 1_km_monthly_red_reflectance(YDim=1200, XDim=1200);
       :long_name = "1 km monthly red reflectance";
       :units = "reflectance";
       :valid_range = 0S, 10000S; // short
       :_FillValue = -1000S; // short
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_NIR_reflectance(YDim=1200, XDim=1200);
       :long_name = "1 km monthly NIR reflectance";
       :units = "reflectance";
       :valid_range = 0S, 10000S; // short
       :_FillValue = -1000S; // short
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_blue_reflectance(YDim=1200, XDim=1200);
       :long_name = "1 km monthly blue reflectance";
       :units = "reflectance";
       :valid_range = 0S, 10000S; // short
       :_FillValue = -1000S; // short
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_MIR_reflectance(YDim=1200, XDim=1200);
       :long_name = "1 km monthly MIR reflectance";
       :units = "reflectance";
       :valid_range = 0S, 10000S; // short
       :_FillValue = -1000S; // short
       :Legend = "\n\t The MIR band saved in the VI product is MODIS band 7 \n\t\t Bandwidth : 2105-2155 nm \n\t\t Band center: 2130 nm \n";
       :scale_factor = 10000.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_view_zenith_angle(YDim=1200, XDim=1200);
       :long_name = "1 km monthly view zenith angle";
       :units = "degrees";
       :valid_range = -9000S, 9000S; // short
       :_FillValue = -10000S; // short
       :scale_factor = 100.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_sun_zenith_angle(YDim=1200, XDim=1200);
       :long_name = "1 km monthly sun zenith angle";
       :units = "degrees";
       :valid_range = -9000S, 9000S; // short
       :_FillValue = -10000S; // short
       :scale_factor = 100.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     short 1_km_monthly_relative_azimuth_angle(YDim=1200, XDim=1200);
       :long_name = "1 km monthly relative azimuth angle";
       :units = "degrees";
       :valid_range = -3600S, 3600S; // short
       :_FillValue = -4000S; // short
       :scale_factor = 10.0; // double
       :scale_factor_err = 0.0; // double
       :add_offset = 0.0; // double
       :add_offset_err = 0.0; // double
       :calibrated_nt = 5; // int
     byte 1_km_monthly_pixel_raliability(YDim=1200, XDim=1200);
       :long_name = "1 km monthly pixel raliability";
       :units = "rank";
       :valid_range = 0B, 3B; // byte
       :_FillValue = -1B; // byte
       :Legend = "\n\t Rank Keys: \n\t\t[-1]:  Fill/No Data-Not Processed. \n\t\t [0]:  Good data     - Use with confidence \n\t\t [1]:  Marginal data - Useful, but look at other QA information \n\t\t [2]:  Snow/Ice      - Target covered with snow/ice\n\t\t [3]:  Cloudy        - Target not visible, covered with cloud \n";
  }
 }

 // global attributes:
 :HDFEOSVersion = "HDFEOS_V2.9";
 :_History = "Direct read of HDF4 file through CDM library; HDF-EOS StructMetadata information was read";
 :HDF4_Version = "4.2.1 (NCSA HDF Version 4.2 Release 1-post3, January 27, 2006)";
 :featureType = "GRID";
}
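The same hierarchy can also be discovered programmatically instead of reading the full getDetailInfo() dump. The sketch below is only an illustration (the class name PrintHdfStructure and the local file path are made up for this example); it walks the groups and variables so that label strings such as "MOD_Grid_monthly_1km_VI" and "Data_Fields" can be located:

package example;

import ucar.nc2.Group;
import ucar.nc2.NetcdfFile;
import ucar.nc2.Variable;

// Illustrative helper: recursively print the group/variable hierarchy of an HDF file.
public class PrintHdfStructure {

    public static void main(String[] args) throws Exception {
        // Placeholder path: any local MOD13A3 granule will do.
        NetcdfFile file = NetcdfFile.open("MOD13A3.A2005274.h00v08.005.2008079142757.hdf");
        try {
            printGroup(file.getRootGroup(), "");
        } finally {
            file.close();
        }
    }

    private static void printGroup(Group group, String indent) {
        // Print the variables defined directly in this group.
        for (Variable v : group.getVariables()) {
            System.out.println(indent + "variable: " + v.getFullName());
        }
        // Recurse into nested groups.
        for (Group sub : group.getGroups()) {
            System.out.println(indent + "group: " + sub.getShortName());
            printGroup(sub, indent + "  ");
        }
    }
}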

Reposted from: https://www.cnblogs.com/hrhguanli/p/4485773.html
