Hadoop

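1. Overview

1.1 The concept of big data

Big data is an information asset characterized by high volume, high growth rate, and variety, which requires new processing models to deliver stronger decision-making, insight discovery, and process optimization. (Definition from the research firm Gartner.)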

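1.2 Problems big data faces

Storage: a single machine has limited capacity, so how can massive amounts of data be stored?

Analysis: how can computation over the data be completed within a reasonable time and at a reasonable cost?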

1.3 Characteristics of big data

The 4V characteristics: Volume (amount of data), Velocity (timeliness), Variety (diversity), and Value.

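1) Large data volume

B, KB, MB, GB, TB, PB, EB, ZB, ...

Cloud storage offerings such as Baidu Cloud, Tencent Weiyun, OneDrive, and Google Drive show that existing hardware resources can support very large volumes of data.

(https://img-blog.csdnimg.cn/20190820180540918.jpg)

The data volume ultimately comes from people: advancing technology, a better material life, and the pursuit of self-expression have driven the Internet era to generate ever more data.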


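Online shopping platforms, video sites, fitness apps, payment and finance apps, social apps, and search engines all collect large amounts of data while people use them. An Internet company with over a hundred million daily active users can easily gather more than 1 PB of data in a single day.

2) Data timeliness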


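Tmall

On Singles' Day (Double 11), Tmall's total transaction volume passed 10 billion yuan 2 minutes and 5 seconds after the opening, breaking the previous year's record for the fastest 10 billion. In under 1 hour 17 minutes it exceeded the full-day total of Double 11 in 2015, and in under 15 hours 50 minutes it exceeded the previous year's full-day total.

JD.com

Once the 618 promotion ended, attention turned to the sales figures of each e-commerce platform. According to the published figures, JD.com's cumulative transaction volume for 618 in 2018 was 159.2 billion yuan.

Large amounts of data are generated in a short time, so the collector has to capture and store the data quickly and extract useful information from it soon afterwards.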

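3) Data variety

(1) Variety of storage types

Structured data: SQL tables, text, and so on

Unstructured data: video, audio, images

(2) Variety of analysis dimensions

Geographic location: from Beijing, Shanghai, ...

Device information: PC, phone, watch, fitness band

Personal interests: beauty, face masks, graphics cards, digital gadgets, games

Social network: A may know B and C, and B may know C

Phone numbers: 110, 10086, 10010

Network identity: device MAC address + phone number + IP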

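4) Data value

Police: only care about whether a violation has occurred somewhere

AI research: only cares about whether the data helps AI (AlphaGo)

Extracting the useful data from the massive whole is therefore the key step, and it is the first thing data analysis has to do: data denoising (data cleaning, data preprocessing).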

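1.4 Application scenarios

1) Personalized recommendation

Recommendations based on user preferences, with data shared across platforms.

Example: Xiao Ming searches Baidu for a treadmill. When he opens Taobao, his personal recommendations already include treadmills; when he opens JD.com, the search box suggests "treadmill"; on Douyin he is shown an ad for a treadmill on JD.com; and the Xiaomi store pushes a message: "Would you like to check out the Mijia walking pad?" (√ / ×)

2) Risk control

Real-time stream processing over big data, backed by user behavior and behavior models, decides whether an operation is normal.

Alipay: if someone steals an account, changes the login and payment passwords, logs in from an unfamiliar device, and then attempts a transfer, Alipay blocks the operation and shows a risk warning.

3) Cost forecasting

Big data analysis yields the sales cost and profit of products over recent years. Merchants and enterprises can adjust their product strategy based on these figures to maximize profit.

4) Climate forecasting

Using data collected today together with data from previous years, forecast climate change over the next few years, or infer weather anomalies in the distant past.

5) Artificial intelligence

Self-driving cars: Baidu, Google, Tesla

Intelligent assistants: XiaoAI, Siri, Google Assistant, 边小溪, Xiaoice, Cortana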

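1.5 Career directions

1. Business areas
Recommendation systems for e-commerce, intelligent advertising systems, expert systems, smart cities, intelligent transportation, financial "brains", smart healthcare, disaster early warning, ...
2. Job roles
Big data operations engineer, big data development engineer (real-time computing, data warehousing, ETL, basic data mining), data analyst (algorithms)

In summary: large data volume / timeliness (fast processing) / variety (dimensions) / valuable data (after denoising).

1.6 Distributed systems

To solve the practical problems of storage and analysis, big data has to be stored and analyzed on a cluster: first, to have enough capacity for storage, and second, to analyze the data efficiently.

An architecture that spans machines, processes, or virtual machines is usually called a distributed architecture. Vertical scaling (upgrading the hardware of a single machine) is expensive and its cost is hard to control, whereas horizontal scaling (adding machines) is cheaper and keeps the relationship between investment and output close to linear.

With the hardware resources in place, how is the software side implemented?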

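2. Hadoop

https://blog.csdn.net/lfq1532632051/article/details/53219558

Hadoop is a distributed solution that Yahoo split out of the Nutch project (a Java-based crawler framework) in 2006. It was modeled on Google's GFS and MapReduce papers, and the versions released at that time are known as Hadoop 1.x. In 2010 Yahoo upgraded Hadoop again, with the goal of optimizing its MapReduce framework and making Hadoop easier to use: with only a little configuration, users can store massive amounts of data and analyze large data sets. Hadoop is a distributed-system infrastructure framework developed under the Apache Software Foundation.

HDFS: Hadoop Distributed File System

MapReduce: the distributed computing framework in Hadoop, which analyzes and processes massive amounts of data in parallel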

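2.1 The Hadoop ecosystem

HDFS: distributed storage system

MapReduce: parallel computing framework

HBase: a NoSQL database built on top of HDFS (a true storage solution for massive data)

Hive: a SQL parsing engine that translates SQL into MapReduce jobs and submits them to the MapReduce framework

Flume: a distributed log collection system, used to collect massive amounts of data and store it in HDFS or HBase

Kafka: a distributed messaging system that decouples distributed systems and buffers massive amounts of data

ZooKeeper: a distributed coordination service, used as a service registry and configuration center and for leader election, status detection, and distributed locks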

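2.2 Big data analysis solutions

MapReduce: a disk-based, offline, static batch processing framework with high latency (30 minutes or more)

Spark: a memory-based static batch processing framework for near-real-time (offline) workloads, roughly 10 to 100 times faster than MapReduce

Storm, Spark Streaming, Flink, Kafka Streams: real-time stream processing frameworks that process individual records with millisecond-level latency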

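3. HDFS

3.1 Installation (pseudo-distributed cluster)

For testing and learning, a pseudo-distributed single-machine setup is used for now; the production cluster is covered later.

1) Install 64-bit CentOS (Intel virtualization technology must be enabled)

2) Install JDK 8

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Download Java 8 for 64-bit Linux
and extract it to a directory of your choice.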

3) Configure environment variables

(1) Configure user variables

vi  /root/.bashrc
--------------------------
export JAVA_HOME=/home/java/jdk1.8.0_181   # Java installation path on Linux
export CLASSPATH=.
export PATH=$PATH:$JAVA_HOME/bin           # remember to add $JAVA_HOME/bin to PATH

(2) Configure the hostname and the hostname-to-IP mapping in `/etc/hosts`

vi /etc/sysconfig/network
--------------------------
NETWORKING=yes
HOSTNAME=CentOS  # hostname | domain name

vi /etc/hosts
--------------------------
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.169.139 CentOS

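(3) Turn off the firewall

service iptables stop    # stop the service
chkconfig iptables off   # disable start on boot

Distributed services may need to call one another, so the firewall is usually turned off to keep communication working.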

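(4) Configure passwordless SSH login for the host

SSH is short for Secure Shell. It is a security protocol built on the application layer, designed to secure remote login sessions and other network services.

Password-based authentication: uses a user name and password.

Key-based authentication: relies on a key pair. You create a pair of keys for yourself and place the public key on the server you want to access. When you connect to the SSH server, the client software sends a request asking to authenticate with your key. On receiving the request, the server looks up your public key in your home directory on that server and compares it with the public key you sent. If the two match, the server encrypts a "challenge" with the public key and sends it to the client software. The client decrypts the challenge with your private key and sends the result back to the server.

[Figure: passwordless SSH login (assets/ssh免密登录.png)]

ssh-keygen -t rsa               # generate the key pair
ssh-copy-id HadoopNode00        # copy the public key to the target host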
4) Install and configure Hadoop HDFS

Reference: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

(1) Extract the archive to a directory of your choice

(2) Configure the Hadoop environment variables

vi  /root/.bashrc
--------------------------
export JAVA_HOME=/home/java/jdk1.8.0_181/  # Java installation path on Linux
export CLASSPATH=.
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0/
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Third-party products such as HBase, Hive, Flume, and Spark depend on the HADOOP_HOME environment variable: when they integrate with Hadoop, they read HADOOP_HOME to locate the Hadoop installation.

(3) Configure etc/hadoop/core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://CentOS:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
</property>

(4) Configure etc/hadoop/hdfs-site.xml

<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
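The `${user.name}` placeholder in hadoop.tmp.dir is filled in when the configuration is read; `user.name` is a standard Java system property, which is why the directory ends in `hadoop-root` when HDFS runs as root. The sketch below imitates that substitution with a regular expression. It is a simplified stand-in, not Hadoop's own code, and the class name `ConfigExpansionSketch` is made up for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mimics ${name} substitution in a configuration value.
class ConfigExpansionSketch {
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replaces each ${name} with its value from props; unknown names are left untouched.
    static String expand(String value, Map<String, String> props) {
        Matcher matcher = VARIABLE.matcher(value);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = props.getOrDefault(matcher.group(1), matcher.group(0));
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = Collections.singletonMap("user.name", "root");
        System.out.println(expand("/usr/hadoop-2.6.0/hadoop-${user.name}", props));
    }
}
```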

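5) Start HDFS

(1) Format the NameNode

The NameNode must be formatted before HDFS is started for the first time.

hdfs namenode -format    # with the environment variables set, the hdfs command is found from any directory

After a successful format, the following directory structure appears:

[root@CentOS ~]# tree /usr/hadoop-2.6.0/hadoop-root/
/usr/hadoop-2.6.0/hadoop-root/
└── dfs
    └── name
        └── current
            ├── fsimage_0000000000000000000
            ├── fsimage_0000000000000000000.md5
            ├── seen_txid
            └── VERSION

(2) Start the HDFS service

start-dfs.sh    # start HDFS; paste into the command line and press Enter
jps             # list the Java processes; DataNode, NameNode, and SecondaryNameNode should normally appear
stop-dfs.sh     # stop the HDFS processes

Add the virtual machine's hostname mapping to the hosts file on Windows, located in C:\Windows\System32\drivers\etc

The web UI is then available in a browser at http://<virtual machine IP or hostname>:50070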

3.2 HDFS operations

1) HDFS Shell

Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] <path> ...]
    [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] [-x] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
    [-help [cmd ...]]
    [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]

[root@CentOS ~]# hdfs dfs -help
Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-help [cmd ...]]
    [-ls [-d] [-h] [-R] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] <localsrc> ... <dst>]
    [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-tail [-f] <file>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-usage [cmd ...]]

(1) Upload a file

[root@HadoopNode00 hadoop-2.6.0]# hadoop fs -put /root/install.log /

(2) Download a file

[root@HadoopNode00 hadoop-2.6.0]# hadoop fs -get /install.log  /root/1.log
[root@HadoopNode00 hadoop-2.6.0]# ls  /root
1.log  anaconda-ks.cfg  install.log  install.log.syslog

(3) Delete a file

[root@HadoopNode00 hadoop-2.6.0]# hadoop fs -rm /install.log
19/05/19 07:59:17 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /install.log

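(4) Enable the HDFS trash

etc/hadoop/core-site.xml

<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>

fs.trash.interval is given in minutes. With a value of 1, a deleted file is first moved to the trash and is permanently removed by the system about one minute later, which guards against accidental deletion.

[root@CentOS ~]# hdfs dfs -rm -r -f /bb.log
19/01/03 12:27:08 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://CentOS:9000/bb.log' to trash at: hdfs://CentOS:9000/user/root/.Trash/Current
# Without -skipTrash, the file is moved to the trash and the system prints the message above
[root@CentOS ~]# hdfs dfs -rm -r -f -skipTrash /aa.log
Deleted /aa.log
# With -skipTrash, the file is deleted immediately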

2) Working with HDFS through the Java API

(1) Maven dependencies

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>

Note: the Hadoop dependency versions must match the Hadoop version of your environment exactly.

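(2) Hadoop environment setup on Windows

Extract the Hadoop distribution to C:/ (any other directory also works)
Copy winutils.exe and hadoop.dll into the bin directory of that Hadoop installation
Set the HADOOP_HOME environment variable on Windows
Restart the IDE (IDEA); otherwise it will not pick up HADOOP_HOME
On the Windows host, map the CentOS hostname to its IP address

C:\Windows\System32\drivers\etc\hosts

192.168.134.208 CentOS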

(3) Writes fail because of insufficient HDFS permissions?

org.apache.hadoop.security.AccessControlException: Permission denied: user=HIAPAD, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
...

Solutions

(1) Option 1

etc/hadoop/hdfs-site.xml

<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
</property>

This disables HDFS file permission checking. Restart the HDFS service after making the change.

(2) Option 2

-DHADOOP_USER_NAME=root

Pass this as a JVM argument, in the form java -Dname=value MainClass

(3) Option 3

System.setProperty("HADOOP_USER_NAME", "root");
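Options 2 and 3 both work by telling the Hadoop client which user name to present. The sketch below shows one plausible lookup order, assuming simple (non-Kerberos) authentication: an environment variable named HADOOP_USER_NAME first, then the system property of the same name, then the operating-system login. The order and the class `HadoopUserSketch` are assumptions for illustration; check the Hadoop version in use before relying on them.

```java
// Hypothetical model of how the effective HDFS user name could be resolved.
class HadoopUserSketch {
    // Returns the first non-empty candidate: environment value, system property, OS user.
    static String effectiveUser(String envValue, String sysPropValue, String osUser) {
        if (envValue != null && !envValue.isEmpty()) {
            return envValue;
        }
        if (sysPropValue != null && !sysPropValue.isEmpty()) {
            return sysPropValue;
        }
        return osUser;
    }

    public static void main(String[] args) {
        String user = effectiveUser(
                System.getenv("HADOOP_USER_NAME"),
                System.getProperty("HADOOP_USER_NAME"),
                System.getProperty("user.name"));
        System.out.println("effective user: " + user);
    }
}
```

With the stack trace above, the OS login HIAPAD was used because neither override was set; setting the property to root makes the client act as the owner of `/`.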

(4) Common operations

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.io.*;

import static org.junit.Assert.*;

public class TestHDFSDemo {
    private FileSystem fileSystem;
    private Configuration conf;

    @Before
    public void before() throws IOException {
        conf = new Configuration();
        // Option 1: set the parameters directly in code
        // conf.set("fs.defaultFS", "hdfs://Hadoop01:8020");
        // Option 2: load the configuration files from the classpath
        conf.addResource("core-site.xml");
        conf.addResource("hdfs-site.xml");
        fileSystem = FileSystem.newInstance(conf);
    }

    @Test
    public void testConfig() {
        String value = conf.get("dfs.replication");
        System.out.println(value);
    }

    @Test
    public void testUpload01() throws IOException {
        String file = "C:\\Users\\HIAPAD\\Desktop\\SpringBoot启动原理.pdf";
        Path dst = new Path("/demo/access/springBoot.pdf");
        InputStream is = new FileInputStream(file);
        // Print a dot each time the client reports write progress
        OutputStream os = fileSystem.create(dst, new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        IOUtils.copyBytes(is, os, 1024, true); // true: close both streams when done
    }

    @Test
    public void testUpload02() throws IOException {
        Path src = new Path("C:\\Users\\HIAPAD\\Desktop\\SpringBoot启动原理.pdf");
        Path dst = new Path("/springBoot1.pdf");
        fileSystem.copyFromLocalFile(src, dst);
    }

    @Test
    public void testDownload01() throws IOException {
        String file = "C:\\Users\\HIAPAD\\Desktop\\SpringBoot启动原理1.pdf";
        Path dst = new Path("/springBoot.pdf");
        OutputStream os = new FileOutputStream(file);
        InputStream is = fileSystem.open(dst);
        IOUtils.copyBytes(is, os, 1024, true);
    }

    @Test
    public void testDownload02() throws IOException {
        Path dst = new Path("C:\\Users\\HIAPAD\\Desktop\\SpringBoot启动原理3.pdf");
        Path src = new Path("/springBoot1.pdf");
        // fileSystem.copyToLocalFile(src, dst);
        // Arguments: delSrc = false, useRawLocalFileSystem = true
        fileSystem.copyToLocalFile(false, src, dst, true);
    }

    @Test
    public void testDelete() throws IOException {
        Path src = new Path("/user");
        fileSystem.delete(src, true); // true: delete subdirectories recursively
    }

    @Test
    public void testExists() throws IOException {
        Path src = new Path("/springBoot1.pdf");
        boolean exists = fileSystem.exists(src);
        assertTrue(exists);
    }

    @Test
    public void testMkdir() throws IOException {
        Path src = new Path("/demo/access");
        boolean exists = fileSystem.exists(src);
        if (!exists) {
            fileSystem.mkdirs(src);
        }
    }

    @Test
    public void testListFiles() throws IOException {
        Path src = new Path("/");
        // true: list files recursively
        RemoteIterator<LocatedFileStatus> files = fileSystem.listFiles(src, true);
        while (files.hasNext()) {
            LocatedFileStatus file = files.next();
            System.out.println(file.getPath() + " " + file.isFile() + " " + file.getLen());
            BlockLocation[] locations = file.getBlockLocations();
            for (BlockLocation location : locations) {
                System.out.println("offset:" + location.getOffset() + ",length:" + location.getLength());
            }
        }
    }

    @Test
    public void testDeleteWithTrash() throws IOException {
        Trash trash = new Trash(fileSystem, conf);
        Path dst = new Path("/springBoot1.pdf");
        trash.moveToTrash(dst);
    }

    @After
    public void after() throws IOException {
        fileSystem.close();
    }
}

3) A practical application

Simple log collection with scheduled upload

package hdfs;

import io.gjf.runtime.QuartzManager;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.impl.StdSchedulerFactory;

public class GetLog {
    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        QuartzManager quartzManager = new QuartzManager();
        quartzManager.setScheduler(scheduler);
        // Quartz is a job scheduling framework; here it runs LogJob periodically.
        // The last argument is a cron expression that defines when LogJob runs.
        // "30 */1 * * * ?" means: at second 30 of every minute.
        // (Line comments are used because "*/" inside a block comment would end it early.)
        quartzManager.addJob("testJob01", "Group01", "Group01", "Group01", LogJob.class, "30 */1 * * * ?");
    }
}

package hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

import java.io.File;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogJob implements Job {
    @Override
    public void execute(JobExecutionContext jobExecutionContext) {
        System.setProperty("HADOOP_USER_NAME", "root");
        Configuration configuration = new Configuration();
        // Load the configuration files before creating the FileSystem so that they take effect
        configuration.addResource("core-site.xml");
        configuration.addResource("hdfs-site.xml");

        // Pick the log file written two minutes ago.
        // Note: "hh" is the 12-hour clock; use "HH" if the log files are named with 24-hour time.
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd-hh-mm");
        String format = simpleDateFormat.format(new Date(new Date().getTime() - 120000));
        String filename = "access.tmp" + format + ".log";
        System.out.println(filename);

        File file = new File("E:\\home\\logs\\" + filename);
        if (!file.exists()) {
            System.out.println("File does not exist, nothing to upload");
            return;
        }

        // try-with-resources closes the FileSystem even if the upload fails
        try (FileSystem fileSystem = FileSystem.newInstance(configuration)) {
            System.out.println("Starting upload");
            Path target = new Path("/log/" + filename);
            fileSystem.copyFromLocalFile(new Path(file.getAbsolutePath()), target);
            if (fileSystem.exists(target)) {
                System.out.println("Upload succeeded");
            } else {
                System.out.println("Upload failed");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
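To make the schedule "at second 30 of every minute" concrete, the sketch below computes the next firing time for that one expression with java.time. It does not parse cron syntax and is not how Quartz works internally; `CronSketch` and `nextFire` are hypothetical names used only for this illustration.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for the single schedule "30 */1 * * * ?" (second 30 of every minute).
class CronSketch {
    // Returns the first instant strictly after 'now' whose seconds field is 30.
    static LocalDateTime nextFire(LocalDateTime now) {
        LocalDateTime candidate = now.truncatedTo(ChronoUnit.MINUTES).withSecond(30);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusMinutes(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        System.out.println("now:       " + now);
        System.out.println("next fire: " + nextFire(now));
    }
}
```

A job that fires at 08:00:30 and looks two minutes back, as LogJob does, therefore targets the file named for minute 07:58.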

3.3 HDFS Architecture

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

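[Figure: HDFS architecture (assets/hdfsarchitecture.png)]

NameNode: stores the system's metadata (data that describes data), such as the file namespace and the block-to-DataNode mapping, and manages the DataNodes

DataNode: a node that stores data blocks; it serves clients' block read and write requests and reports its block information to the NameNode

block: a data block, the smallest unit into which a file is split. The default block size is 128 MB and can be changed with dfs.blocksize; each block has a default replication factor of 3, configured through dfs.replication

rack: racks are used to arrange the storage nodes physically, which helps optimize storage and computation. To view the rack topology:

[root@CentOS ~]# hdfs dfsadmin -printTopology
Rack: /default-rack
   192.168.169.139:50010 (CentOS)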

1) What is a block?

<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), Or provide complete size in bytes (such as 134217728 for 128 MB).</description>
</property>

The default configuration file sets the block size to 128 MB. A block is the smallest unit into which a file is split.
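As a worked example of splitting by block size, the sketch below computes how many blocks a file occupies, how large the last block is, and how much raw disk space the replicas take. The 300 MB file and the class name `BlockSplitSketch` are made up for the illustration; the last block of a file only occupies its actual length, not a full 128 MB.

```java
// Hypothetical arithmetic helper for block counts and replicated storage.
class BlockSplitSketch {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file (ceiling division).
    static long blockCount(long fileBytes, long blockSize) {
        return (fileBytes + blockSize - 1) / blockSize;
    }

    // Size of the final block, which may be smaller than blockSize.
    static long lastBlockBytes(long fileBytes, long blockSize) {
        if (fileBytes == 0) {
            return 0;
        }
        long remainder = fileBytes % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    // Raw disk usage across the cluster once every block is replicated.
    static long rawStorageBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 300 * MB;
        long block = 128 * MB;
        System.out.println("blocks:      " + blockCount(file, block));
        System.out.println("last block:  " + lastBlockBytes(file, block) / MB + " MB");
        System.out.println("raw storage: " + rawStorageBytes(file, 3) / MB + " MB");
    }
}
```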


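(1) Why is the default block size 128 MB?

The default is 128 MB in 2.x and was 64 MB in 1.x.

Hardware limit: an inexpensive PC mechanical hard disk typically tops out at around 100 MB/s.

Software optimization: the setup is generally considered optimal when the seek time is about one hundredth of the transfer time.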
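The two rules of thumb above can be combined into a short calculation. The sketch assumes a disk seek time of about 10 ms, a figure that is not given in the original and is used here only to make the arithmetic concrete; `BlockSizeRationale` is a hypothetical class name.

```java
// Hypothetical back-of-the-envelope calculation for the default block size.
class BlockSizeRationale {
    // Block size (in MB) for which seek time is 1/ratio of the transfer time.
    static long idealBlockMb(double seekMillis, double ratio, double diskMbPerSecond) {
        double transferSeconds = seekMillis * ratio / 1000.0;
        return Math.round(transferSeconds * diskMbPerSecond);
    }

    // Smallest power of two that is greater than or equal to value.
    static long roundUpToPowerOfTwo(long value) {
        long power = 1;
        while (power < value) {
            power <<= 1;
        }
        return power;
    }

    public static void main(String[] args) {
        // Assumed: 10 ms seek, seek = 1/100 of transfer, 100 MB/s disk throughput
        long ideal = idealBlockMb(10, 100, 100);
        System.out.println("ideal block size:      " + ideal + " MB");
        System.out.println("rounded to power of 2: " + roundUpToPowerOfTwo(ideal) + " MB");
    }
}
```

Under these assumptions a block should take about one second to transfer, which is roughly 100 MB, and the nearest power of two above that is 128 MB.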

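(2) Can the block size be set arbitrarily?

No. If the block size is too small, a cluster ends up with millions of blocks, which increases seek time and lowers efficiency.

If it is too large, reading and writing each block takes longer, and efficiency is again low.

The right block size therefore depends on the hardware of the machines.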

2) What is a rack?

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one node in the local rack, another on a different node in the local rack, and the last on a different node in a different rack. This policy cuts the inter-rack write traffic which generally improves write performance. The chance of rack failure is far less than that of node failure; this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

The current, default replica placement policy described here is a work in progress.
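The placement rule quoted above can be sketched as a small function over a rack-to-nodes map. This follows the wording of the quoted passage only and always picks the first suitable node, so it is a deterministic toy; a real NameNode weighs more factors when choosing targets. The class `ReplicaPlacementSketch` and the rack and node names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the three-replica placement described in the quoted text.
class ReplicaPlacementSketch {
    // Chooses up to three nodes: the local node, another node on the local rack,
    // and one node on a different rack.
    static List<String> place(Map<String, List<String>> racks, String localRack, String localNode) {
        List<String> chosen = new ArrayList<>();
        chosen.add(localNode);
        for (String node : racks.get(localRack)) {
            if (!node.equals(localNode)) {
                chosen.add(node);
                break;
            }
        }
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            if (!rack.getKey().equals(localRack) && !rack.getValue().isEmpty()) {
                chosen.add(rack.getValue().get(0));
                break;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("/rack1", Arrays.asList("dn1", "dn2"));
        racks.put("/rack2", Arrays.asList("dn3", "dn4"));
        System.out.println(place(racks, "/rack1", "dn1"));
    }
}
```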


3) HDFS write flow

[Figure: HDFS write data flow (assets/1558251177709.png)]

4) HDFS read flow

[Figure: HDFS read data flow (assets/1558251894947.png)]

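5) The relationship between the NameNode and the SecondaryNameNode (important)

fsimage: a persisted backup of the metadata, which is loaded into memory

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html

edits: the edits file records file creation and update operations, which improves efficiency

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html

When the NameNode starts, it has to load both edits (the log file) and fsimage (the image file).

This is why the NameNode must be formatted before its first start.

When a user uploads a file or performs any other file operation, the change is written to the edits file, so edits and fsimage together always describe the latest state.

If users keep performing operations, the edits file grows very large, and the next startup takes a long time because both files have to be loaded.

The SecondaryNameNode was introduced to solve this problem: it fetches the current edits and fsimage files from the NameNode, merges them into a new fsimage, and uploads the new fsimage back to the NameNode.

This raises another problem: users may need to perform operations while the SecondaryNameNode is doing the merge, and writing those changes straight into the edits file being merged would corrupt the data. The solution is to write them to a separate file named edits_inprogress.

Note that the SecondaryNameNode is an optimization for the NameNode, not a hot standby for it.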
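The merge of fsimage and edits can be pictured as replaying a log over a snapshot. The sketch below models the namespace as a map from path to file length and the edits as simple text records; both formats are invented for the illustration and bear no relation to the real on-disk formats. `CheckpointSketch` is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model: fsimage is a snapshot, edits is a log replayed on top of it.
class CheckpointSketch {
    // Applies "ADD <path> <length>" and "DELETE <path>" records to a copy of the snapshot.
    static Map<String, Long> merge(Map<String, Long> fsimage, List<String> edits) {
        Map<String, Long> merged = new TreeMap<>(fsimage);
        for (String edit : edits) {
            String[] parts = edit.split(" ");
            if (parts[0].equals("ADD")) {
                merged.put(parts[1], Long.parseLong(parts[2]));
            } else if (parts[0].equals("DELETE")) {
                merged.remove(parts[1]);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> fsimage = new TreeMap<>();
        fsimage.put("/install.log", 8815L);
        List<String> edits = Arrays.asList("ADD /bb.log 120", "DELETE /install.log");
        // The result is the new fsimage; the replayed edits can then be discarded.
        System.out.println(merge(fsimage, edits));
    }
}
```

The longer the edits list, the longer the replay takes, which is why merging periodically keeps NameNode startup short.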

[Figure: SecondaryNameNode checkpoint process (assets/secondary.png)]

6) NameNode Checkpoint Node

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Checkpoint_Node

NameNode persists its namespace using two files: fsimage, which is the latest checkpoint of the namespace and edits, a journal (log) of changes to the namespace since the checkpoint. When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata. The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.

The Checkpoint node periodically creates checkpoints of the namespace. It downloads fsimage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode. The Checkpoint node usually runs on a different machine than the NameNode since its memory requirements are on the same order as the NameNode. The Checkpoint node is started by bin/hdfs namenode -checkpoint on the node specified in the configuration file.

The location of the Checkpoint (or Backup) node and its accompanying web interface are configured via the dfs.namenode.backup.address and dfs.namenode.backup.http-address configuration variables.

The start of the checkpoint process on the Checkpoint node is controlled by two configuration parameters.

dfs.namenode.checkpoint.period, set to 1 hour by default, specifies the maximum delay between two consecutive checkpoints

dfs.namenode.checkpoint.txns, set to 1 million by default, defines the number of uncheckpointed transactions on the NameNode which will force an urgent checkpoint, even if the checkpoint period has not been reached.

The Checkpoint node stores the latest checkpoint in a directory that is structured the same as the NameNode’s directory. This allows the checkpointed image to be always available for reading by the NameNode if necessary. See Import checkpoint.

Multiple checkpoint nodes may be specified in the cluster configuration file.

For command usage, see namenode.
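The two trigger parameters combine as an "either one" rule, which the following sketch spells out using the default values quoted above. `CheckpointTriggerSketch` is a hypothetical name, and the real node polls for these conditions rather than being called like this.

```java
// Hypothetical model of the two checkpoint triggers with their documented defaults.
class CheckpointTriggerSketch {
    static final long PERIOD_SECONDS = 3600;         // dfs.namenode.checkpoint.period (1 hour)
    static final long TXN_THRESHOLD = 1_000_000;     // dfs.namenode.checkpoint.txns (1 million)

    // A checkpoint is due when either the period has elapsed or enough
    // uncheckpointed transactions have accumulated.
    static boolean shouldCheckpoint(long secondsSinceLast, long uncheckpointedTxns) {
        return secondsSinceLast >= PERIOD_SECONDS || uncheckpointedTxns >= TXN_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(600, 50_000));      // quiet cluster, recent checkpoint
        System.out.println(shouldCheckpoint(600, 1_200_000));   // busy cluster forces a checkpoint
        System.out.println(shouldCheckpoint(3_600, 10));        // period elapsed
    }
}
```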

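6) Why is HDFS not good at storing small files?

| Files | NameNode usage (memory) | DataNode usage (disk) |
| --- | --- | --- |
| One 128 MB file | metadata for 1 block | 128 MB * replication factor |
| 10,000 files totaling 128 MB | metadata for 10,000 blocks | 128 MB * replication factor |

The NameNode keeps all metadata in the memory of a single machine, so a large number of small files puts its memory under pressure.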
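A rough estimate shows how the same amount of data costs very different amounts of NameNode memory. The sketch assumes about 150 bytes of heap per namespace object (one object per file plus one per block), a commonly quoted rule of thumb that does not come from the original text; actual usage varies by version and path length. `NameNodeMemoryEstimate` is a hypothetical class name.

```java
// Hypothetical estimate of NameNode heap used by file and block metadata.
class NameNodeMemoryEstimate {
    static final long BYTES_PER_OBJECT = 150; // assumed rule of thumb, not an exact figure

    // Each file contributes one file object plus one object per block.
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        System.out.println("1 file of 128 MB:            " + estimateBytes(1, 1) + " bytes");
        System.out.println("10,000 files totaling 128 MB: " + estimateBytes(10_000, 1) + " bytes");
    }
}
```

Both cases store the same 128 MB on the DataNodes, but the second needs ten thousand times more NameNode memory.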

How can the small-file storage problem be solved?

7）Safemode

https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Safemode

During start up the NameNode loads the file system state from the fsimage and the edits log file. It then waits for DataNodes to report their blocks so that it does not prematurely start replicating the blocks though enough replicas already exist in the cluster. During this time NameNode stays in Safemode. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to file system or blocks. Normally the NameNode leaves Safemode automatically after the DataNodes have reported that most file system blocks are available. If required, HDFS could be placed in Safemode explicitly using bin/hdfs dfsadmin -safemode command. NameNode front page shows whether Safemode is on or off. A more detailed description and configuration is maintained as JavaDoc for setSafeMode().
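The phrase "most file system blocks are available" corresponds to a configurable ratio. The sketch below assumes the property dfs.namenode.safemode.threshold-pct with a default of 0.999, which is not mentioned in the text above, and it ignores other conditions a real NameNode applies before leaving Safemode. `SafemodeSketch` is a hypothetical class name.

```java
// Hypothetical check of whether enough blocks have been reported to leave Safemode.
class SafemodeSketch {
    static final double DEFAULT_THRESHOLD = 0.999; // assumed default of dfs.namenode.safemode.threshold-pct

    // True when the fraction of reported blocks reaches the threshold.
    static boolean canLeave(long reportedBlocks, long totalBlocks, double threshold) {
        if (totalBlocks == 0) {
            return true; // an empty file system has nothing to wait for
        }
        return (double) reportedBlocks / totalBlocks >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(canLeave(998, 1000, DEFAULT_THRESHOLD));
        System.out.println(canLeave(1000, 1000, DEFAULT_THRESHOLD));
    }
}
```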

8) How the DataNode works

[Figure: DataNode working mechanism (assets/1558262416674.png)]

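9) Checksum verification (for reference)

A common way to detect corrupted data is to compute a checksum when the data first enters the system and store it, then compute the checksum again after the data has been transferred and compare the two. If the new checksum does not match the original, the data is considered corrupted. This technique cannot repair data; it can only detect errors. (This is exactly why low-end hardware should be avoided; in particular, always use ECC memory.) Note that the checksum itself can also be corrupted, not just the data, but because the checksum is much smaller than the data, the probability is very low.

CRC

In practice, whenever Hadoop creates a file a, it also creates a hidden file .a.crc in the same directory, which records the checksums of the file. A 32-bit (4-byte) checksum is generated for every 512 bytes of the data file. The number of bytes covered by each checksum can be changed through io.bytes.per.checksum in src/core/core-default.xml.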
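The per-chunk scheme can be tried out with the JDK's `java.util.zip.CRC32`. The sketch below computes one 32-bit checksum per 512-byte chunk and locates the first chunk whose checksum no longer matches. It illustrates the principle only: the exact CRC variant and file layout Hadoop uses may differ, and `ChunkChecksumSketch` is a hypothetical class name.

```java
import java.util.zip.CRC32;

// Hypothetical illustration of per-chunk checksums, one CRC-32 value per 512 bytes.
class ChunkChecksumSketch {
    // Computes one checksum per chunk; the last chunk may be shorter.
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Returns the index of the first chunk that fails verification, or -1 if all match.
    static int firstCorruptChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = checksums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1025];
        long[] stored = checksums(data, 512);   // computed when the data is written
        data[600] ^= 0x01;                      // simulate a flipped bit during transfer
        System.out.println("checksums stored: " + stored.length);
        System.out.println("first corrupt chunk: " + firstCorruptChunk(data, stored, 512));
    }
}
```

The check pinpoints which 512-byte chunk is damaged but cannot repair it; recovery relies on reading another replica of the block.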
