HDFS DataNode (DN) decommissioning and how to speed it up

First, look at the source code that chooses the replication source node:

    /**
     * Parse the data-nodes the block belongs to and choose one,
     * which will be the replication source.
     *
     * We prefer nodes that are in DECOMMISSION_INPROGRESS state to other nodes
     * since the former do not have write traffic and hence are less busy.
     * We do not use already decommissioned nodes as a source.
     * Otherwise we choose a random node among those that did not reach their
     * replication limits.  However, if the replication is of the highest priority
     * and all nodes have reached their replication limits, we will choose a
     * random node despite the replication limit.
     *
     * In addition form a list of all nodes containing the block
     * and calculate its replication numbers.
     *
     * @param block Block for which a replication source is needed
     * @param containingNodes List to be populated with nodes found to contain the
     *                        given block
     * @param nodesContainingLiveReplicas List to be populated with nodes found to
     *                                    contain live replicas of the given block
     * @param numReplicas NumberReplicas instance to be initialized with the
     *                                   counts of live, corrupt, excess, and
     *                                   decommissioned replicas of the given
     *                                   block.
     * @param priority integer representing replication priority of the given
     *                 block
     * @return the DatanodeDescriptor of the chosen node from which to replicate
     *         the given block
     */
    @VisibleForTesting
    DatanodeDescriptor chooseSourceDatanode(Block block,
        List<DatanodeDescriptor> containingNodes,
        List<DatanodeStorageInfo> nodesContainingLiveReplicas,
        NumberReplicas numReplicas,
        int priority) {
      containingNodes.clear();
      nodesContainingLiveReplicas.clear();
      DatanodeDescriptor srcNode = null;
      int live = 0;
      int decommissioned = 0;
      int corrupt = 0;
      int excess = 0;
      Collection<DatanodeDescriptor> nodesCorrupt = corruptReplicas.getNodes(block);
      for(DatanodeStorageInfo storage : blocksMap.getStorages(block)) {
        final DatanodeDescriptor node = storage.getDatanodeDescriptor();
        LightWeightLinkedSet<Block> excessBlocks =
          excessReplicateMap.get(node.getDatanodeUuid());
        int countableReplica = storage.getState() == State.NORMAL ? 1 : 0;
        if ((nodesCorrupt != null) && (nodesCorrupt.contains(node)))
          corrupt += countableReplica;
        else if (node.isDecommissionInProgress() || node.isDecommissioned())
          decommissioned += countableReplica;
        else if (excessBlocks != null && excessBlocks.contains(block)) {
          excess += countableReplica;
        } else {
          nodesContainingLiveReplicas.add(storage);
          live += countableReplica;
        }
        containingNodes.add(node);
        // Check if this replica is corrupt
        // If so, do not select the node as src node
        if ((nodesCorrupt != null) && nodesCorrupt.contains(node))
          continue;
        if(priority != UnderReplicatedBlocks.QUEUE_HIGHEST_PRIORITY
            && node.getNumberOfBlocksToBeReplicated() >= maxReplicationStreams)
        {
          continue; // already reached replication limit
        }
        if (node.getNumberOfBlocksToBeReplicated() >= replicationStreamsHardLimit)
        {
          continue;
        }
        // the block must not be scheduled for removal on srcNode
        if(excessBlocks != null && excessBlocks.contains(block))
          continue;
        // never use already decommissioned nodes
        if(node.isDecommissioned())
          continue;
        // we prefer nodes that are in DECOMMISSION_INPROGRESS state
        // Author's note: a decommissioning node is preferred and assigned to
        // srcNode. After that, later iterations mostly never reach
        // DFSUtil.getRandom().nextBoolean(), because
        // if(srcNode.isDecommissionInProgress()) continue; skips them.
        if(node.isDecommissionInProgress() || srcNode == null) {
          srcNode = node;
          continue;
        }
        if(srcNode.isDecommissionInProgress())
          continue;
        // switch to a different node randomly
        // this to prevent from deterministically selecting the same node even
        // if the node failed to replicate the block on previous iterations
        // Author's note: this mainly prevents the same node from being chosen
        // every time. A decommissioning node never reaches this line (it hit
        // continue above). If all three nodes holding a healthy block (not
        // corrupt, not excess) are ordinary in-service nodes, the choice among
        // them is pseudo-random (not uniform: each later candidate replaces
        // the current choice with probability 1/2).
        if(DFSUtil.getRandom().nextBoolean())
          srcNode = node;
      }
      if(numReplicas != null)
        numReplicas.initialize(live, decommissioned, corrupt, excess, 0);
      return srcNode;
    }

From the comment we can see that source-node selection follows these rules:

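1. Nodes that are being decommissioned are preferred, because they receive no write requests and are therefore less loaded.
2. Nodes that have already been decommissioned are never chosen.
3. If a datanode's pending replication work (the size of its queue of blocks to replicate plus the replication work still waiting for a target datanode to be chosen) is below the replication limit (< maxReplicationStreams, config key **dfs.namenode.replication.max-streams**), a node is chosen at random from the candidate list.
4. If that sum has reached the replication limit (>= maxReplicationStreams), the node is skipped (`continue` in the code), unless the block is in the highest-priority queue, in which case a node is still chosen at random.
5. If that sum has reached the hard replication limit (>= replicationStreamsHardLimit, config key **dfs.namenode.replication.max-streams-hard-limit**), the node is skipped no matter what.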
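
The two limit checks can be isolated into a small predicate. This is a minimal sketch, not Hadoop code: the class `SourceThrottleSketch` and the method `passesThrottle` are hypothetical names, and `pending` stands in for the value of `node.getNumberOfBlocksToBeReplicated()`.

```java
/** Hypothetical helper that mirrors only the two limit checks in chooseSourceDatanode. */
final class SourceThrottleSketch {
    static final int QUEUE_HIGHEST_PRIORITY = 0;

    /**
     * @param pending   stand-in for node.getNumberOfBlocksToBeReplicated()
     * @param priority  queue priority of the block being replicated
     * @param softLimit value of dfs.namenode.replication.max-streams
     * @param hardLimit value of dfs.namenode.replication.max-streams-hard-limit
     * @return true if the node is not skipped by either limit
     */
    static boolean passesThrottle(int pending, int priority, int softLimit, int hardLimit) {
        if (priority != QUEUE_HIGHEST_PRIORITY && pending >= softLimit) {
            return false; // soft limit reached and the block is not highest priority
        }
        if (pending >= hardLimit) {
            return false; // hard limit applies to every priority
        }
        return true;
    }

    public static void main(String[] args) {
        // Limits 2 and 4 are the defaults quoted in this post.
        System.out.println(passesThrottle(1, 2, 2, 4)); // true: below the soft limit
        System.out.println(passesThrottle(2, 2, 2, 4)); // false: soft limit reached
        System.out.println(passesThrottle(2, 0, 2, 4)); // true: highest priority ignores the soft limit
        System.out.println(passesThrottle(4, 0, 2, 4)); // false: hard limit reached
    }
}
```

The sketch leaves out the corrupt, excess, and decommissioned checks, which the real method applies in the same loop.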

Three parameters that speed up DN decommissioning

    <property>
      <name>dfs.namenode.replication.max-streams</name>
      <value>64</value>
    </property>
    <property>
      <name>dfs.namenode.replication.max-streams-hard-limit</name>
      <value>128</value>
    </property>
    <property>
      <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
      <value>32</value>
    </property>

The defaults of these three parameters are 2, 4, and 2, respectively.

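The first two parameters were covered in the five rules above. The third, dfs.namenode.replication.work.multiplier.per.iteration, determines how many blocks can be picked from the under-replicated blocks and prepared for replication in each iteration. The number is proportional to the number of live datanodes in the cluster:
int blocksToProcess = numlive * this.blocksReplWorkMultiplier (blocksReplWorkMultiplier holds the value of dfs.namenode.replication.work.multiplier.per.iteration). With 10 in-service datanodes, the default therefore allows 10 * 2 = 20 blocks to be picked for replication.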
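
To make the arithmetic concrete, here is a tiny sketch. The class `ReplicationWorkSketch` and its method are hypothetical names used only for illustration; the formula is the one quoted above.

```java
/** Hypothetical illustration of blocksToProcess = numlive * blocksReplWorkMultiplier. */
final class ReplicationWorkSketch {
    /**
     * @param liveDatanodes  number of live datanodes (numlive)
     * @param workMultiplier value of dfs.namenode.replication.work.multiplier.per.iteration
     */
    static int blocksToProcess(int liveDatanodes, int workMultiplier) {
        return liveDatanodes * workMultiplier;
    }

    public static void main(String[] args) {
        System.out.println(blocksToProcess(10, 2));  // 20 with the default multiplier
        System.out.println(blocksToProcess(10, 32)); // 320 with the tuned value from this post
    }
}
```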

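These three parameters are really just throttles.
The third parameter is the entry throttle: it decides how many blocks can be picked from the set of under-replicated blocks and added to the replication queue.
The first two parameters are the exit throttle: while the blocks in the replication queue are being iterated over, they decide whether the current block gets a srcNode (the result may also be null). If a node satisfies the limits, srcNode is set to that node; if not, the loop continues to the next node until one qualifies. If no node qualifies, null is returned, and a block with a null srcNode is not replicated.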
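
The interaction of the entry and exit throttles can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not the NameNode's scheduler: every block is assumed to have a replica on every node, all blocks are assumed to be non-highest priority (so only the soft limit matters), and pending counts never drain during the iteration. All names are hypothetical.

```java
/** Toy model of one scheduling iteration with an entry and an exit throttle. */
final class TwoStageThrottleSketch {
    /**
     * @param queued     number of under-replicated blocks waiting
     * @param liveNodes  number of live datanodes
     * @param multiplier dfs.namenode.replication.work.multiplier.per.iteration
     * @param pending    pending replication count per node (mutated by this method)
     * @param softLimit  dfs.namenode.replication.max-streams
     * @return how many blocks were given a source node in this iteration
     */
    static int scheduleOneIteration(int queued, int liveNodes, int multiplier,
                                    int[] pending, int softLimit) {
        // Entry throttle: only this many blocks are considered at all.
        int admitted = Math.min(queued, liveNodes * multiplier);
        int scheduled = 0;
        for (int b = 0; b < admitted; b++) {
            int src = -1; // stands in for srcNode == null
            for (int n = 0; n < pending.length; n++) {
                // Exit throttle: skip nodes that already reached the limit.
                if (pending[n] < softLimit) {
                    src = n;
                    break;
                }
            }
            if (src < 0) {
                continue; // no eligible source, so the block is not replicated this round
            }
            pending[src]++;
            scheduled++;
        }
        return scheduled;
    }

    public static void main(String[] args) {
        System.out.println(scheduleOneIteration(100, 3, 2, new int[3], 2));   // 6
        System.out.println(scheduleOneIteration(100, 3, 32, new int[3], 2));  // 6
        System.out.println(scheduleOneIteration(100, 3, 32, new int[3], 64)); // 96
    }
}
```

In this toy model, raising only the multiplier admits more blocks but schedules no more of them, because the per-node limit still rejects every source. This suggests why the post raises all three values together.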

Block priority

A block sits in one of five priority queues. The source code is as follows:

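    /** The total number of queues : {@value} */
    // Five levels in total.
    static final int LEVEL = 5;

    /** The queue with the highest priority: {@value} */
    // Highest priority: these blocks are replicated first. They are blocks
    // with only one replica, or blocks with 0 live replicas and 1 replica on a
    // node that is being decommissioned. If the disk or server holding such a
    // block fails, the block is at risk of being lost.
    static final int QUEUE_HIGHEST_PRIORITY = 0;

    /** The queue for blocks that are way below their expected value : {@value} */
    // Second priority: blocks whose replica count is far below the expected
    // value. Currently this means the actual:expected ratio is less than 1:3.
    // These blocks may not be at risk, but they are clearly considered
    // important.
    static final int QUEUE_VERY_UNDER_REPLICATED = 1;

    /** The queue for "normally" under-replicated blocks: {@value} */
    // Third priority: these blocks are also under-replicated, but the
    // actual:expected ratio is good enough (not less than 1:3) that they do not
    // need to go into the QUEUE_VERY_UNDER_REPLICATED queue.
    static final int QUEUE_UNDER_REPLICATED = 2;

    /** The queue for blocks that have the right number of replicas,
     * but which the block manager felt were badly distributed: {@value}
     */
    // Fourth priority: the block has at least the required number of replicas,
    // but they are poorly distributed, so the failure of a single rack could
    // lose all replicas or take them offline.
    static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;

    /** The queue for corrupt blocks: {@value} */
    // Fifth priority: corrupt blocks for which no non-corrupt replica is
    // currently available. The policy is to keep these corrupt blocks
    // replicated, but to give blocks that are not corrupt higher priority.
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;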
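
The queue descriptions translate into a classification rule roughly like the following. This is a sketch derived from the comments above, not a copy of Hadoop's implementation; the class `BlockPrioritySketch`, the method `priorityOf`, and its parameters are hypothetical names.

```java
/** Hypothetical classifier that maps replica counts to the five queues described above. */
final class BlockPrioritySketch {
    static final int QUEUE_HIGHEST_PRIORITY = 0;
    static final int QUEUE_VERY_UNDER_REPLICATED = 1;
    static final int QUEUE_UNDER_REPLICATED = 2;
    static final int QUEUE_REPLICAS_BADLY_DISTRIBUTED = 3;
    static final int QUEUE_WITH_CORRUPT_BLOCKS = 4;

    /**
     * @param liveReplicas           replicas on in-service nodes
     * @param decommissionedReplicas replicas on decommissioning or decommissioned nodes
     * @param expectedReplicas       target replication factor
     */
    static int priorityOf(int liveReplicas, int decommissionedReplicas, int expectedReplicas) {
        if (liveReplicas >= expectedReplicas) {
            // Enough copies; assumed to be queued only because of poor rack placement.
            return QUEUE_REPLICAS_BADLY_DISTRIBUTED;
        }
        if (liveReplicas == 0) {
            // Only a decommissioning copy remains: replicate it first.
            // Otherwise only corrupt copies are left.
            return decommissionedReplicas > 0
                ? QUEUE_HIGHEST_PRIORITY
                : QUEUE_WITH_CORRUPT_BLOCKS;
        }
        if (liveReplicas == 1) {
            return QUEUE_HIGHEST_PRIORITY; // a single copy is at risk of loss
        }
        if (liveReplicas * 3 < expectedReplicas) {
            return QUEUE_VERY_UNDER_REPLICATED; // actual:expected below 1:3
        }
        return QUEUE_UNDER_REPLICATED;
    }

    public static void main(String[] args) {
        System.out.println(priorityOf(0, 1, 3));  // 0
        System.out.println(priorityOf(1, 0, 3));  // 0
        System.out.println(priorityOf(2, 0, 10)); // 1
        System.out.println(priorityOf(2, 0, 3));  // 2
        System.out.println(priorityOf(3, 0, 3));  // 3
        System.out.println(priorityOf(0, 0, 3));  // 4
    }
}
```

For decommissioning, the first case is the interesting one: a block whose only remaining replica lives on a decommissioning node lands in the highest-priority queue, and so, per rule 4 above, its source node is not skipped by the max-streams soft limit.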