Contents
  1. Background
  2. Analysis
  3. An Interesting Detail

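Background

A disk failed on one of the cluster's HDFS machines, and the operations team asked to have the RAID card replaced. After the replacement, the partner reported that the NameNode could not be started, with an error similar to the following: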

Safe mode is ON. The reported blocks 3 needs additional 2 blocks to reach the threshold 0.9990 of total blocks 5. The number of live datanodes 2 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
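
The numbers in this message can be checked by hand. The sketch below is a simplified reading of the message, not the NameNode's actual code (its internal rounding may differ); the function name `blocks_still_needed` is made up for illustration.

```python
import math


def blocks_still_needed(reported: int, total: int, threshold: float) -> int:
    """Simplified model: safe mode ends once ceil(total * threshold) blocks are reported."""
    required = math.ceil(total * threshold)
    return max(0, required - reported)


# Values taken from the error message: 3 reported, 5 total, threshold 0.9990.
print(blocks_still_needed(reported=3, total=5, threshold=0.999))  # 2
```

With only 5 blocks in total, a threshold of 0.9990 effectively means every block must be reported, so the 2 missing blocks keep safe mode on.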

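Analysis

Symptom: corrupted files or missing file replicas.

Resolution: check which files are corrupted. If they are unimportant, delete them directly; otherwise the data has to be regenerated.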
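
As a rough sketch of the check-then-delete step, the helper below only builds `hdfs fsck` command lines; it does not run them. The `-list-corruptfileblocks` and `-delete` options are standard `hdfs fsck` flags, while the helper name `fsck_command` and the `/tmp` path are assumptions for illustration.

```python
from typing import List


def fsck_command(path: str = "/", delete: bool = False) -> List[str]:
    """Build an `hdfs fsck` command line (hypothetical helper).

    delete=False lists files with corrupt blocks under `path`;
    delete=True asks fsck to delete the corrupted files it finds.
    """
    cmd = ["hdfs", "fsck", path]
    cmd.append("-delete" if delete else "-list-corruptfileblocks")
    return cmd


# Inspect first; delete only after confirming the files are expendable.
print(" ".join(fsck_command("/")))
print(" ".join(fsck_command("/tmp", delete=True)))
```

The resulting lists can be passed to `subprocess.run` on a machine where the `hdfs` client is installed. Listing before deleting keeps the destructive step a deliberate one.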

An Interesting Detail

The list of damaged files included a batch of entries like the following:

/tmp/hadoop-yarn/staging/hadoop/.staging/job_1663581298239_2800/job.split:  Under replicated BP-670265145-10.172.19.97-1636095455105:blk_1084485129_10762725. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/tmp/hadoop-yarn/staging/hadoop/.staging/job_1663581298239_2801/job.jar: Under replicated BP-670265145-10.172.19.97-1636095455105:blk_1084488349_10765945. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.
/tmp/hadoop-yarn/staging/hadoop/.staging/job_1663581298239_2801/job.split: Under replicated BP-670265145-10.172.19.97-1636095455105:blk_1084488350_10765946. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.....
/tmp/hadoop-yarn/staging/hadoop/.staging/job_1663581298239_2810/job.jar: Under replicated BP-670265145-10.172.19.97-1636095455105:blk_1084504517_10782115. Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) and 0 decommissioning replica(s).
.
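
When there are many such lines, it can help to pull out the path and the replica counts programmatically. The following is a minimal sketch that assumes the line format shown above and paths without spaces; the regular expression and the name `parse_line` are illustrative, not part of any Hadoop tooling.

```python
import re

# Matches fsck lines of the form shown above (assumed format).
UNDER_REPLICATED = re.compile(
    r"^(?P<path>\S+):\s+Under replicated (?P<block>\S+)\.\s+"
    r"Target Replicas is (?P<target>\d+) but found (?P<live>\d+) live replica"
)


def parse_line(line):
    """Return path, block and replica counts, or None for other lines."""
    m = UNDER_REPLICATED.match(line.strip())
    if m is None:
        return None
    return {
        "path": m["path"],
        "block": m["block"],
        "target": int(m["target"]),
        "live": int(m["live"]),
    }


sample = (
    "/tmp/hadoop-yarn/staging/hadoop/.staging/job_1663581298239_2800/job.split:  "
    "Under replicated BP-670265145-10.172.19.97-1636095455105:blk_1084485129_10762725. "
    "Target Replicas is 10 but found 5 live replica(s), 0 decommissioned replica(s) "
    "and 0 decommissioning replica(s)."
)
info = parse_line(sample)
print(info["path"], info["target"], info["live"])
```

Progress lines such as `.....` do not match the pattern and come back as `None`, so they can be filtered out.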

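Searching for the keywords "job.jar job.split 副本" (副本 means "replica") turns up the two pages below. From them it is clear that these are files belonging to MapReduce jobs, and that the replica count of 10 is set on the MapReduce side.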
https://blog.csdn.net/hfcloudeagle/article/details/84513388
https://www.cnblogs.com/sidmeng/archive/2012/04/11/2442367.html

Checking the configuration documentation on the official Hadoop site and searching for "replication" shows that the property mapreduce.client.submit.file.replication defaults to 10.
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

Here is the key point. The property is described as:

The replication level for submitted job files. This should be around the square root of the number of nodes.

In other words, since 10 is the square root of 100, the default value corresponds to a cluster of about 100 nodes!
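
The square-root rule from the property description is easy to turn into numbers. This is only a sketch of that rule of thumb; both function names are made up for illustration, and the 5-node cluster is a hypothetical example.

```python
import math


def suggested_submit_replication(num_nodes: int) -> int:
    """Rule of thumb from the description: about sqrt(number of nodes)."""
    return max(1, round(math.sqrt(num_nodes)))


def implied_cluster_size(replication: int) -> int:
    """Cluster size for which `replication` is the square root."""
    return replication ** 2


print(implied_cluster_size(10))           # 100: what the default of 10 implies
print(suggested_submit_replication(100))  # 10
print(suggested_submit_replication(5))    # 2: a hypothetical 5-node cluster
```

On a small cluster, the property could be lowered accordingly in mapred-site.xml, which would also stop job files from being reported as under-replicated.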

These days plenty of places shout "big data, big data" with only a handful of machines, and even MR just chuckles at them 😄