Problem description

The machines in our server room lost power and were brought back up. After the restart, the cluster status was red.

Key error message:

nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]

 "index" : "device_search_2020","shard" : 3,"primary" : true,"current_state" : "unassigned","unassigned_info" : {"reason" : "ALLOCATION_FAILED","at" : "2021-04-14T02:23:35.837Z","failed_allocation_attempts" : 5,"details" : """failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, failure RecoveryFailedException[[device_search_20201204][3]: Recovery failed on {reading_10.10.2.75_node2}{LwWiAwmdQCiEibtiF7oqxQ}{YVadGK2FSDKbR69l0Wu0xg}{10.10.2.75}{10.10.2.75:9402}{dil}{ml.machine_memory=539647844352, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: ElasticsearchException[java.io.IOException: failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]; nested: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st]; nested: IOException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=892219961 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st")))]; ""","last_allocation_status" : "no"}

Cluster status as seen in elasticsearch-head:

Checking the cluster status with Kibana:

GET /_cluster/allocation/explain?pretty

Result:

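Error analysis

Cause: the cluster in the server room was forcibly shut down by the power outage. When the cluster came back up, recovery failed with: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st];

The power outage left some files not fully flushed to disk. During recovery, Elasticsearch checks whether these files are in the expected state. Because this file was never completely written, the check fails, which is the "codec footer mismatch (file truncated?)" in the error. The cluster therefore refuses to use this shard copy, and since it is a primary shard, the cluster turns red.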
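The two footer values in the error can be decoded by hand, which makes the "file truncated?" hint more concrete. The sketch below assumes Lucene's codec footer layout, in which the footer begins with a magic number that is the bitwise complement of the codec header magic 0x3fd76c17. The function and variable names are illustrative only.

```python
import struct

# Lucene's codec header magic; the footer magic is its bitwise complement.
CODEC_MAGIC = 0x3FD76C17
FOOTER_MAGIC = ~CODEC_MAGIC & 0xFFFFFFFF


def to_signed32(value):
    """Interpret an unsigned 32-bit value as a signed (Java-style) int."""
    return struct.unpack(">i", struct.pack(">I", value))[0]


# Values copied from the error message above.
expected_footer = -1071082520
actual_footer = 892219961

# The expected value is the footer magic printed as a signed int.
print(to_signed32(FOOTER_MAGIC) == expected_footer)

# The four bytes that were actually found where the magic should be.
actual_bytes = struct.pack(">i", actual_footer)
print(actual_bytes)
```

The bytes found at the footer position turn out to be printable ASCII rather than the magic number, which is consistent with a file whose final bytes never reached the disk.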

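At this point it looked as if the file was simply corrupted, so how could the shard be recovered?

We searched online for a long time. Posts on foreign sites said that this error cannot be recovered from and that the data has to be restored from a snapshot.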

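Solution

The solution we arrived at by trial: fix it with a reroute.

Run the command below in Kibana. Replace the index name, shard number, and data node with your own values. How to look these up is described further down.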

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        # the problematic index
        "index": "device_search_20201204",
        # the problematic shard (3 in the error message above)
        "shard": 3,
        # the data node that holds the shard copy
        "node": "reading_10.10.2.75_node2",
        "accept_data_loss": true
      }
    }
  ]
}
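For scripting the same call outside Kibana, a minimal sketch using only the Python standard library might look like this. The function name build_reroute_request and the address http://localhost:9200 are assumptions made for illustration; the index, shard number 3, and node name are taken from the error message in this post. The request is only built here, not sent.

```python
import json
import urllib.request


def build_reroute_request(base_url, index, shard, node):
    """Build (but do not send) an allocate_stale_primary reroute request."""
    body = {
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Acknowledges that writes missing from this stale copy are lost.
                    "accept_data_loss": True,
                }
            }
        ]
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/reroute",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical address; replace it with a node of your own cluster.
request = build_reroute_request(
    "http://localhost:9200", "device_search_20201204", 3, "reading_10.10.2.75_node2"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Keeping the build step separate from the send step makes it easy to inspect the body before anything that accepts data loss is submitted.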

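Finding the problematic index, shard, and node:

Use the command:

GET /_cluster/allocation/explain?pretty

Running it in Kibana returns the details of the unassigned shard, including the index name, shard number, and node, in the same form as the output quoted in the problem description.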
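If the explain response is saved as JSON, the three values needed for the reroute can be pulled out programmatically. The following is a rough sketch: reroute_parameters is a hypothetical helper, the sample is a trimmed response modeled on the output quoted earlier (with the index name as it appears in the details string), and the regular expression assumes the node name is the first {...} group after "Recovery failed on".

```python
import re


def reroute_parameters(explain):
    """Extract index, shard, and node name from an allocation-explain response."""
    details = explain["unassigned_info"]["details"]
    # Assumption: the node name is the first {...} group after "Recovery failed on".
    match = re.search(r"Recovery failed on \{([^}]+)\}", details)
    return {
        "index": explain["index"],
        "shard": explain["shard"],
        "node": match.group(1) if match else None,
    }


# Trimmed sample modeled on the explain output quoted in this post.
sample = {
    "index": "device_search_20201204",
    "shard": 3,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "details": "failed shard on node [LwWiAwmdQCiEibtiF7oqxQ]: failed recovery, "
        "failure RecoveryFailedException[[device_search_20201204][3]: Recovery failed on "
        "{reading_10.10.2.75_node2}{LwWiAwmdQCiEibtiF7oqxQ}{YVadGK2FSDKbR69l0Wu0xg}",
    },
}

print(reroute_parameters(sample))
```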

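After the repair command ran, the red shard turned green. The reroute skipped the check that failed in the error shown above, namely reading the retention-leases-91171.st file: IOException[failed to read /home/wsn/es/es7.5/node_2/data/nodes/0/indices/QGft9wywTOeSNjcsz_UUHA/3/_state/retention-leases-91171.st];

The reroute regenerated a fresh copy of this file.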

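Lessons learned

1. If a power outage is known in advance, shut the cluster down cleanly before the power is cut. If it is not known in advance, there is nothing to be done.

2. The cluster should have at least one replica. Otherwise, when something like this happens, losing data is the only option.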
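As a sketch of the second point, the replica count of an existing index can be raised through the index settings API. The helper name build_replica_request and the address http://localhost:9200 are assumptions for illustration, and the request is only built, not sent.

```python
import json
import urllib.request


def build_replica_request(base_url, index, replicas=1):
    """Build (but do not send) a request that sets the replica count of an index."""
    body = {"index": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical address; the index name is the one from this post.
replica_request = build_replica_request("http://localhost:9200", "device_search_20201204")
print(replica_request.get_method(), replica_request.full_url)
print(replica_request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(replica_request)
```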
