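1. Problem description: secondary stuck in RECOVERING

An operations colleague reported that queries against the MongoDB secondary were not returning the latest records. That suggested either replication lag or a failure, so I checked rs.status() right away and found the secondary in the RECOVERING state.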

shard1:RECOVERING> rs.status();

{

"set" : "shard1",

"date" : ISODate("2017-03-03T03:08:50.882Z"),

"myState" : 3,

"members" : [

{

"_id" : 0,

"name" : "192.168.3.11:27017",

"health" : 1,

"state" : 1,

"stateStr" : "PRIMARY",

"uptime" : 69310,

"optime" : Timestamp(1488510526, 3),

"optimeDate" : ISODate("2017-03-03T03:08:46Z"),

"lastHeartbeat" : ISODate("2017-03-03T03:08:50.416Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:08:49.706Z"),

"pingMs" : 0,

"electionTime" : Timestamp(1479454146, 1),

"electionDate" : ISODate("2016-11-18T07:29:06Z"),

"configVersion" : 1

},

{

"_id" : 1,

"name" : "192.168.3.12:27017",

"health" : 1,

"state" : 3,

"stateStr" : "RECOVERING",

"uptime" : 69311,

"optime" : Timestamp(1471072341, 1),

"optimeDate" : ISODate("2016-08-13T07:12:21Z"),

"configVersion" : 1,

"self" : true

},

{

"_id" : 2,

"name" : "192.168.3.11:27037",

"health" : 1,

"state" : 7,

"stateStr" : "ARBITER",

"uptime" : 69310,

"lastHeartbeat" : ISODate("2017-03-03T03:08:50.412Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:08:50.322Z"),

"pingMs" : 0,

"configVersion" : 1

}

],

"ok" : 1

}

shard1:RECOVERING>
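The output above already shows the problem: the RECOVERING member's optime is months behind the primary's. As a minimal sketch, the lag can be computed from a plain object shaped like the rs.status() result. The helper name replicationLag and the optimeSecs field are assumptions made for this illustration (optimeSecs stands for the first argument of Timestamp(...)); they are not part of the MongoDB API.

```javascript
// Sketch: how far does each data-bearing member lag behind the primary?
// Input is a plain object modeled on rs.status(); `optimeSecs` is a
// hypothetical field holding the seconds part of each member's optime.
function replicationLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary, nothing to compare against
  return status.members
    .filter((m) => m.stateStr !== "ARBITER" && m !== primary)
    .map((m) => ({
      name: m.name,
      stateStr: m.stateStr,
      lagSeconds: primary.optimeSecs - m.optimeSecs,
    }));
}

// Values copied from the rs.status() output above.
const status = {
  set: "shard1",
  members: [
    { name: "192.168.3.11:27017", stateStr: "PRIMARY", optimeSecs: 1488510526 },
    { name: "192.168.3.12:27017", stateStr: "RECOVERING", optimeSecs: 1471072341 },
    { name: "192.168.3.11:27037", stateStr: "ARBITER" },
  ],
};

const lag = replicationLag(status);
console.log(lag[0].name, lag[0].stateStr, lag[0].lagSeconds); // lag is 17438185 seconds
console.log((lag[0].lagSeconds / 86400).toFixed(1) + " days behind");
```

A lag of roughly 200 days is far larger than any realistic oplog window, which already hints that ordinary catch-up will not work.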

2. Analyzing replSet error RS102 in the server log

First find the log path:

[mongodb@mongodb_m2 ~]$ ps -eaf|grep 27017

mongodb  24630     1  0 Mar02 ?        00:03:41 /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

mongodb  39309 30937  0 10:35 pts/0    00:00:00 grep 27017

[mongodb@mongodb_m2 ~]$

Then read the server log and locate the error messages:

more /data/mongodb/logs/shard_m1s1_27017.log

2017-03-03T09:44:59.070+0800 I REPL     [ReplicationExecutor] syncing from: 192.168.3.11:27017

2017-03-03T09:44:59.071+0800 W REPL     [rsBackgroundSync] we are too stale to use 192.168.3.11:27017 as a sync source

2017-03-03T09:44:59.071+0800 I REPL     [ReplicationExecutor] could not find member to sync from

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet error RS102 too stale to catch up

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet our last optime : Aug 13 15:12:21 57aec855:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet oldest available is Feb  7 14:13:10 58996576:1

2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember

2017-03-03T09:45:18.914+0800 I NETWORK  [conn6420] end connection 192.168.3.11:5804 (3 connections now open)

2017-03-03T09:45:18.915+0800 I NETWORK  [initandlisten] connection accepted from 192.168.3.11:5824 #6423 (4 connections now open)

2017-03-03T09:45:20.195+0800 I NETWORK  [conn6421] end connection 192.168.3.11:5806 (3 connections now open)

2017-03-03T09:45:20.196+0800 I NETWORK  [initandlisten] connection accepted from 192.168.3.11:5829 #6424 (4 connections now open)

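The line "replSet oldest available is Feb  7 14:13:10 58996576:1" shows that the oldest oplog entry still available on the sync source dates from February 7, 2017, while "replSet our last optime : Aug 13 15:12:21 57aec855:1" shows that this secondary last applied an operation on August 13, 2016. The entries the secondary needs have already been overwritten, so it cannot catch up through normal replication and has to be resynced manually. That is what the log suggests; for the detailed replication information we still need to check from the mongo shell.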
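The optimes in these log lines are printed as hexadecimal seconds plus an increment (for example 57aec855:1). The following sketch decodes them and applies the staleness rule: a member is too stale when its last applied entry is older than the oldest entry the sync source still has. The function names parseOptime and isTooStale are made up for this illustration.

```javascript
// Decode an optime printed as "<hex seconds>:<increment>" in the RS102 lines.
function parseOptime(text) {
  const [hexSecs, inc] = text.split(":");
  return { seconds: parseInt(hexSecs, 16), increment: parseInt(inc, 10) };
}

// True when our last applied entry is older than the oldest entry that is
// still available on the sync source (compare seconds first, then increment).
function isTooStale(ourLast, oldestAvailable) {
  if (ourLast.seconds !== oldestAvailable.seconds) {
    return ourLast.seconds < oldestAvailable.seconds;
  }
  return ourLast.increment < oldestAvailable.increment;
}

// Values copied from the log excerpt above.
const ourLast = parseOptime("57aec855:1");
const oldest = parseOptime("58996576:1");

console.log(new Date(ourLast.seconds * 1000).toISOString()); // 2016-08-13T07:12:21.000Z
console.log(new Date(oldest.seconds * 1000).toISOString());  // 2017-02-07T06:13:10.000Z
console.log(isTooStale(ourLast, oldest));                    // true
```

The decoded times are in UTC; adding eight hours gives the local times shown in the log (Aug 13 15:12:21 and Feb 7 14:13:10).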

3. Checking replication info on the primary and the secondary

On the secondary:

shard1:RECOVERING>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 11028041secs (3063.34hrs)

oplog first event time:  Thu Apr 07 2016 23:51:40 GMT+0800 (CST)

oplog last event time:   Sat Aug 13 2016 15:12:21 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:37:25 GMT+0800 (CST)

shard1:RECOVERING>

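The secondary's oplog spans 3063.34 hours, the configured oplog size is 2 GB, the first oplog entry is from April 7, 2016 and the last one from August 13, 2016. In other words, this secondary stopped replicating a long time ago.

Now the replication info on the primary: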

shard1:PRIMARY>  db.printReplicationInfo();

configured oplog size:   2048.003890991211MB

log length start to end: 2059878secs (572.19hrs)

oplog first event time:  Tue Feb 07 2017 14:31:13 GMT+0800 (CST)

oplog last event time:   Fri Mar 03 2017 10:42:31 GMT+0800 (CST)

now:                     Fri Mar 03 2017 10:42:32 GMT+0800 (CST)

shard1:PRIMARY>

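The primary's oplog starts on February 7, 2017 and ends on March 3, 2017, a window of about 572 hours. Its start is close to the "oldest available" time reported in the secondary's log (also February 7); the small difference is expected, because the capped oplog keeps discarding its oldest entries. The secondary's last oplog entry, however, is from August 13, 2016, roughly six months before the oldest entry the primary still holds. The secondary did try to sync from the primary, but it had fallen too far behind and the sync failed, so a manual resync is required.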
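The numbers can be checked with a few lines of arithmetic. This sketch recomputes the primary's oplog window and the gap between the secondary's last entry and the primary's first entry, using the timestamps printed by db.printReplicationInfo() above (GMT+0800). The helper name oplogWindowHours is an assumption for this illustration.

```javascript
// Oplog window in hours between the first and the last oplog event.
function oplogWindowHours(firstEvent, lastEvent) {
  return (lastEvent.getTime() - firstEvent.getTime()) / 3600000;
}

// Timestamps copied from the two db.printReplicationInfo() outputs.
const primaryFirst = new Date("2017-02-07T14:31:13+08:00");
const primaryLast = new Date("2017-03-03T10:42:31+08:00");
const secondaryLast = new Date("2016-08-13T15:12:21+08:00");

const windowHours = oplogWindowHours(primaryFirst, primaryLast);
// Days between the secondary's last entry and the primary's oldest entry.
const gapDays = (primaryFirst.getTime() - secondaryLast.getTime()) / 86400000;

console.log(windowHours.toFixed(2)); // 572.19
console.log(gapDays.toFixed(1));     // 178.0
console.log(gapDays > 0 ? "resync required" : "can catch up from the oplog");
```

A positive gap means the operations the secondary is missing are no longer in the primary's oplog, which is the condition behind error RS102.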

4. Manually resyncing the secondary

The error log points to the resync documentation: 2017-03-03T09:44:59.071+0800 I REPL     [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember. It describes the following ways to resync a member:

(1) Automatically Sync a Member

WARNING

During initial sync, mongod will remove the content of the dbPath.

Steps

You can also force a mongod that is already a member of the set to perform an initial sync by restarting the instance without the content of the dbPath as follows:

Stop the member’s mongod instance. To ensure a clean shutdown, use the db.shutdownServer() method from the mongo shell or on Linux systems, the mongod --shutdown option.

Delete all data and sub-directories from the member’s data directory. By removing the data dbPath, MongoDB will perform a complete resync. Consider making a backup first.

(2) Sync by Copying Data Files from Another Member

This approach “seeds” a new or stale member using the data files from an existing member of the replica set. The data files must be sufficiently recent to allow the new member to catch up with the oplog. Otherwise the member would need to perform an initial sync.

(2.1) Copy the Data Files: stop the secondary, then copy the data files from the seed server (here, the primary). The copy must include the local database. mongodump cannot be used for this; only a snapshot backup of the data files is allowed.

(2.2) Sync the Member: start the mongod instance, which then applies the oplog to catch up.

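5. Restoring the secondary

Comparing the two approaches: the first one empties the data directory and restarts mongod so that it performs an initial sync. It is simple, but recovery takes longer because all the data has to be copied again. The second one copies the data directory from another member of the replica set and then restarts mongod. Recovery is faster, but it requires more manual steps.

Weighing the two, simplicity won, so the first approach is used here.

(1) Shut down the mongod server first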

shard1:RECOVERING> db.shutdownServer();

2017-03-03T11:10:34.536+0800 I NETWORK  DBClientCursor::init call() failed

server should be down...

2017-03-03T11:10:34.539+0800 I NETWORK  trying reconnect to localhost:27017 (127.0.0.1) failed

2017-03-03T11:10:34.539+0800 W NETWORK  Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused

2017-03-03T11:10:34.539+0800 I NETWORK  reconnect localhost:27017 (127.0.0.1) failed failed couldn't connect to server localhost:27017 (127.0.0.1), connection attempt failed

2017-03-03T11:10:34.543+0800 I NETWORK  trying reconnect to localhost:27017 (127.0.0.1) failed

2017-03-03T11:10:34.543+0800 W NETWORK  Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused

2017-03-03T11:10:34.543+0800 I NETWORK  reconnect localhost:27017 (127.0.0.1) failed failed couldn't connect to server localhost:27017 (127.0.0.1), connection attempt failed

>

(2) Move the old data directory aside, create an empty one, then start the mongod instance

[mongodb@mongodb_m2 shard27017]$ mv /data/mongodb/shard27017 /data/mongodb/shard27017_bak

[mongodb@mongodb_m2 shard27017]$ mkdir /data/mongodb/shard27017

[mongodb@mongodb_m2 shard27017]$ /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongod --shardsvr --replSet shard1 --port 27017 --dbpath /data/mongodb/shard27017 --oplogSize 2048 --logpath /data/mongodb/logs/shard_m1s1_27017.log --logappend --fork

about to fork child process, waiting until server is ready for connections.

forked process: 44687

child process started successfully, parent exiting

[mongodb@mongodb_m2 shard27017]$

(3) Check the recovery state: the member is in STARTUP2, and the files in the data directory keep growing while data is copied

shard1:STARTUP2> rs.status();

{

"set" : "shard1",

"date" : ISODate("2017-03-03T03:19:43.367Z"),

"myState" : 5,

"syncingTo" : "192.168.3.11:27017",

"members" : [

{

"_id" : 0,

"name" : "192.168.3.11:27017",

"health" : 1,

"state" : 1,

"stateStr" : "PRIMARY",

"uptime" : 85,

"optime" : Timestamp(1488511178, 8),

"optimeDate" : ISODate("2017-03-03T03:19:38Z"),

"lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

"pingMs" : 0,

"electionTime" : Timestamp(1479454146, 1),

"electionDate" : ISODate("2016-11-18T07:29:06Z"),

"configVersion" : 1

},

{

"_id" : 1,

"name" : "192.168.3.12:27017",

"health" : 1,

"state" : 5,

"stateStr" : "STARTUP2",

"uptime" : 141,

"optime" : Timestamp(0, 0),

"optimeDate" : ISODate("1970-01-01T00:00:00Z"),

"syncingTo" : "192.168.3.11:27017",

"configVersion" : 1,

"self" : true

},

{

"_id" : 2,

"name" : "192.168.3.11:27037",

"health" : 1,

"state" : 7,

"stateStr" : "ARBITER",

"uptime" : 85,

"lastHeartbeat" : ISODate("2017-03-03T03:19:41.796Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:19:41.796Z"),

"pingMs" : 0,

"configVersion" : 1

}

],

"ok" : 1

}

shard1:STARTUP2>

6. Checking the result

[mongodb@mongodb_m2 mongodb]$  /usr/local/mongodb-linux-x86_64-3.0.3/bin/mongo localhost:27017/admin

MongoDB shell version: 3.0.3

connecting to: localhost:27017/admin

Server has startup warnings:

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.

2017-03-03T11:18:16.884+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'

2017-03-03T11:18:16.885+0800 I CONTROL  [initandlisten]

shard1:PRIMARY> rs.status();

{

"set" : "shard1",

"date" : ISODate("2017-03-03T03:31:34.528Z"),

"myState" : 1,

"members" : [

{

"_id" : 0,

"name" : "192.168.3.11:27017",

"health" : 1,

"state" : 2,

"stateStr" : "SECONDARY",

"uptime" : 797,

"optime" : Timestamp(1488511889, 2),

"optimeDate" : ISODate("2017-03-03T03:31:29Z"),

"lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

"pingMs" : 0,

"syncingTo" : "192.168.3.12:27017",

"configVersion" : 1

},

{

"_id" : 1,

"name" : "192.168.3.12:27017",

"health" : 1,

"state" : 1,

"stateStr" : "PRIMARY",

"uptime" : 852,

"optime" : Timestamp(1488511889, 2),

"optimeDate" : ISODate("2017-03-03T03:31:29Z"),

"electionTime" : Timestamp(1488511825, 1),

"electionDate" : ISODate("2017-03-03T03:30:25Z"),

"configVersion" : 1,

"self" : true

},

{

"_id" : 2,

"name" : "192.168.3.11:27037",

"health" : 1,

"state" : 7,

"stateStr" : "ARBITER",

"uptime" : 797,

"lastHeartbeat" : ISODate("2017-03-03T03:31:32.612Z"),

"lastHeartbeatRecv" : ISODate("2017-03-03T03:31:33.347Z"),

"pingMs" : 0,

"configVersion" : 1

}

],

"ok" : 1

}

shard1:PRIMARY>

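7. Rebuilding the oplog

The oplog can also be recreated with a different size (23 GB below, versus the 2 GB configured above), which lengthens the replication window. The two create commands are equivalent; run only one of them. Note that the official procedure for changing the oplog size on this MongoDB version performs these steps with the member restarted as a standalone instance (without --replSet).

> use local

> db.oplog.rs.drop()

> db.createCollection("oplog.rs", {"capped" : true, "size" : 23 * 1024 * 1024 * 1024})

> db.runCommand( { create: "oplog.rs", capped: true, size: (23 * 1024 * 1024 * 1024) } )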
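To get a feel for what a 23 GB oplog buys, here is a rough estimate. It assumes the write rate stays the same as during the primary's current window (2048 MB covering 572.19 hours, from db.printReplicationInfo() above), which is only an approximation; the helper name estimateWindowHours is made up for this sketch.

```javascript
// Scale the observed oplog window linearly with the oplog size.
// Assumption: the write rate stays roughly constant.
function estimateWindowHours(currentSizeMB, currentWindowHours, newSizeMB) {
  return (newSizeMB / currentSizeMB) * currentWindowHours;
}

const newSizeBytes = 23 * 1024 * 1024 * 1024;    // size passed to createCollection
const newSizeMB = newSizeBytes / (1024 * 1024);  // 23552 MB

const estimatedHours = estimateWindowHours(2048, 572.19, newSizeMB);

console.log(newSizeBytes);                          // 24696061952
console.log(Math.round(estimatedHours) + " hours"); // 6580 hours
console.log((estimatedHours / 24).toFixed(0) + " days");
```

Under that assumption the new oplog would cover roughly nine months of writes, so a secondary that is down for a few weeks could still catch up without a full resync.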
