Notes on a failed MongoDB rollback: replSet too much data to roll back

Environment

  • Operating system: CentOS Linux release 8.2.2004 (Core)
  • MongoDB version: 3.6.21

Problem description

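A production MongoDB node went down. It was restarted automatically, but before long it would no longer come up.
The log showed that a rollback had failed: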

$ grep 'rsBackgroundSync' replication.log
2022-04-25T15:56:22.777+0800 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1650870067, 9), t: 149 }. source's GTE: { ts: Timestamp(1650870144, 1), t: 150 } hashes: (-4603273711463716908/-773514121576334543)
2022-04-25T15:56:22.777+0800 I REPL     [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2022-04-25T15:56:22.777+0800 I REPL     [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method because UUID support is feature compatible with featureCompatibilityVersion 3.6.
2022-04-25T15:56:22.777+0800 I REPL     [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2022-04-25T15:56:22.777+0800 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 1
2022-04-25T15:56:22.777+0800 I ROLLBACK [rsBackgroundSync] Starting rollback. Sync source: 172.16.31.47:27018
2022-04-25T15:56:22.779+0800 I ROLLBACK [rsBackgroundSync] Finding the Common Point
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] our last optime:   Timestamp(1650870067, 9)
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] their last optime: Timestamp(1650873382, 167)
2022-04-25T15:56:22.782+0800 I ROLLBACK [rsBackgroundSync] diff in end of log times: -3315 seconds
2022-04-25T15:56:45.012+0800 I ROLLBACK [rsBackgroundSync] Rollback common point is { ts: Timestamp(1650869833, 2586), t: 149 }
2022-04-25T15:56:45.012+0800 I REPL     [rsBackgroundSync] Incremented the rollback ID to 22
2022-04-25T15:56:45.012+0800 I ROLLBACK [rsBackgroundSync] Starting refetching documents
2022-04-25T15:57:58.580+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1650776551, 102), t: 148 }
2022-04-25T15:57:58.580+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2022-04-25T15:57:58.580+0800 F -        [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1516
2022-04-25T15:57:58.580+0800 F -        [rsBackgroundSync] \n\n***aborting after fassert() failure\n\n

The error: Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back
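The timestamps in the log can be checked by hand. The sketch below is only an illustration: it copies three lines from the log above and recomputes the "diff in end of log times" value, plus how many seconds of local oplog lie past the common point. The helper name `first_ts_seconds` is made up for this example, and the elapsed time is only a rough indicator, since the limit that trips is on the size of the refetched documents, not on time.

```python
import re

# Three lines copied from the log above; the first field of Timestamp(sec, inc)
# is seconds since the Unix epoch.
LOG = """\
our last optime:   Timestamp(1650870067, 9)
their last optime: Timestamp(1650873382, 167)
Rollback common point is { ts: Timestamp(1650869833, 2586), t: 149 }
"""


def first_ts_seconds(line: str) -> int:
    """Return the seconds field of the first Timestamp(sec, inc) on the line."""
    match = re.search(r"Timestamp\((\d+), \d+\)", line)
    if match is None:
        raise ValueError(f"no Timestamp in line: {line!r}")
    return int(match.group(1))


ours, theirs, common = (first_ts_seconds(line) for line in LOG.splitlines())

end_of_log_diff = ours - theirs   # same value as "diff in end of log times"
local_divergence = ours - common  # seconds of local oplog past the common point

print(end_of_log_diff)   # -3315
print(local_divergence)  # 234
```

So this node was about 55 minutes behind its sync source, and roughly 234 seconds of its own writes had to be rolled back.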

I searched around online and found no fix for it.

How a rollback can fail is covered in articles by other people, so I won't repeat it here.

Resolution

https://jira.mongodb.org/browse/SERVER-47918

Under condition #1 the 300MB rollback limit is no longer enforced post-4.0

So according to this, the limit no longer applies after 4.0…

Could it be a hard-coded limit, then? And if it is, could it be lifted by changing the code?

Pull the source and take a look:

$ git checkout r3.6.21

Open src/mongo/db/repl/rs_rollback.cpp in vim and go to line 1028:

            // Checks that the total amount of data that needs to be refetched is at most
            // 300 MB. We do not roll back more than 300 MB of documents in order to
            // prevent out of memory errors from too much data being stored. See SERVER-23392.
            if (totalSize >= 300 * 1024 * 1024) {
                throw RSFatalException("replSet too much data to roll back.");
            }

It really is just a hard-coded limit, so I commented out this if block and rebuilt mongod.
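To make the threshold concrete, here is a minimal Python sketch of the logic in the quoted check. It is not MongoDB's code: the names `ROLLBACK_LIMIT_BYTES`, `TooMuchDataToRollBack` and `total_refetch_size` are invented for illustration, and it assumes the check is applied to a running total as document sizes are added up.

```python
# Same threshold as the quoted C++ check: 300 * 1024 * 1024 bytes.
ROLLBACK_LIMIT_BYTES = 300 * 1024 * 1024


class TooMuchDataToRollBack(Exception):
    """Hypothetical stand-in for the RSFatalException thrown in the snippet."""


def total_refetch_size(doc_sizes):
    """Add up document sizes, failing once the running total reaches the limit."""
    total = 0
    for size in doc_sizes:
        total += size
        # Mirrors `totalSize >= 300 * 1024 * 1024` from the quoted source.
        if total >= ROLLBACK_LIMIT_BYTES:
            raise TooMuchDataToRollBack("replSet too much data to roll back.")
    return total


print(ROLLBACK_LIMIT_BYTES)  # 314572800
```

Because the comparison is `>=`, a total of exactly 314572800 bytes already fails. Commenting out the if block removes this guard entirely, along with the out-of-memory protection that the source comment says it was added for.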

I won't go through the build process here. With the rebuilt mongod, the problem was resolved.