Determining Whether a Standby Needs to Be Rebuilt

Digest   2024-11-14 17:31   Guangdong

  • Prerequisites

    • A traditional primary/standby cluster with one primary and n standbys

Introduction to gs_ctl build -b check

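build -b check is a command provided by openGauss for checking whether a standby needs to be rebuilt. Once a failed standby has been recovered, you can run this command to find out whether a rebuild is required. build check returns one of three results: incremental build, full build, or no build needed. auto build uses the same check logic as build check; the only difference is that auto build then runs the build command automatically.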
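As a rough illustration of how the three results map to a follow-up action, the sketch below turns a verdict string into the command an operator might run next. It assumes the verdict strings that appear after "Build check result :" in the sample output in this article, and the standard `full` and `incremental` modes of `gs_ctl build -b`. The name `follow_up_command` is hypothetical and not part of openGauss.

```python
from typing import Dict, List, Optional

# Verdict strings as printed after "Build check result :" in the sample logs.
# The mapped commands assume the standard gs_ctl build modes "full" and
# "incremental"; a verdict of "needless build" requires no action.
FOLLOW_UP: Dict[str, Optional[List[str]]] = {
    "needless build": None,
    "incremental build": ["gs_ctl", "build", "-b", "incremental"],
    "full build": ["gs_ctl", "build", "-b", "full"],
}


def follow_up_command(verdict: str) -> Optional[List[str]]:
    """Return the argv of the suggested rebuild command, or None if no build is needed."""
    key = verdict.strip().lower()
    if key not in FOLLOW_UP:
        raise ValueError(f"unrecognized build check verdict: {verdict!r}")
    command = FOLLOW_UP[key]
    # Return a copy so callers cannot mutate the shared table.
    return None if command is None else list(command)
```

This only mirrors by hand what the article says auto build does in one step, so in practice auto build may be the simpler choice.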

Procedure

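1. Read the checkpoint (ckpt) recorded in pg_control on both the primary and the standby.
2. Starting from that checkpoint, search for the latest common point, that is, the last position before the two logs diverge.
3. If no common point can be found, the primary's logs have already been recycled and a full build is required.
4. If the latest common point is found and it equals the standby's ckptrec, the logs have not diverged and the standby is merely lagging, so no build is needed.
5. If a divergence point is found and it is not the standby's latest checkpoint, an incremental build is required.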
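The decision in the steps above can be sketched as a small Python function. This is a simplified model for illustration only, not the gs_rewind implementation. It assumes the standby's records are supplied newest first as (LSN, CRC) pairs, and that a hypothetical lookup `primary_crc_at` returns the primary's CRC for an LSN, or None when the primary no longer has that log. The names `Verdict` and `build_check` are made up for this example.

```python
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class Verdict(Enum):
    NEEDLESS = "needless build"
    INCREMENTAL = "incremental build"
    FULL = "full build"


def build_check(
    standby_last_ckpt: str,
    standby_records: Sequence[Tuple[str, int]],
    primary_crc_at: Callable[[str], Optional[int]],
) -> Tuple[Verdict, Optional[str]]:
    """Walk back through the standby's records until the primary agrees on one.

    standby_records: (lsn, crc) pairs from the standby, newest first.
    primary_crc_at:  hypothetical lookup; returns None if the primary
                     has already recycled the log containing that LSN.
    """
    for lsn, standby_crc in standby_records:
        if primary_crc_at(lsn) == standby_crc:
            # Common point found: the logs agree up to this LSN.
            if lsn == standby_last_ckpt:
                # No divergence, the standby is only behind.
                return Verdict.NEEDLESS, lsn
            return Verdict.INCREMENTAL, lsn
    # No common point could be found.
    return Verdict.FULL, None


# Values taken from the incremental-build sample log in this article.
primary = {"0/180036A0": 1158223492, "0/18003580": 3680505096, "0/18003460": 545018517}
records = [("0/180036A0", 3927131982), ("0/18003580", 799574682), ("0/18003460", 545018517)]
print(build_check("0/18003860", records, primary.get))
```

With those values the walk rejects the first two records because the CRCs differ, matches at 0/18003460, and reports an incremental build, which agrees with the sample log.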

Sample Output

1. Full build. Here some xlog files were deleted manually to simulate the logs having been recycled.
[czk@openGauss82 ~]$ gs_ctl build -b check
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: gs_ctl build check ,datadir is /opt/czk/install/data/dn
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: fopen build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: fprintf build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: fsync build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: stop failed, killing gaussdb by force ...
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/opt/czk/install/data/dn")  print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/opt/czk/install/data/dn]
[2024-10-11 09:15:54.637][1678748][][gs_ctl]: server stopped
[2024-10-11 09:15:54.638][1678748][][gs_ctl]: current workdir is (/home/czk).
[2024-10-11 09:15:54.640][1678748][dn_6001_6002][gs_ctl]: Get repl_auth_mode is  and repl_uuid is
[2024-10-11 09:15:54.680][1678748][dn_6001_6002][gs_ctl]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.750][1678748][dn_6001_6002][gs_rewind]: connected to server: host=20.20.20.79 port=19219 dbname=postgres application_name=gs_rewind connect_timeout=5 rw_timeout=600
[2024-10-11 09:15:54.754][1678748][dn_6001_6002][gs_rewind]: connect to primary success
[2024-10-11 09:15:54.754][1678748][dn_6001_6002][gs_rewind]: find last checkpoint at 0/18003860 and checkpoint redo at 0/18003860 from target control file
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: get primary pg_control success
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: target server was interrupted in mode 1.
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: sanityChecks success
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: find last checkpoint at 0/180036A0 and checkpoint redo at 0/18003620 from source control file
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: find max lsn success, find max lsn rec (0/18003860) success.
[2024-10-11 09:15:54.756][1678748][dn_6001_6002][gs_rewind]: Get repl_auth_mode is  and repl_uuid is
[2024-10-11 09:15:54.795][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.795][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/180036A0 and its crc(source, target):[1158223492, 3927131982]
[2024-10-11 09:15:54.840][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.840][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/18003580 and its crc(source, target):[3680505096, 799574682]
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/18003460 and its crc(source, target):[545018517, 545018517]
[2024-10-11 09:18:51.453][1755902][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:18:51.453][1755902][dn_6001_6002][gs_rewind]: request lsn is 0/160002E8 and its crc(source, target):[0, 1075449653]
[2024-10-11 09:18:51.492][1755902][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:18:51.492][1755902][dn_6001_6002][gs_rewind]: request lsn is 0/160001C8 and its crc(source, target):[0, 649075532]
[2024-10-11 09:18:51.518][1755902][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:18:51.518][1755902][dn_6001_6002][gs_rewind]: request lsn is 0/160000A8 and its crc(source, target):[0, 1029292914]
……
[2024-10-11 09:18:51.519][1755902][dn_6001_6002][gs_rewind]: could not find previous WAL record at 0/15000058: read xlog page failed at 0/15000058
gs_rewind receive FATAL, it will exit
[2024-10-11 09:18:51.519][1755902][dn_6001_6002][gs_rewind]: Build check result : full build
[2024-10-11 09:18:51.519][1755902][dn_6001_6002][gs_rewind]: build check failed(/opt/czk/install/data/dn).
2. Incremental build
[czk@openGauss82 ~]$ gs_ctl build -b check
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: gs_ctl build check ,datadir is /opt/czk/install/data/dn
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: fopen build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.559][1678748][][gs_ctl]: fprintf build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: fsync build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: stop failed, killing gaussdb by force ...
[2024-10-11 09:15:54.587][1678748][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/opt/czk/install/data/dn")  print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/opt/czk/install/data/dn]
[2024-10-11 09:15:54.637][1678748][][gs_ctl]: server stopped
[2024-10-11 09:15:54.638][1678748][][gs_ctl]: current workdir is (/home/czk).
[2024-10-11 09:15:54.640][1678748][dn_6001_6002][gs_ctl]: Get repl_auth_mode is  and repl_uuid is
[2024-10-11 09:15:54.680][1678748][dn_6001_6002][gs_ctl]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.750][1678748][dn_6001_6002][gs_rewind]: connected to server: host=20.20.20.79 port=19219 dbname=postgres application_name=gs_rewind connect_timeout=5 rw_timeout=600
[2024-10-11 09:15:54.754][1678748][dn_6001_6002][gs_rewind]: connect to primary success
[2024-10-11 09:15:54.754][1678748][dn_6001_6002][gs_rewind]: find last checkpoint at 0/18003860 and checkpoint redo at 0/18003860 from target control file
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: get primary pg_control success
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: target server was interrupted in mode 1.
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: sanityChecks success
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: find last checkpoint at 0/180036A0 and checkpoint redo at 0/18003620 from source control file
[2024-10-11 09:15:54.755][1678748][dn_6001_6002][gs_rewind]: find max lsn success, find max lsn rec (0/18003860) success.
[2024-10-11 09:15:54.756][1678748][dn_6001_6002][gs_rewind]: Get repl_auth_mode is  and repl_uuid is
[2024-10-11 09:15:54.795][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.795][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/180036A0 and its crc(source, target):[1158223492, 3927131982]
[2024-10-11 09:15:54.840][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.840][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/18003580 and its crc(source, target):[3680505096, 799574682]
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: request lsn is 0/18003460 and its crc(source, target):[545018517, 545018517]
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: find common checkpoint 0/18003460
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: find diverge point success
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: Build check result : incremental build
[2024-10-11 09:15:54.869][1678748][dn_6001_6002][gs_rewind]: build check completed(/opt/czk/install/data/dn).
3. No build needed
[czk@openGauss82 ~]$ gs_ctl build -b check
[2024-10-14 14:05:31.218][2966707][][gs_ctl]: gs_ctl build check ,datadir is /opt/czk/install/data/dn
[2024-10-14 14:05:31.218][2966707][][gs_ctl]: fopen build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-14 14:05:31.218][2966707][][gs_ctl]: fprintf build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-14 14:05:31.239][2966707][][gs_ctl]: fsync build pid file "/opt/czk/install/data/dn/gs_build.pid" success
[2024-10-14 14:05:31.239][2966707][][gs_ctl]: stop failed, killing gaussdb by force ...
[2024-10-14 14:05:31.239][2966707][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/opt/czk/install/data/dn")  print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/opt/czk/install/data/dn]
[2024-10-14 14:05:31.290][2966707][][gs_ctl]: server stopped
[2024-10-14 14:05:31.290][2966707][][gs_ctl]: current workdir is (/home/czk).
[2024-10-14 14:05:31.292][2966707][dn_6001_6002][gs_ctl]: Get repl_auth_mode is  and repl_uuid is
[2024-10-14 14:05:31.322][2966707][dn_6001_6002][gs_ctl]: build try host(20.20.20.79) port(19219) success
[2024-10-14 14:05:31.391][2966707][dn_6001_6002][gs_rewind]: connected to server: host=20.20.20.79 port=19219 dbname=postgres application_name=gs_rewind connect_timeout=5 rw_timeout=600
[2024-10-14 14:05:31.398][2966707][dn_6001_6002][gs_rewind]: connect to primary success
[2024-10-14 14:05:31.398][2966707][dn_6001_6002][gs_rewind]: find last checkpoint at 0/2F4C6AE0 and checkpoint redo at 0/2F4C6A60 from target control file
[2024-10-14 14:05:31.399][2966707][dn_6001_6002][gs_rewind]: get primary pg_control success
[2024-10-14 14:05:31.399][2966707][dn_6001_6002][gs_rewind]: target server was interrupted in mode 2.
[2024-10-14 14:05:31.399][2966707][dn_6001_6002][gs_rewind]: sanityChecks success
[2024-10-14 14:05:31.399][2966707][dn_6001_6002][gs_rewind]: find last checkpoint at 0/2F4C6AE0 and checkpoint redo at 0/2F4C6A60 from source control file
[2024-10-14 14:05:31.411][2966707][dn_6001_6002][gs_rewind]: find max lsn success, find max lsn rec (0/2F4C6AE0) success.
[2024-10-14 14:05:31.411][2966707][dn_6001_6002][gs_rewind]: Get repl_auth_mode is  and repl_uuid is
[2024-10-14 14:05:31.437][2966707][dn_6001_6002][gs_rewind]: build try host(20.20.20.79) port(19219) success
[2024-10-14 14:05:31.437][2966707][dn_6001_6002][gs_rewind]: request lsn is 0/2F4C6AE0 and its crc(source, target):[757210003, 757210003]
[2024-10-14 14:05:31.437][2966707][dn_6001_6002][gs_rewind]: find common checkpoint 0/2F4C6AE0
[2024-10-14 14:05:31.437][2966707][dn_6001_6002][gs_rewind]: find diverge point success
[2024-10-14 14:05:31.437][2966707][dn_6001_6002][gs_rewind]: Build check result : needless build
[2024-10-14 14:05:31.438][2966707][dn_6001_6002][gs_rewind]: build check completed(/opt/czk/install/data/dn).
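To read output like the three samples above programmatically, one option is to pull out the CRC comparisons and the final verdict with regular expressions. The sketch below is a minimal parser, assuming the message formats stay exactly as shown in these logs ("request lsn is ... and its crc(source, target):[...]" and "Build check result : ..."). The names `CrcProbe` and `parse_build_check` are hypothetical helpers, not part of openGauss.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Patterns copied from the message formats in the sample logs.
CRC_RE = re.compile(
    r"request lsn is (\S+) and its crc\(source, target\):\[(\d+), (\d+)\]"
)
RESULT_RE = re.compile(
    r"Build check result\s*:\s*(full build|incremental build|needless build)"
)


class CrcProbe(NamedTuple):
    lsn: str
    source_crc: int  # CRC reported for the source (primary) side
    target_crc: int  # CRC reported for the target (local standby) side

    @property
    def matches(self) -> bool:
        return self.source_crc == self.target_crc


def parse_build_check(log_text: str) -> Tuple[Optional[str], List[CrcProbe]]:
    """Return (verdict, probes); verdict is None if no result line is present."""
    probes = [
        CrcProbe(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in CRC_RE.finditer(log_text)
    ]
    results = RESULT_RE.findall(log_text)
    # Use the last result line in case the text holds more than one run.
    verdict = results[-1] if results else None
    return verdict, probes
```

Because the patterns are searched across the whole text rather than line by line, the parser also works on output whose line breaks were lost when it was copied.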
