KingbaseES V8R6 Cluster Management and Operations Case: repmgr standby switchover Failure

Case description:

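Running "repmgr standby switchover" on the standby of a KingbaseES V8R6 cluster failed, and fixing it required a "repmgr standby follow" operation along the way. This case records the troubleshooting process in detail.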

Applicable version:

KingbaseES V8R6

Cluster node information:

I. Running switchover on the standby

1. Run the switchover

[kingbase@node101 bin]$ ./repmgr standby switchover -h 192.168.1.102 -U esrep -d esrep
WARNING: following problems with command line parameters detected:
  database connection parameters not required when executing UNKNOWN ACTION
NOTICE: executing switchover on node "node101" (ID: 1)
ERROR: local node "node101" (ID: 1) is not a downstream of demotion candidate primary "node102" (ID: 2)
DETAIL: local node has no registered upstream node
HINT: execute "repmgr standby register --force" to update the local node's metadata

2. Switchover failure message

3. Check the cluster node status

[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | standby |   running |          | default  | 100      | 6        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | primary | * running |          | default  | 100      | 6        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=As shown above, the Upstream of the standby node is empty, so the switchover cannot be performed.=
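The empty Upstream column can also be detected mechanically before attempting a switchover. The following is a minimal sketch, assuming the table layout printed by "repmgr cluster show" above; the function name find_orphan_standbys is hypothetical and not part of repmgr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every standby whose Upstream
# column is empty. Reads "repmgr cluster show" table output on stdin.
find_orphan_standbys() {
  awk -F'|' '
    NF >= 5 {
      name = $2; role = $3; upstream = $5
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the node name
      gsub(/[ \t]/, "", role)             # role and upstream hold no inner spaces
      gsub(/[ \t]/, "", upstream)
      if (role == "standby" && upstream == "") print name
    }'
}

# Usage (run from the repmgr bin directory):
#   ./repmgr cluster show | find_orphan_standbys
```

Fed the cluster status shown above, this sketch would print node101, the standby that has no registered upstream.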

II. Setting the standby node's upstream (repmgr standby follow)

1. Run "repmgr standby follow"

[kingbase@node101 bin]$ ./repmgr standby follow -h 192.168.1.102 -U esrep -d esrep
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 1/CE004F50, follow target lsn is 1/CE004F50
ERROR: slot "repmgr_slot_1" already exists as an active slot

NOTICE: STANDBY FOLLOW failed
DETAIL: slot "repmgr_slot_1" already exists as an active slot

# The standby's replication slot is in the active state

test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | t      |       8596 | 1917 |              | 1/CE005038  |
(1 row)
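For scripted checks, only three columns of that view matter here: slot_name, active and active_pid. The sketch below assumes the rows arrive as unaligned, tuples-only text ("slot_name|active|active_pid"); the function name report_active_slots, the client name ksql and its -At flags are assumptions, since the article only shows the "test=#" prompt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "slot_name|active|active_pid" rows on stdin and
# report every slot that is still held by a process.
report_active_slots() {
  while IFS='|' read -r slot active pid; do
    if [ "$active" = "t" ]; then
      printf 'slot %s is active (pid %s)\n' "$slot" "$pid"
    fi
  done
  return 0
}

# Usage (assumed client name and flags; "$HOST" is a placeholder):
#   ksql -h "$HOST" -p 54321 -U system -d test -At \
#     -c "select slot_name, active, active_pid from sys_replication_slots" \
#     | report_active_slots
```

A slot reported as active cannot be dropped until the process holding it has gone away, which is why the next step stops the database first.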

2. Stop the database and drop the standby's replication slot

# Stop the standby database service
[kingbase@node101 bin]$ ./sys_ctl stop -D ../data
waiting for server to shut down.... done
server stopped

# Comment out the slot parameter in kingbase.auto.conf
[kingbase@node101 bin]$ cat ../data/kingbase.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
enable_upper_colname = 'on'
client_idle_timeout = '0'
synchronous_standby_names = ''
wal_retrieve_retry_interval = '5000'
primary_conninfo = 'user=system connect_timeout=10 host=192.168.1.102 port=54321 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 application_name=node101'
recovery_target_timeline = 'latest'
# primary_slot_name = 'repmgr_slot_1'
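The edit above was made by hand. A minimal sketch of the same change as a repeatable command follows, assuming a sed that accepts an attached -i backup suffix; the function name comment_out_param is hypothetical. Note that the file header warns against manual edits, so a backup copy is kept.

```shell
#!/usr/bin/env bash
# Hypothetical helper: comment out one "name = value" line in a config file.
# A backup of the original is written next to it as <file>.bak.
comment_out_param() {
  local param="$1" file="$2"
  sed -i.bak "s/^[[:space:]]*${param}[[:space:]]*=/# &/" "$file"
}

# Usage (with the database stopped, from the bin directory):
#   comment_out_param primary_slot_name ../data/kingbase.auto.conf
```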

# Check the slot status
test=# select * from sys_replication_slots;
   slot_name   | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
---------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 repmgr_slot_1 |        | physical  |        |          | f         | f      |            | 1922 |              | 1/CE005038  |
(1 row)

# Drop the standby's replication slot
test=# select sys_drop_replication_slot('repmgr_slot_1');
 sys_drop_replication_slot
---------------------------

(1 row)
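The drop above succeeds only because the slot had already become inactive (active = f). A guarded variant that drops the slot only while it is inactive can be sketched as follows; the function name build_drop_slot_sql is hypothetical, and the statement uses only the view and function that appear in the session above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a SQL statement that drops the named slot only
# if it is currently inactive. Slot names are restricted to lowercase
# letters, digits and underscores, so anything else is rejected.
build_drop_slot_sql() {
  local slot="$1"
  case "$slot" in
    ''|*[!a-z0-9_]*) echo "invalid slot name: $slot" >&2; return 1 ;;
  esac
  printf "select sys_drop_replication_slot(slot_name) from sys_replication_slots where slot_name = '%s' and not active;\n" "$slot"
}

# Usage: paste the printed statement at the test=# prompt.
#   build_drop_slot_sql repmgr_slot_1
```

If the slot is still active, the statement matches no rows and returns an empty result instead of raising an error.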

3. Start the database service and run "repmgr standby follow"

[kingbase@node101 bin]$ ./sys_ctl start -D ../data
waiting for server to start....2022-08-09 10:39:50.600 CST [6829] WARNING:  enable_upper_colname can only be opened
......
server started

[kingbase@node101 bin]$ ./repmgr standby follow -h 192.168.1.102 -U esrep -d esrep
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 1/CE0052E0, follow target lsn is 1/CE0052E0
NOTICE: setting node 1's upstream to node 2
NOTICE: begin to stopp server at 2022-08-09 10:39:55.101228
NOTICE: stopping server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile -w -t 90 -m fast stop"
NOTICE: stopp server finish at 2022-08-09 10:39:55.205646
NOTICE: begin to start server at 2022-08-09 10:39:55.205705
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-08-09 10:39:55.316793
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node102" (ID: 2)

# The cluster node status is normal
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | standby |   running | node102  | default  | 100      | 6        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | primary | * running |          | default  | 100      | 6        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
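The follow output compares two LSNs ("local node lsn is 1/CE0052E0, follow target lsn is 1/CE0052E0") to confirm that the standby is not ahead of its target. An LSN of the form HI/LO is two hexadecimal numbers, with HI being the upper 32 bits of a 64-bit position. The sketch below illustrates that arithmetic; lsn_to_int and lsn_diff are hypothetical helper names, not repmgr or KingbaseES commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert an LSN such as 1/CE0052E0 to a byte position.
lsn_to_int() {
  case "$1" in
    */*/*|*[!0-9A-Fa-f/]*|/*|*/) echo "invalid LSN: $1" >&2; return 1 ;;
    */*) ;;
    *) echo "invalid LSN: $1" >&2; return 1 ;;
  esac
  local hi="${1%%/*}" lo="${1#*/}"
  # HI is the upper 32 bits, LO the lower 32 bits.
  echo $(( 0x$hi * 4294967296 + 0x$lo ))
}

# Hypothetical helper: byte distance between two LSNs (first minus second).
lsn_diff() {
  local a b
  a=$(lsn_to_int "$1") || return 1
  b=$(lsn_to_int "$2") || return 1
  echo $(( a - b ))
}

# Usage:
#   lsn_diff 1/CE0052E0 1/CE0052E0
```

A result of 0 means both nodes are at the same position, which matches the "this server is not ahead" message in the output above.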

III. Running "repmgr standby switchover"

[kingbase@node101 bin]$ ./repmgr standby switchover -h 192.168.1.102 -U esrep -d esrep
WARNING: following problems with command line parameters detected:
  database connection parameters not required when executing UNKNOWN ACTION
NOTICE: executing switchover on node "node101" (ID: 1)
INFO: The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=0
"
INFO: pausing repmgrd on node "node101" (ID 1)
INFO: pausing repmgrd on node "node102" (ID 2)
NOTICE: local node "node101" (ID: 1) will be promoted to primary; current primary "node102" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node102" (ID: 2)
NOTICE: issuing CHECKPOINT
DETAIL: executing server command "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile -W -m fast stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 1/D0000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node101" (ID: 1) using sys_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node101" (ID: 1) was successfully promoted to primary
NOTICE: issuing CHECKPOINT
INFO: local node 2 can attach to rejoin target node 1
DETAIL: local node's recovery point: 1/D0000028; rejoin target node's fork point: 1/D00000A0
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: begin to start server at 2022-08-09 10:46:36.382548
NOTICE: starting server using "/home/kingbase/cluster/R6HA/kha/kingbase/bin/sys_ctl  -w -t 90 -D '/home/kingbase/cluster/R6HA/kha/kingbase/data' -l /home/kingbase/cluster/R6HA/kha/kingbase/bin/logfile start"
NOTICE: start server finish at 2022-08-09 10:46:36.488870
NOTICE: replication slot "repmgr_slot_1" deleted on node 2
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1
NOTICE: switchover was successful
DETAIL: node "node101" is now primary and node "node102" is attached as standby
INFO: unpausing repmgrd on node "node101" (ID 1)
INFO: unpause node "node101" (ID 1) successfully
INFO: unpausing repmgrd on node "node102" (ID 2)
INFO: unpause node "node102" (ID 2) successfully
NOTICE: STANDBY SWITCHOVER has completed successfully

# Cluster node status information
[kingbase@node101 bin]$ ./repmgr cluster show
 ID | Name    | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------+---------+-----------+----------+----------+----------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------
 1  | node101 | primary | * running |          | default  | 100      | 7        | host=192.168.1.101 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node102 | standby |   running | node101  | default  | 100      | 7        | host=192.168.1.102 user=system dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3

=As shown above, the switchover has completed.=
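The final state can be verified the same way it was inspected: exactly one primary, and every standby pointing at it. This is a minimal sketch under the assumption that "repmgr cluster show" keeps the column layout shown above; check_topology is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the node named in $1 is the single
# primary and every standby lists it as Upstream.
# Reads "repmgr cluster show" table output on stdin.
check_topology() {
  awk -F'|' -v want="$1" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF >= 5 {
      name = trim($2); role = trim($3); up = trim($5)
      if (role == "primary") { primaries++; primary = name }
      else if (role == "standby" && up != want) bad++
    }
    END { exit !(primaries == 1 && primary == want && bad == 0) }'
}

# Usage:
#   ./repmgr cluster show | check_topology node101 && echo "topology OK"
```

Against the status before the fix (node101 as a standby with no upstream) the same check with node102 as the expected primary would fail, which is the condition that blocked the first switchover attempt.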