KingbaseES V8R3 Cluster Operations Case: Analyzing the Failover Process After the Primary Node's OS Goes Down

Case description:

When a failover occurs in a KingbaseES V8R3 cluster, the kingbasecluster processes on both nodes are triggered, but only one of them actually calls the script that performs the switchover; the other node logs corresponding messages and only walks through the state-change logic without invoking the script. To determine whether the primary or the standby node ran the script, search each node's cluster.log for the keyword failover_stream.sh. In this case study, the primary node's operating system is shut down and the resulting failover process is analyzed.

Applicable version:

KingbaseES V8R3

Cluster node status:

# Cluster node status
TEST=# show pool_nodes;
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | primary | 1          | true              | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | standby | 0          | false             | 0
(2 rows)

How to analyze a KingbaseES V8R3 cluster failover:

1. The cluster.log of the primary and standby nodes records the entire switchover process (a quick way to check which node ran the switchover script is shown after this list).
2. failover.log records the execution of the failover itself; note that only the node that runs failover_stream.sh produces a failover.log.
3. recovery.log records how, after the failover completes, the old primary is recovered into the new standby.
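
As noted above, the quickest way to confirm which node actually executed the switchover script is to search each node's cluster.log for the script name; a minimal check (the log path is an assumption for this deployment):

# run on both nodes; only the node that performed the failover prints an
# "execute command: ... failover_stream.sh ..." line
grep -n "failover_stream.sh" /home/kingbase/cluster/R3HA/kingbasecluster/log/cluster.log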

I. Analysis of the failover process

Shut down the primary node's operating system:

[root@node101 ~]# shutdown -h now

Analysis of the standby's cluster.log (with the primary's OS shut down, the standby invokes failover_stream.sh to perform the failover):

1. While both nodes are healthy, the standby receives heartbeats from the primary.

2022-08-30 15:38:03: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 0 seconds ago
2022-08-30 15:38:06: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:38:06: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 0 seconds ago

2. The standby's kingbasecluster detects that the primary's cluster service and database service have shut down; the watchdog switches roles, and the local kingbasecluster changes from standby to master.

2022-08-30 15:38:09: pid 17585: LOG:  remote node "192.168.1.101:9999 Linux node101" is shutting down
2022-08-30 15:38:09: pid 17585: LOG:  watchdog cluster has lost the coordinator node
2022-08-30 15:38:09: pid 17585: LOG:  unassigning the remote node "192.168.1.101:9999 Linux node101" from watchdog cluster master
2022-08-30 15:38:09: pid 17585: LOG:  We have lost the cluster master node "192.168.1.101:9999 Linux node101"
2022-08-30 15:38:09: pid 17585: LOG:  watchdog node state changed from [STANDBY] to [JOINING]
2022-08-30 15:38:09: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:38:09: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 1 seconds ago
2022-08-30 15:38:10: pid 17547: ERROR:  failed to authenticate
2022-08-30 15:38:10: pid 17547: DETAIL:  the database system is shutting down
2022-08-30 15:38:12: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:38:12: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 4 seconds ago

3. The standby's kingbasecluster keeps retrying the connection to the primary's database service, within the window defined by health_check_max_retries * health_check_retry_delay (see the configuration check after the log excerpt).

2022-08-30 15:38:13: pid 17547: LOG:  health checking retry count 1
2022-08-30 15:38:13: pid 17585: LOG:  watchdog node state changed from [JOINING] to [INITIALIZING]
2022-08-30 15:38:14: pid 17585: LOG:  I am the only alive node in the watchdog cluster
2022-08-30 15:38:14: pid 17585: HINT:  skipping stand for coordinator state
2022-08-30 15:38:14: pid 17585: LOG:  watchdog node state changed from [INITIALIZING] to [MASTER]
2022-08-30 15:38:14: pid 17585: LOG:  I am announcing my self as master/coordinator watchdog node
2022-08-30 15:38:15: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:38:15: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 7 seconds ago
2022-08-30 15:38:18: pid 17585: LOG:  I am the cluster leader node
2022-08-30 15:38:18: pid 17585: DETAIL:  our declare coordinator message is accepted by all nodes
2022-08-30 15:38:18: pid 17585: LOG:  setting the local node "192.168.1.102:9999 Linux node102" as watchdog cluster master
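
The length of this retry phase is health_check_max_retries * health_check_retry_delay. A quick way to inspect the effective values (the configuration file path is an assumption for this deployment; the log shows the retry count reaching 10 here, while the delay value in the comment is only illustrative):

grep -E "^health_check_(period|max_retries|retry_delay)" \
     /home/kingbase/cluster/R3HA/kingbasecluster/etc/kingbasecluster.conf
# health_check_max_retries = 10
# health_check_retry_delay = 6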

4. The standby starts the escalation process, and the cluster VIP floats to the new master node.

2022-08-30 15:38:18: pid 17585: LOG:  I am the cluster leader node. Starting escalation process
2022-08-30 15:38:18: pid 17585: LOG:  escalation process started with PID:22088
2022-08-30 15:38:18: pid 17547: LOG:  connect_inet_domain_socket: select() interrupted by certain signal. retrying...
2022-08-30 15:38:18: pid 22088: LOG:  watchdog: escalation started
2022-08-30 15:38:18: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:38:18: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 10 seconds ago
2022-08-30 15:38:21: pid 22088: LOG:  successfully acquired the delegate IP:"192.168.1.201"
2022-08-30 15:38:21: pid 22088: DETAIL:  'if_up_cmd' returned with success
.......

5. Once the connection attempts to the old primary's database service exceed the threshold, the standby marks that backend node as down.

2022-08-30 15:39:20: pid 17547: LOG:  health checking retry count 10
2022-08-30 15:39:22: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:39:22: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 73 seconds ago
2022-08-30 15:39:23: pid 17547: LOG:  failed to connect to kingbase server on "192.168.1.101:54321", getsockopt() detected error "No route to host"
2022-08-30 15:39:23: pid 17547: ERROR:  failed to make persistent db connection
2022-08-30 15:39:23: pid 17547: DETAIL:  connection to host:"192.168.1.101:54321" failed
2022-08-30 15:39:23: pid 17547: LOG:  setting backend node 0 status to NODE DOWN

6. The failover request issued by the local kingbasecluster is received, and the failover lock is acquired.

2022-08-30 15:39:23: pid 17547: LOG:  received degenerate backend request for node_id: 0 from pid [17547]
2022-08-30 15:39:23: pid 17585: LOG:  new IPC connection received
2022-08-30 15:39:23: pid 17585: LOG:  watchdog received the failover command from local kingbasecluster on IPC interface
2022-08-30 15:39:23: pid 17585: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local kingbasecluster on IPC interface
2022-08-30 15:39:23: pid 17585: LOG:  I am the only kingbasecluster node in the watchdog cluster
2022-08-30 15:39:23: pid 17585: DETAIL:  no need to propagate the failover command [DEGENERATE_BACKEND_REQUEST]
2022-08-30 15:39:23: pid 17585: LOG:  new IPC connection received
2022-08-30 15:39:23: pid 17585: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2022-08-30 15:39:23: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to become a lock holder for failover ID: 0

7. Holding the failover lock, it executes failover_stream.sh to perform the switchover.

2022-08-30 15:39:23: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" is the lock holder
2022-08-30 15:39:23: pid 17547: LOG:  starting degeneration. shutdown host 192.168.1.101(54321)
2022-08-30 15:39:23: pid 17547: LOG:  Restart all children
2022-08-30 15:39:23: pid 17547: LOG:  execute command: /home/kingbase/cluster/R3HA/kingbasecluster/bin/failover_stream.sh 192.168.1.102 0 0 192.168.1.101 192.168.1.101 1 0 /home/kingbase/cluster/R3HA/db/data
2022-08-30 15:39:25: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:39:25: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 76 seconds ago
ssh: connect to host 192.168.1.101 port 22: No route to host
2022-08-30 15:39:31: pid 17612: LOG:  watchdog checking if kingbasecluster is alive using heartbeat
2022-08-30 15:39:31: pid 17612: DETAIL:  the last heartbeat from "192.168.1.101:9999" received 82 seconds ago
2022-08-30 15:39:33: pid 17585: LOG:  new IPC connection received
2022-08-30 15:39:33: pid 17585: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2022-08-30 15:39:33: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to release [FAILOVER] lock for failover ID 0

8. The failover completes and a new primary is located (the former standby is now the primary).

2022-08-30 15:39:33: pid 17547: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2022-08-30 15:39:33: pid 17547: LOG:  find_primary_node: checking backend no 0
2022-08-30 15:39:33: pid 17547: LOG:  find_primary_node: checking backend no 1
2022-08-30 15:39:33: pid 17547: LOG:  find_primary_node: primary node id is 1
2022-08-30 15:39:33: pid 17585: LOG:  new IPC connection received
2022-08-30 15:39:33: pid 17585: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2022-08-30 15:39:33: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to release [FOLLOW MASTER] lock for failover ID 0
2022-08-30 15:39:33: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" has released the [FOLLOW MASTER] lock for failover ID 0
2022-08-30 15:39:33: pid 17547: LOG:  failover: set new primary node: 1
2022-08-30 15:39:33: pid 17547: LOG:  failover: set new master node: 1
2022-08-30 15:39:33: pid 21501: LOG:  worker process received restart request
2022-08-30 15:39:33: pid 17585: LOG:  new IPC connection received
2022-08-30 15:39:33: pid 17585: LOG:  received the failover command lock request from local kingbasecluster on IPC interface
2022-08-30 15:39:33: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" is requesting to resign from a lock holder for failover ID 0
2022-08-30 15:39:33: pid 17585: LOG:  local kingbasecluster node "192.168.1.102:9999 Linux node102" has resigned from the lock holder
failover done. shutdown host 192.168.1.101(54321)2022-08-30 15:39:33: pid 17547: LOG:  failover done. shutdown host 192.168.1.101(54321)

Analysis of the standby's failover.log:

1. failover_stream.sh is invoked to perform the failover (the corresponding failover_command template is sketched after the Tips below).

-----------------2022-08-30 15:39:23 failover beging---------------------------------------
----failover-stats is %H = hostname of the new master node [192.168.1.102], %P = old primary node id [0], %d = node id[0], %h = host name [192.168.1.101], %O = old primary host[192.168.1.101] %m = new master node id [1], %M = old master node id [0], %D = database cluster path [/home/kingbase/cluster/R3HA/db/data].

Tips:
     1)new_master : 192.168.1.102
     2) new_master_id:   1
     3) old_primary: 192.168.1.101
     4) old_primary_id : 0
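
The argument order of the executed command follows the failover_command template in the kingbasecluster configuration. A hedged example of what that setting typically looks like (the placeholders match the substitutions printed above; the exact line and file location are assumptions):

failover_command = '/home/kingbase/cluster/R3HA/kingbasecluster/bin/failover_stream.sh %H %P %d %h %O %m %M %D'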

2. Ping the gateway to verify network connectivity.

----ping trust ip
ping trust ip 192.168.1.1 success

3. Connect to the old primary over ssh to stop its database service and release the db VIP (this fails here because the host is down), then bind the db VIP on the current node (a quick way to verify the VIP afterwards is shown after the log excerpt).

----determine whether the faulty db is master or standby
master down, let 192.168.1.102 become new primary.....
 2022-08-30 15:39:25 del old primary VIP on 192.168.1.101
ssh connect host:192.168.1.101 failed, can not stop old primary db and del old primary db vip
2022-08-30 15:39:29 add VIP on 192.168.1.102
ADD VIP NOW AT 2022-08-30 15:39:30 ON enp0s3
execute: [/sbin/ip addr add 192.168.1.200/24 dev enp0s3 label enp0s3:2]
execute: /home/kingbase/cluster/R3HA/db/bin/arping -U 192.168.1.200 -I enp0s3 -w 1
ARPING 192.168.1.200 from 192.168.1.200 enp0s3
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
2022-08-30 15:39:31 promote begin...let 192.168.1.102 become master
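
After the switchover, the db VIP (192.168.1.200) should be bound to the new primary's NIC. A quick manual check (interface name taken from the log above):

ip addr show dev enp0s3 | grep 192.168.1.200   # VIP now present on the new primary
ping -c 1 192.168.1.200                        # VIP reachable over the network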

4. Run "select 33333;" against the database service to verify it is alive; if it responds, execute promote to make this node the new primary and switch the replication mode to async (a hedged reconstruction of these commands follows the log excerpt).

check db if is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2022-08-30 15:39:31 kingbase is ok , to prepare execute promote
execute promote
server promoting
check db if is alive after promote
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2022-08-30 15:39:32 after execute promote , kingbase status is ok.
after execute promote, kingbase is ok.
2022-08-30 15:39:32 sync to async
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)
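
A hedged reconstruction of what the script does in this step, assuming the standard KingbaseES tool and parameter names (sys_ctl promote, synchronous_standby_names); setting the parameter to an empty string to switch to async is an assumption:

DATA_DIR=/home/kingbase/cluster/R3HA/db/data
sys_ctl promote -D "$DATA_DIR"                                 # "server promoting"
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" \
     -c "ALTER SYSTEM SET synchronous_standby_names = '';"     # sync -> async
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10" \
     -c "select sys_reload_conf();"                            # SYS_RELOAD_CONF shown above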

5. Connect to the new primary's database service; if the connection succeeds, run a checkpoint on it. Once the checkpoint succeeds, the failover is complete.

2022-08-30 15:39:32 make checkpoint
check the db to see if it is alive
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10"  -c "select 33333;"
2022-08-30 15:39:32 kingbase is ok , to prepare execute checkpoint
execute checkpoint
CHECKPOINT
check the db to see if it is alive after execute checkpoint
ksql "port=54321 user=SUPERMANAGER_V8ADMIN  dbname=TEST connect_timeout=10"  -c "select 33333;"
2022-08-30 15:39:33 after execute checkpoint, kingbase is ok.
after execute checkpoint, kingbase is ok.
-----------------2022-08-30 15:39:33 failover end---------------------------------------

II. Recovering the old primary as the new standby

1. Create the recovery.conf file on the old primary (a crond job periodically invokes the network_rewind.sh script, which automatically recovers the node as a standby; an illustrative crontab entry follows the command below).

[kingbase@node101 data]$ cp ../etc/recovery.done ./recovery.conf
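
An illustrative example of the kind of crontab entry that drives the automatic recovery (the interval, script path and log path are assumptions; check the real entry with crontab -l as the kingbase user):

*/1 * * * * /home/kingbase/cluster/R3HA/kingbasecluster/bin/network_rewind.sh >> /home/kingbase/cluster/R3HA/kingbasecluster/log/recovery.log 2>&1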

2. Examine the recovery.log output.

[kingbase@node101 log]$  more recovery.log
---------------------------------------------------------------------

1) Recovery begins by checking that the mount point is readable and writable (a sketch of this probe follows the log excerpt).

2022-08-30 16:42:01 recover beging...
my pid is 11927,officially began to perform recovery
2022-08-30 16:42:01 check read/write on mount point
2022-08-30 16:42:01 check read/write on mount point (1 / 6).
2022-08-30 16:42:01 stat the directory of the mount point "/home/kingbase/cluster/R3HA/db/data" ...
2022-08-30 16:42:01 stat the directory of the mount point "/home/kingbase/cluster/R3HA/db/data" ... OK
2022-08-30 16:42:01 create/write the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ...
2022-08-30 16:42:01 create/write the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ... OK
2022-08-30 16:42:01 stat the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ...
2022-08-30 16:42:01 stat the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ... OK
2022-08-30 16:42:01 read the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ...
2022-08-30 16:42:01 read the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ... OK
2022-08-30 16:42:01 delete the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ...
2022-08-30 16:42:01 delete the file "/home/kingbase/cluster/R3HA/db/data/rw_status_file_574940445" ... OK
2022-08-30 16:42:01 success to check read/write on mount point (1 / 6).
2022-08-30 16:42:01 check read/write on mount point ... ok
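
A minimal sketch of the read/write probe shown above (the data directory comes from the log; the probe file name and error handling are illustrative):

DATA_DIR=/home/kingbase/cluster/R3HA/db/data
PROBE="$DATA_DIR/rw_status_file_$RANDOM"
stat "$DATA_DIR"  > /dev/null || exit 1   # directory reachable
echo "rw probe" > "$PROBE"    || exit 1   # create/write
stat "$PROBE"     > /dev/null || exit 1   # stat
cat  "$PROBE"     > /dev/null || exit 1   # read back
rm -f "$PROBE"                || exit 1   # delete
echo "check read/write on mount point ... ok"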

2) Ping the gateway to check network connectivity (a sketch of this check follows the log line).

2022-08-30 16:42:01 check if the network is ok
ping trust ip 192.168.1.1 success ping times :[3], success times:[2]
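
A minimal sketch of the gateway check, assuming the trust IP 192.168.1.1 and the 3-attempts / 2-successes thresholds seen in the log (variable names are illustrative):

TRUST_IP=192.168.1.1
ATTEMPTS=3; OK=0
for i in $(seq 1 "$ATTEMPTS"); do
    ping -c 1 -W 2 "$TRUST_IP" > /dev/null 2>&1 && OK=$((OK + 1))
done
if [ "$OK" -ge 2 ]; then
    echo "ping trust ip $TRUST_IP success ping times :[$ATTEMPTS], success times:[$OK]"
else
    echo "network check failed, skip recovery"; exit 1
fi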

3) Determine the current node's role (master or standby); if it is a standby, decide whether recovery is needed ("I,m status is [1] (1 is down), I will be in recovery."). A manual replication-state check is sketched after the log excerpt.

determine if i am master or standby
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | down   | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)

i am standby in cluster,determine if recovery is needed
2022-08-30 16:42:03 now will del vip [192.168.1.200/24]
but no 192.168.1.200/24 on my DEV, nothing to do with del
ksql execute success,but node:node101 does not have correct streaming(or catchup) replication ,will retry ,retry times:[1/10]
......
ksql execute success,but node:node101 does not have correct streaming(or catchup) replication ,will retry ,retry times:[10/10]
primary node/Im node status is changed, primary ip[192.168.1.102], recovery.conf NEED_CHANGE [0] (0 is need ), I,m status is [1] (1 is down), I will be in recovery.
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | down   | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)
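
The retry loop above waits until this node appears as a streaming (or catchup) standby on the new primary. A hedged way to check that manually, assuming the view is named sys_stat_replication as in other KingbaseES releases (the exact query used by the script is not shown in the log):

ksql "host=192.168.1.102 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST" \
     -c "select application_name, state, sync_state from sys_stat_replication;"
# node101 leaves the retry loop only once its entry shows state 'streaming' (or 'catchup')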

4) If recovery is needed, run sys_rewind.

if recover node up, let it down , for rewind
2022-08-30 16:42:34 sys_rewind...
sys_rewind  --target-data=/home/kingbase/cluster/R3HA/db/data --source-server="host=192.168.1.102 port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST"
datadir_source = /home/kingbase/cluster/R3HA/db/data
rewinding from last common checkpoint at 0/66000028 on timeline 15
find last common checkpoint start time from 2022-08-30 16:42:34.727711 CST to 2022-08-30 16:42:34.839545 CST, in "0.111834" seconds.
reading source file list
reading target file list
reading WAL in target
Rewind datadir file from source
update the control file: minRecoveryPoint is '0/68024E80', minRecoveryPointTLI is '16', and database state is 'in archive recovery'
rewind start wal location 0/66000028 (file 0000000F0000000000000066), end wal location 0/68024E80 (file 000000100000000000000068). time from 2022-08-30 16:42:36.727711 CST to 2022-08-30 16:42:37.201678 CST, in "2.473967" seconds.
Done!

5) After sys_rewind succeeds, adjust the cluster configuration files (an illustrative recovery.conf for the new standby follows the log excerpt).

 sed conf change #synchronous_standby_names
2022-08-30 16:42:39 file operate
cp recovery.conf...
 change recovery.conf ip -> primary.ip
2022-08-30 16:42:39 no need change recovery.conf, primary node is 192.168.1.102
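
For reference, a hedged example of what the new standby's recovery.conf typically contains; the host and slot name come from this cluster, while the replication user and application_name are placeholders, not values taken from the log:

standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_slot_name = 'slot_node101'
primary_conninfo = 'host=192.168.1.102 port=54321 user=<replication_user> application_name=node101'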

6) Start the new standby's database service, dropping the old replication slots and recreating them (a hedged sketch of the slot handling follows the log excerpt). Once the standby is up, the replication mode is set back to sync.

delete pid file if exist
del the replication_slots if exis
drop the slot [slot_node101].
drop the slot [slot_node102].
2022-08-30 16:42:39 start up the kingbase...
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "/home/kingbase/cluster/R3HA/db/data/sys_log".
 done
server started
ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST connect_timeout=10"   -c "select 33333;"
 SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
 (slot_node101,)
(1 row)

2022-08-30 16:42:41 create the slot [slot_node101] success.
 SYS_CREATE_PHYSICAL_REPLICATION_SLOT
--------------------------------------
 (slot_node102,)
(1 row)

2022-08-30 16:42:41 create the slot [slot_node102] success.
2022-08-30 16:42:41 start up standby successful!
cluster is sync cluster.
SYNC RECOVER MODE ...
2022-08-30 16:42:41 remote primary node change sync
ALTER SYSTEM
 SYS_RELOAD_CONF
-----------------
 t
(1 row)

SYNC RECOVER MODE DONE
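
A hedged sketch of the slot handling, built around the SYS_CREATE_PHYSICAL_REPLICATION_SLOT calls shown above; whether the drops use sys_drop_replication_slot / sys_replication_slots, and on which node the slots are recreated, is not explicit in the log, so treat the details as assumptions:

ksql "port=54321 user=SUPERMANAGER_V8ADMIN dbname=TEST" <<'SQL'
-- drop any leftover slots (names taken from the "drop the slot" log lines)
select sys_drop_replication_slot(slot_name)
  from sys_replication_slots
 where slot_name in ('slot_node101', 'slot_node102');
-- recreate the physical slots for both nodes
select sys_create_physical_replication_slot('slot_node101');
select sys_create_physical_replication_slot('slot_node102');
SQL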

7) Run pcp_attach_node to attach the node back into the cluster.

2022-08-30 16:42:44 attach pool...
IM Node is 0, will try [pcp_attach_node -U kingbase -W MTIzNDU2 -h 192.168.1.201 -n 0]
pcp_attach_node -- Command Successful
 node_id |   hostname    | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+-------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.1.101 | 54321 | up     | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.1.102 | 54321 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)

2022-08-30 16:42:45 attach end..
recovery success,exit script with success

III. Summary

If the primary node's operating system is rebooted instead, and the standby's attempts to reconnect to the primary database still fail within the retry threshold, the resulting failover process is the same as in the OS-shutdown case analyzed above.