Notes on a problem when connecting the Hadoop 2.8.4 ResourceManager (RM) to ZooKeeper (ZK) for HA
Original article read on: 2023-07-11

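Background:

Our company connected the production Hadoop RM to ZK to achieve high availability. However, a ZK znode can hold only about 1 MB by default, so once the stored data grows large enough, production jobs may crash.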

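The fix has two parts:

1. Modify the ZK configuration to raise the default storage limit.

2. Change the path structure under which RM data is stored in ZK, so that the split structure can hold more data.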

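Issue 1: Modify the ZK configuration to raise the default storage limit

This mostly comes down to changing configuration parameters.

Change the configuration on every ZK node (here the limit is raised to about 10 MB):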

vi zkServer.sh

Add the following setting at the position shown in the original post's figure:

ZOO_USER_CFG="-Djute.maxbuffer=10240000"

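Modify zkCli.sh as well (otherwise the command-line client cannot read data larger than 1 MB).

Even then, a client in our own code still cannot read data above the limit unless the jute.maxbuffer Java system property is also set, as follows: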

System.setProperty("jute.maxbuffer",String.valueOf(10240000));
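
The call above only helps if it runs early enough. As far as I know, ZooKeeper 3.4.x client classes read jute.maxbuffer into static fields when they are first loaded, so the property should be set before any ZooKeeper class is touched. The following is a minimal sketch under that assumption; the class name JuteMaxBufferSetup and the helper ensureJuteMaxBuffer are hypothetical names used only for illustration, and the sketch deliberately does not create a ZooKeeper client.

```java
public class JuteMaxBufferSetup {
    // Same value the post configures on the ZK servers (10240000 bytes, roughly 10 MB).
    static final int JUTE_MAX_BUFFER_BYTES = 10240000;

    /**
     * Hypothetical helper: sets jute.maxbuffer unless it was already supplied
     * (for example with -Djute.maxbuffer=... on the command line) and returns
     * the value that is in effect afterwards.
     */
    static int ensureJuteMaxBuffer(int bytes) {
        if (System.getProperty("jute.maxbuffer") == null) {
            System.setProperty("jute.maxbuffer", String.valueOf(bytes));
        }
        // Integer.getInteger parses the system property, falling back to `bytes`
        // if the property is missing or not a valid integer.
        return Integer.getInteger("jute.maxbuffer", bytes);
    }

    public static void main(String[] args) {
        int effective = ensureJuteMaxBuffer(JUTE_MAX_BUFFER_BYTES);
        System.out.println("jute.maxbuffer = " + effective);
        // Assumption: only after this point should the application create its
        // ZooKeeper client, so that the client sees the raised limit.
    }
}
```

Leaving an existing value untouched is a design choice: a limit passed on the JVM command line wins over the hard-coded default, which keeps the client consistent with whatever the servers were started with.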

The YARN configuration has to be changed in the same way, otherwise everything above is wasted effort.
In yarn-env.sh,
add one line:

YARN_RESOURCEMANAGER_OPTS="$YARN_RESOURCEMANAGER_OPTS -Djute.maxbuffer=10240000"

Issue 2: Optimize the storage structure in ZK

YARN currently stores its state in ZK as follows:

ROOT_DIR_PATH
|--- VERSION_INFO
|--- EPOCH_NODE
|--- RM_ZK_FENCING_LOCK
|--- RM_APP_ROOT
| |----- (#ApplicationId1)
| | |----- (#ApplicationAttemptIds)
| |
| |----- (#ApplicationId2)
| | |----- (#ApplicationAttemptIds)
| ….
|
|--- RM_DT_SECRET_MANAGER_ROOT
|----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
|----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
| |----- Token_1
| |----- Token_2
| ….
|
|----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
| |----- Key_1
| |----- Key_2
….
|--- AMRMTOKEN_SECRET_MANAGER_ROOT
|----- currentMasterKey
|----- nextMasterKey

It is updated to:

* The znode structure is as follows:
* ROOT_DIR_PATH
* |--- VERSION_INFO
* |--- EPOCH_NODE
* |--- RM_ZK_FENCING_LOCK
* |--- RM_APP_ROOT
* | |----- HIERARCHIES
* | | |----- 1
* | | | |----- (#ApplicationId barring last character)
* | | | | |----- (#Last character of ApplicationId)
* | | | | | |----- (#ApplicationAttemptIds)
* | | | ….
* | | |
* | | |----- 2
* | | | |----- (#ApplicationId barring last 2 characters)
* | | | | |----- (#Last 2 characters of ApplicationId)
* | | | | | |----- (#ApplicationAttemptIds)
* | | | ….
* | | |
* | | |----- 3
* | | | |----- (#ApplicationId barring last 3 characters)
* | | | | |----- (#Last 3 characters of ApplicationId)
* | | | | | |----- (#ApplicationAttemptIds)
* | | | ….
* | | |
* | | |----- 4
* | | | |----- (#ApplicationId barring last 4 characters)
* | | | | |----- (#Last 4 characters of ApplicationId)
* | | | | | |----- (#ApplicationAttemptIds)
* | | | ….
* | | |
* | |----- (#ApplicationId1)
* | | |----- (#ApplicationAttemptIds)
* | |
* | |----- (#ApplicationId2)
* | | |----- (#ApplicationAttemptIds)
* | ….
* |
* |--- RM_DT_SECRET_MANAGER_ROOT
* |----- RM_DT_SEQUENTIAL_NUMBER_ZNODE_NAME
* |----- RM_DELEGATION_TOKENS_ROOT_ZNODE_NAME
* | |----- 1
* | | |----- (#TokenId barring last character)
* | | | |----- (#Last character of TokenId)
* | | ….
* | |----- 2
* | | |----- (#TokenId barring last 2 characters)
* | | | |----- (#Last 2 characters of TokenId)
* | | ….
* | |----- 3
* | | |----- (#TokenId barring last 3 characters)
* | | | |----- (#Last 3 characters of TokenId)
* | | ….
* | |----- 4
* | | |----- (#TokenId barring last 4 characters)
* | | | |----- (#Last 4 characters of TokenId)
* | | ….
* | |----- Token_1
* | |----- Token_2
* | ….
* |
* |----- RM_DT_MASTER_KEYS_ROOT_ZNODE_NAME
* | |----- Key_1
* | |----- Key_2
* ….
* |--- AMRMTOKEN_SECRET_MANAGER_ROOT
* |----- currentMasterKey
* |----- nextMasterKey
*
* |-- RESERVATION_SYSTEM_ROOT
* |------PLAN_1
* | |------ RESERVATION_1
* | |------ RESERVATION_2
* | ….
* |------PLAN_2
* ….

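Add one configuration item to yarn-site.xml. Note that 0, the value shown here, means no split; set it to a value from 1 to 4 to enable the hierarchical layout.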

<property>
  <description>Index at which last section of application id (with each section
  separated by _ in application id) will be split so that application znode
  stored in zookeeper RM state store will be stored as two different znodes
  (parent-child). Split is done from the end.
  For instance, with no split, appid znode will be of the form
  application_1352994193343_0001. If the value of this config is 1, the
  appid znode will be broken into two parts application_1352994193343_000
  and 1 respectively with former being the parent node.
  application_1352994193343_0002 will then be stored as 2 under the parent
  node application_1352994193343_000. This config can take values from 0 to 4.
  0 means there will be no split. If configuration value is outside this
  range, it will be treated as config value of 0 (i.e. no split). A value
  larger than 0 (up to 4) should be configured if you are storing a large number
  of apps in ZK based RM state store and state store operations are failing due to
  LenError in Zookeeper.</description>
  <name>yarn.resourcemanager.zk-appid-node.split-index</name>
  <value>0</value>
</property>
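
To make the split rule in the description concrete, here is a minimal sketch of how an application ID could be mapped to a znode path for a given split index, following the HIERARCHIES layout shown in the tree above. It is an illustration, not the actual ZKRMStateStore code: the class AppIdZnodeSplit, the methods normalizeSplitIndex and appZnodePath, and the root path "/RMAppRoot" used in main are all hypothetical.

```java
public class AppIdZnodeSplit {
    // Name of the intermediate node, taken from the znode tree above.
    static final String HIERARCHIES = "HIERARCHIES";

    /** Values outside 0..4 are treated as 0 (no split), as the description states. */
    static int normalizeSplitIndex(int splitIndex) {
        return (splitIndex < 0 || splitIndex > 4) ? 0 : splitIndex;
    }

    /**
     * Hypothetical helper: builds the znode path of an application.
     * With split index n (1..4) the last n characters of the application ID
     * become the child node and the rest becomes the parent node.
     */
    static String appZnodePath(String appRoot, String appId, int splitIndex) {
        int idx = normalizeSplitIndex(splitIndex);
        if (idx == 0) {
            // No split: the application znode sits directly under the app root.
            return appRoot + "/" + appId;
        }
        int cut = appId.length() - idx;          // the split is done from the end
        String parent = appId.substring(0, cut); // e.g. application_1352994193343_000
        String child = appId.substring(cut);     // e.g. 1
        return appRoot + "/" + HIERARCHIES + "/" + idx + "/" + parent + "/" + child;
    }

    public static void main(String[] args) {
        String root = "/RMAppRoot"; // assumed root path, for illustration only
        String appId = "application_1352994193343_0001";
        // prints /RMAppRoot/application_1352994193343_0001
        System.out.println(appZnodePath(root, appId, 0));
        // prints /RMAppRoot/HIERARCHIES/1/application_1352994193343_000/1
        System.out.println(appZnodePath(root, appId, 1));
    }
}
```

With a split index of 1, application_1352994193343_0001 and application_1352994193343_0002 share the parent application_1352994193343_000, so no single parent node has to list every application as a direct child.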

Reference: https://cloud.tencent.com/developer/article/1491079

Reference: https://issues.apache.org/jira/browse/YARN-2368

Reference: https://issues.apache.org/jira/browse/YARN-2962