An Introduction to AutoTikv

AutoTikv is a tool for automatically tuning the TiKV database. Its design is inspired by a SIGMOD 2017 paper, Automatic Database Management System Tuning Through Large-scale Machine Learning, which uses machine learning models to tune database configurations automatically.

Project repository: https://github.com/pentium3/AutoTiKV

Design Goals

The overall tuning process is roughly as shown in the figure below:

The whole process loops for 200 rounds (user-configurable), or it can be set to run until the results converge.

AutoTiKV支持在修改参数之后重启tikv(如果不需要也可以选择不重启)。需要调节的参数和需要查看的metric可以在controller.py里声明。

Here is a sample knob declaration:

"rocksdb.defaultcf.write-buffer-size":  # knob name; section names and the knob name are separated by dots
{
    "changebyyml": True,  # True: tune this knob by editing tikv-ansible/conf/tikv.yml
    "set_func": None,     # if changebyyml == False, name the function that sets the knob here
                          # (also defined in controller.py; usually it shells out to tikv-ctl)
    "minval": 64,         # if type != enum, the minimum possible value
    "maxval": 1024,       # if type != enum, the maximum possible value
    "enumval": [],        # if type == enum, list all valid values
    "type": "int",        # int / enum / real
    "default": 64         # default value
},
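When changebyyml is True, the dotted knob name has to be mapped onto the nested structure of the parsed tikv.yml. The following is a minimal sketch of that mapping, assuming the config is loaded as a plain dict; the helper name is hypothetical, not AutoTikv's actual API:

```python
# Minimal sketch: map a dotted knob name such as
# "rocksdb.defaultcf.write-buffer-size" onto the nested dict produced by
# parsing tikv.yml. The helper name is hypothetical.
def set_knob_in_config(config: dict, knob_name: str, value) -> None:
    """Split the dotted knob name into sections and set the leaf key."""
    *sections, key = knob_name.split(".")
    node = config
    for section in sections:
        node = node.setdefault(section, {})
    node[key] = value

config = {}
set_knob_in_config(config, "rocksdb.defaultcf.write-buffer-size", "128MB")
# config == {"rocksdb": {"defaultcf": {"write-buffer-size": "128MB"}}}
```

A real controller would then serialize the dict back to tikv.yml and, if required, restart TiKV before benchmarking.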

Here is a sample metric declaration:

"write_latency":
{
    "read_func": read_write_latency,  # function that reads this metric (also defined in controller.py)
    "lessisbetter": 1,                # whether a smaller value of this metric is better (1: yes)
    "calc": "ins",                    # "ins": use the value observed after the benchmark as-is;
                                      # "inc": the metric is incremental, so subtract the value
                                      # before the benchmark from the value after it
},
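The "ins"/"inc" distinction can be sketched in a few lines; the helper name and the sample numbers below are illustrative, not AutoTikv's actual API:

```python
# Sketch of how the "calc" field might be applied: "ins" metrics use the
# post-benchmark reading as-is, while "inc" (incremental) metrics take the
# difference between the post- and pre-benchmark readings.
def metric_result(calc: str, before: float, after: float) -> float:
    """Turn two raw readings into one metric value per its 'calc' type."""
    if calc == "ins":
        return after            # instantaneous: post-benchmark value
    if calc == "inc":
        return after - before   # incremental counter: delta over the run
    raise ValueError(f"unknown calc type: {calc}")

write_latency = metric_result("ins", before=0.0, after=50423.0)
store_size_delta = metric_result("inc", before=8.86e10, after=8.90e10)
```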

The first 10 rounds (the exact number is configurable) benchmark randomly generated knobs; every round after that benchmarks the knobs recommended by the ML model.

ML Model

Like OtterTune, AutoTikv uses Gaussian Process Regression (GP below) to recommend new knobs. GP is a nonparametric model based on the Gaussian distribution. Its advantages are:

  1. Compared with methods such as neural networks, GP is a nonparametric model with relatively low computational cost, and it outperforms NNs when training samples are scarce.
  2. It estimates the distribution at a sample X, i.e. the mean m(X) and standard deviation s(X). If there is little data around X, the estimated standard deviation s(X) will be large (meaning X differs a lot from the other data points). Intuitively, sparse data means high uncertainty, which shows up as a large standard deviation; conversely, with enough data the uncertainty shrinks and the standard deviation becomes small. This property is used later.
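For reference, m(X) and s(X) are the standard GP posterior quantities. Under the usual GP regression setup (a sketch consistent with the description above, with training inputs X, targets y, kernel k, and noise variance \sigma_n^2), at a test point x_*:

```latex
% GP posterior at a test point x_*, where
%   K_{ij} = k(x_i, x_j)  and  (\mathbf{k}_*)_i = k(x_i, x_*):
\begin{aligned}
m(x_*)   &= \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y} \\
s^2(x_*) &= k(x_*, x_*) - \mathbf{k}_*^{\top} \left(K + \sigma_n^2 I\right)^{-1} \mathbf{k}_*
\end{aligned}
```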

However, GP by itself only estimates the distribution of samples. To obtain a final recommendation, we apply it within Bayesian Optimization, which proceeds in roughly two steps:

  1. Estimate the distribution of the objective function with GP
  2. Use an acquisition function to guide the next sample (i.e. to produce the recommendation)

When searching for a new recommendation, the acquisition function balances two properties, exploration and exploitation:

  • exploration: probe new points in unknown regions where data is currently sparse.
  • exploitation: in well-sampled known regions, use the existing data to train the model, estimate the objective, and find the optimum.

The recommendation process must balance the two. Too much exploitation traps the result in a local optimum (repeatedly recommending the best point known so far while better points remain undiscovered), while too much exploration makes the search inefficient (endlessly probing new regions without drilling into the promising ones). The core idea for balancing them: when data is plentiful, recommend from the existing data; when data is scarce, explore the least-sampled region, since probing the most unknown region yields the most information.

The second step of Bayesian optimization implements exactly this idea. As noted above, GP estimates the mean m(X) and standard deviation s(X) at X: the mean m(X) characterizes exploitation, while the standard deviation s(X) characterizes exploration. With these we can solve the problem via Bayesian optimization.

We use the Upper Confidence Bound (UCB) as the acquisition function. Suppose we want to find the X that makes Y as large as possible. Then U(X) = m(X) + k*s(X), where k > 0 is a tunable coefficient, and we simply look for the X that maximizes U(X).

  • If U(X) is large, then either m(X) is large or s(X) is large.
  • If s(X) is large, there is little data around X, so we should explore new points in the unknown region.
  • If m(X) is large, the estimated mean of Y is large, so we should exploit the known data to find good points.
  • The coefficient k controls the exploration/exploitation trade-off: the larger k is, the more exploration of new regions is encouraged.

In the implementation, a number of candidate knobs are randomly generated at the start of each round; the model above computes U(X) for each candidate, and the one with the largest U(X) becomes this round's recommendation.
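The round described above can be sketched as a self-contained NumPy toy. This is not AutoTikv's actual code (which reuses OtterTune's GP implementation); the kernel, length scale, objective, and coefficient k are all illustrative, and knob vectors are assumed normalized to [0, 1]:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean m(x) and standard deviation s(x) of a zero-mean GP."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_train, X_test)   # shape (n_train, n_test)
    K_inv = np.linalg.inv(K)
    mean = K_star.T @ K_inv @ y_train
    # diagonal of k(x*, x*) - k_*^T K^{-1} k_*  (k(x, x) = 1 for this kernel)
    var = 1.0 - np.einsum("ij,ik,kj->j", K_star, K_inv, K_star)
    return mean, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(0)
X_obs = rng.uniform(0.0, 1.0, size=(10, 2))    # knob vectors tried so far
y_obs = -((X_obs - 0.7) ** 2).sum(axis=1)      # toy metric to maximize

candidates = rng.uniform(0.0, 1.0, size=(100, 2))  # random candidate knobs
m, s = gp_posterior(X_obs, y_obs, candidates)
k = 2.0                                        # exploration coefficient
ucb = m + k * s                                # U(X) = m(X) + k * s(X)
recommended = candidates[np.argmax(ucb)]       # this round's recommendation
```

Raising k shifts the recommendation toward sparsely sampled regions; lowering it concentrates on points near the best observations so far.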

Ref:https://mp.weixin.qq.com/s/y8VIieK0LO37SjRRyPhtrw

Database Parameters

workload

We defined four workloads: writeheavy, longscan, shortscan, and point-lookup. The database size is 80GB in every case.

knobs

We experimented with the following knobs:

| Options | Expected Behavior | Valid Range / Value Set | How to Set/View the Knob |
| --- | --- | --- | --- |
| write-buffer-size | point-lookup, range-scan: larger is better | [64MB, 1GB] | tidb-ansible/conf/tikv.yml |
| max-bytes-for-level-base | point-lookup, range-scan: larger is better | [512MB, 4GB] | tidb-ansible/conf/tikv.yml |
| target-file-size-base | point-lookup, range-scan: larger is better | {8M, 16M, 32M, 64M, 128M} | tidb-ansible/conf/tikv.yml |
| disable-auto-compactions | write-heavy: 1 is better; point-lookup, range-scan: 0 is better | {1, 0} | tidb-ansible/conf/tikv.yml or tikv-ctl |
| block-size | point-lookup: smaller is better; range-scan: larger is better | {4k, 8k, 16k, 32k, 64k} | tidb-ansible/conf/tikv.yml |
| bloom-filter-bits-per-key | point-lookup, range-scan: larger is better | {5, 10, 15, 20} | tidb-ansible/conf/tikv.yml |
| optimize-filters-for-hits | point-lookup, range-scan: 0 is better | {1, 0} | tidb-ansible/conf/tikv.yml |

These knobs mean the following:

  • block-size: RocksDB stores data in data blocks, and block-size sets their size; to access a key, RocksDB must read the entire block containing it. For point lookups, a larger block increases read amplification and hurts performance, but for range scans a larger block uses disk bandwidth more efficiently.
  • disable-auto-compactions: whether to disable compaction. Compaction consumes disk bandwidth and slows down writes, but if the LSM tree never gets compacted, level-0 files accumulate and read performance suffers. Compaction itself is also an interesting auto-tuning direction.
  • write-buffer-size: the size limit (maximum) of a single memtable. In theory a larger memtable increases the cost of the binary search for the insert position, but our earlier preliminary experiments showed no obvious effect of this option on writeheavy.
  • max-bytes-for-level-base: the total size of level 1 in the LSM tree. With a fixed data volume, a larger value effectively means fewer LSM levels, which benefits reads.
  • target-file-size-base: assuming target-file-size-multiplier = 1, this option sets the size of each SST file. A small value means more SST files, which hurts read performance.
  • bloom-filter-bits-per-key: the number of bits per key in the Bloom filter. For reads, larger is better.
  • optimize-filters-for-hits: True disables the bloom filter at the bottom level of the LSM tree. This option exists because the bottom-level bloom filters are large in total and take up considerable block cache space. If queried keys are known to exist in the database, the bottom-level bloom filter serves no purpose anyway.
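For concreteness, a hypothetical tidb-ansible/conf/tikv.yml fragment covering the knobs above might look like this; the key names follow the dotted knob names used in this project, and the values are examples, not recommendations:

```yaml
# Hypothetical tikv.yml fragment; values are illustrative only.
rocksdb:
  defaultcf:
    block-size: "4KB"
    write-buffer-size: "128MB"
    max-bytes-for-level-base: "512MB"
    target-file-size-base: "8MB"
    bloom-filter-bits-per-key: 10
    optimize-filters-for-hits: false
    disable-auto-compactions: false
```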

A few knobs we experimented with but ultimately dropped:

  • block_cache_size: the size of the RocksDB block cache, which caches the decompressed data blocks mentioned above. In theory the block cache should not occupy all of system memory; some should be left for the OS buffer cache to hold compressed data blocks. But in our preliminary experiments the optimal block_cache_size always hit the maximum. Auto-tuning strategies for the block cache have been studied extensively, e.g. using reinforcement learning to choose the replacement policy (SimulatedCache).
  • delayed_write_rate: when flush or compaction cannot keep up with foreground writes, RocksDB throttles writes to delayed_write_rate to avoid read-performance degradation. We originally hoped to experiment with auto-tuning this value, but a write stall causes TiKV to return timeout errors, which breaks the tuning loop, so we had to drop this knob.

metrics

We chose the following metrics as optimization targets:

  • throughput: depending on the workload, this is write throughput, get throughput, or scan throughput
  • latency: depending on the workload, this is write latency, get latency, or scan latency
  • store_size: the size of the data store
  • compaction_cpu: the CPU usage of compaction

Throughput and latency are obtained from go-ycsb's output; store_size and compaction_cpu are obtained via tikv-ctl.

Ref:

https://rdrr.io/github/richfitz/rleveldb/man/leveldb_open.html

http://mysql.taobao.org/monthly/2016/08/03/

https://www.jianshu.com/p/8e0018b6a8b6

https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

Experimental Results

Test Platform

AMD Ryzen5-2600 (6C12T), 32GB RAM, 512GB NVMe SSD, Ubuntu 18.04, tidb-ansible on the master branch

workload=writeheavy    knobs={disable-auto-compactions, block-size}    metric=write_latency

# Copyright (c) 2010 Yahoo! Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you

may not use this file except in compliance with the License. You

may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an "AS IS" BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or

implied. See the License for the specific language governing

permissions and limitations under the License. See accompanying

LICENSE file.

Yahoo! Cloud System Benchmark

Workload A: Update heavy workload

Application example: Session store recording recent actions

Read/update ratio: 50/50

Default data size: 1 KB records (10 fields, 100 bytes each, plus key)

Request distribution: zipfian

80GB

recordcount=80000000
operationcount=5000000

fieldlength=10

workload=core

readallfields=true

readproportion=0
updateproportion=1
scanproportion=0
insertproportion=0

requestdistribution=zipfian

YCSB workload definition

The experiment results are as follows:

################## data ##################
------------------------------previous:------------------------------
knobs:
[[0. 4.]
[1. 3.]
[0. 0.]
[0. 3.]
[0. 1.]
[1. 4.]
[0. 1.]
[1. 0.]
[1. 4.]
[0. 4.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]]
metrics:
[[1.01428000e+04 5.04230000e+04 8.86174709e+10 1.84750000e+02]
[1.01703000e+04 5.02510000e+04 8.98934985e+10 2.50000000e+00]
[1.24102000e+04 4.10920000e+04 8.95223916e+10 2.18850000e+02]
[1.09910000e+04 4.64880000e+04 8.86518967e+10 1.89610000e+02]
[1.20731000e+04 4.21960000e+04 8.90833010e+10 1.88950000e+02]
[9.42460000e+03 5.42690000e+04 8.98143324e+10 3.32000000e+00]
[1.19275000e+04 4.28240000e+04 8.90753594e+10 1.94820000e+02]
[1.18271000e+04 4.32470000e+04 9.11159380e+10 3.08000000e+00]
[9.34830000e+03 5.47160000e+04 8.98211663e+10 3.27000000e+00]
[1.02665000e+04 4.97860000e+04 8.86331145e+10 1.87730000e+02]
[1.25193000e+04 4.08050000e+04 8.94974748e+10 2.19960000e+02]
[1.24805000e+04 4.07670000e+04 8.95419805e+10 2.20190000e+02]
[1.24086000e+04 4.11510000e+04 8.94650026e+10 2.24280000e+02]
[1.21789000e+04 4.18830000e+04 8.95860725e+10 2.18360000e+02]
[1.21835000e+04 4.19280000e+04 8.95094852e+10 2.25200000e+02]
[1.21365000e+04 4.20690000e+04 8.94701087e+10 2.18990000e+02]]
rowlabels: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
num: 16
------------------------------new:------------------------------
knobs: [[0. 0.]]
metrics: [[1.23137000e+04 4.14700000e+04 8.95614611e+10 2.17990000e+02]]
rowlabels: [1]
------------------------------TARGET:------------------------------
knob: ['disable-auto-compactions' 'block-size']
metric: write_latency

metric_lessisbetter: 1

num of knobs == 2
knobs: ['disable-auto-compactions' 'block-size']
num of metrics == 4

metrics: ['write_throughput' 'write_latency' 'store_size' 'compaction_cpu']

################## data ##################

In this experiment, the recommendation was to keep compaction enabled and set the block size to 4KB.

This was surprising at first (intuitively, disabling compaction should always improve write performance). On reflection: TiKV uses Percolator for distributed transactions, so the write path also involves reads (write-conflict detection), and disabling compaction therefore degrades write performance as well. For the same reason, a smaller block size improves point-lookup performance and thus also speeds up TiKV's write path.

To rule out this interference, we next experimented with point lookup, a pure-read workload:

workload=pntlookup80    knobs={'bloom-filter-bits-per-key', 'optimize-filters-for-hits', 'block-size', 'disable-auto-compactions'}    metric=get_latency


Yahoo! Cloud System Benchmark

Workload C: Read only

Application example: user profile cache, where profiles are constructed elsewhere (e.g., Hadoop)

Read/update ratio: 100/0

Default data size: 1 KB records (10 fields, 100 bytes each, plus key)

Request distribution: zipfian

80GB

2min each run

recordcount=80000000
operationcount=4000000
workload=core

readallfields=true

readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0

requestdistribution=zipfian

YCSB workload definition

The experiment results are as follows:

------------------------------previous:------------------------------
rowlabels, finish_time, knobs, metrics
1 , 2019-08-15 20:12:21 , [2. 0. 2. 0.] , [3.66446000e+04 1.39670000e+04 8.62385543e+10 2.36200000e+01]
2 , 2019-08-15 21:01:30 , [2. 0. 2. 1.] , [2.00085000e+04 2.55740000e+04 8.65226052e+10 0.00000000e+00]
3 , 2019-08-15 22:06:48 , [3. 1. 0. 0.] , [4.18042000e+04 1.22580000e+04 8.68646096e+10 4.99000000e+01]
4 , 2019-08-15 23:12:15 , [0. 1. 1. 0.] , [3.97759000e+04 1.28700000e+04 8.64727843e+10 4.36500000e+01]
5 , 2019-08-16 00:18:39 , [3. 1. 1. 0.] , [4.0698500e+04 1.2577000e+04 8.6412687e+10 4.2540000e+01]
6 , 2019-08-16 01:08:15 , [3. 0. 4. 1.] , [1.75872000e+04 2.90890000e+04 8.63167881e+10 1.80000000e-01]
7 , 2019-08-16 02:13:59 , [2. 1. 0. 0.] , [4.14569000e+04 1.23490000e+04 8.68367156e+10 4.94200000e+01]
8 , 2019-08-16 03:20:14 , [0. 1. 3. 0.] , [3.2892000e+04 1.5563000e+04 8.6045883e+10 4.1360000e+01]
9 , 2019-08-16 04:26:29 , [2. 1. 2. 0.] , [3.56923000e+04 1.43400000e+04 8.61031652e+10 3.95600000e+01]
10 , 2019-08-16 05:32:04 , [1. 0. 0. 0.] , [4.09599000e+04 1.25000000e+04 8.69347684e+10 4.80500000e+01]
11 , 2019-08-16 06:38:25 , [3. 0. 0. 0.] , [4.11105000e+04 1.24550000e+04 8.70293207e+10 4.88900000e+01]
12 , 2019-08-16 07:44:29 , [1. 1. 0. 0.] , [4.18002000e+04 1.22470000e+04 8.68315762e+10 4.95400000e+01]
13 , 2019-08-16 08:50:32 , [2. 0. 0. 0.] , [4.21299000e+04 1.21530000e+04 8.69322719e+10 3.92500000e+01]
14 , 2019-08-16 09:56:32 , [0. 0. 0. 0.] , [3.96365000e+04 1.29120000e+04 8.68696194e+10 5.50400000e+01]
15 , 2019-08-16 11:02:19 , [2. 1. 0. 0.] , [4.13551000e+04 1.23780000e+04 8.68479242e+10 5.01600000e+01]
16 , 2019-08-16 12:08:19 , [0. 1. 0. 0.] , [3.98915000e+04 1.28310000e+04 8.68413685e+10 4.53700000e+01]
17 , 2019-08-16 13:14:13 , [2. 1. 0. 0.] , [4.1778800e+04 1.2253000e+04 8.6845963e+10 4.8780000e+01]
18 , 2019-08-16 14:05:52 , [0. 1. 0. 1.] , [1.37462000e+04 3.72160000e+04 8.74961963e+10 0.00000000e+00]
19 , 2019-08-16 15:11:48 , [2. 1. 1. 0.] , [4.03858000e+04 1.26740000e+04 8.64025255e+10 3.95100000e+01]
20 , 2019-08-16 16:18:06 , [0. 0. 2. 0.] , [3.49978000e+04 1.46240000e+04 8.61336679e+10 2.37300000e+01]
21 , 2019-08-16 17:24:02 , [2. 0. 1. 0.] , [4.13509000e+04 1.23770000e+04 8.65494483e+10 2.70600000e+01]
22 , 2019-08-16 18:29:36 , [3. 1. 0. 0.] , [4.18111000e+04 1.22440000e+04 8.68484968e+10 4.96900000e+01]
23 , 2019-08-16 19:36:16 , [1. 0. 1. 0.] , [4.03078000e+04 1.27000000e+04 8.64872698e+10 3.91300000e+01]
24 , 2019-08-16 20:41:55 , [3. 1. 0. 0.] , [4.26687000e+04 1.19980000e+04 8.68488277e+10 3.38800000e+01]
25 , 2019-08-16 21:47:55 , [2. 0. 0. 0.] , [4.19810000e+04 1.21900000e+04 8.69691844e+10 4.00500000e+01]
26 , 2019-08-16 22:54:13 , [3. 1. 0. 0.] , [4.18609000e+04 1.22290000e+04 8.68388398e+10 5.11200000e+01]
27 , 2019-08-17 00:01:29 , [3. 1. 4. 0.] , [2.9123000e+04 1.7575000e+04 8.6027491e+10 4.3860000e+01]
28 , 2019-08-17 01:07:53 , [2. 0. 0. 0.] , [4.12169000e+04 1.24210000e+04 8.69920328e+10 4.67300000e+01]
29 , 2019-08-17 02:13:38 , [3. 1. 0. 0.] , [4.18402000e+04 1.22350000e+04 8.68513516e+10 4.57200000e+01]
30 , 2019-08-17 03:19:31 , [2. 0. 0. 0.] , [4.20812000e+04 1.21640000e+04 8.69824656e+10 4.01500000e+01]
31 , 2019-08-17 04:25:12 , [3. 1. 0. 0.] , [4.16913000e+04 1.22760000e+04 8.68498155e+10 4.98100000e+01]
32 , 2019-08-17 05:31:00 , [3. 0. 0. 0.] , [4.15515000e+04 1.23180000e+04 8.70275493e+10 4.94400000e+01]
33 , 2019-08-17 06:37:15 , [3. 1. 0. 0.] , [4.16460000e+04 1.22920000e+04 8.68442154e+10 4.66100000e+01]
34 , 2019-08-17 07:43:24 , [3. 0. 0. 0.] , [4.22696000e+04 1.21100000e+04 8.70264613e+10 3.65300000e+01]
35 , 2019-08-17 08:49:24 , [3. 1. 0. 0.] , [4.18575000e+04 1.22280000e+04 8.68419002e+10 4.99000000e+01]
36 , 2019-08-17 09:55:36 , [3. 0. 0. 0.] , [4.07931000e+04 1.25500000e+04 8.70300743e+10 4.98500000e+01]
37 , 2019-08-17 11:00:54 , [3. 1. 0. 0.] , [4.19244000e+04 1.22080000e+04 8.68508093e+10 4.98500000e+01]
38 , 2019-08-17 12:06:37 , [3. 0. 0. 0.] , [4.1197800e+04 1.2425000e+04 8.7020173e+10 4.6780000e+01]
39 , 2019-08-17 13:12:35 , [3. 1. 0. 0.] , [4.19859000e+04 1.21920000e+04 8.68462752e+10 4.20200000e+01]
40 , 2019-08-17 14:18:12 , [3. 0. 0. 0.] , [4.09505000e+04 1.25020000e+04 8.70206609e+10 5.18800000e+01]
41 , 2019-08-17 15:23:32 , [3. 1. 0. 0.] , [4.19558000e+04 1.22030000e+04 8.68409963e+10 4.25600000e+01]
42 , 2019-08-17 16:29:22 , [3. 0. 0. 0.] , [4.15804000e+04 1.23100000e+04 8.70172108e+10 4.56500000e+01]
43 , 2019-08-17 17:35:13 , [3. 1. 0. 0.] , [4.16524000e+04 1.22890000e+04 8.68602952e+10 4.62100000e+01]
44 , 2019-08-17 18:41:04 , [3. 0. 0. 0.] , [4.09697000e+04 1.24950000e+04 8.70105798e+10 4.56000000e+01]
45 , 2019-08-17 19:46:55 , [3. 1. 0. 0.] , [4.16999000e+04 1.22770000e+04 8.68411373e+10 4.83400000e+01]
46 , 2019-08-17 20:52:48 , [3. 0. 0. 0.] , [4.11311000e+04 1.24450000e+04 8.70303738e+10 4.90000000e+01]
47 , 2019-08-17 21:58:48 , [3. 1. 0. 0.] , [4.23772000e+04 1.20780000e+04 8.68478265e+10 3.74500000e+01]
48 , 2019-08-17 23:04:49 , [3. 0. 0. 0.] , [4.12347000e+04 1.24120000e+04 8.70284529e+10 3.89000000e+01]
49 , 2019-08-18 00:10:42 , [3. 1. 0. 0.] , [4.29264000e+04 1.19250000e+04 8.68530475e+10 3.23300000e+01]
50 , 2019-08-18 01:16:15 , [3. 0. 0. 0.] , [4.15186000e+04 1.23290000e+04 8.70386584e+10 3.65400000e+01]
51 , 2019-08-18 02:21:36 , [3. 1. 0. 0.] , [4.26975000e+04 1.19900000e+04 8.68521299e+10 4.03900000e+01]
52 , 2019-08-18 03:27:19 , [3. 0. 0. 0.] , [4.08752000e+04 1.25230000e+04 8.70437235e+10 4.79600000e+01]
------------------------------new:------------------------------
knobs: [[3. 1. 0. 0.]]
metrics: [[4.21738000e+04 1.21390000e+04 8.68461987e+10 4.58900000e+01]]
rowlabels: [1]
timestamp: 2019-08-18 04:33:07
------------------------------TARGET:------------------------------
knob: ['bloom-filter-bits-per-key' 'optimize-filters-for-hits' 'block-size' 'disable-auto-compactions']
metric: get_latency

metric_lessisbetter: 1

num of knobs == 4
knobs: ['bloom-filter-bits-per-key' 'optimize-filters-for-hits' 'block-size' 'disable-auto-compactions']
num of metrics == 4

metrics: ['get_throughput' 'get_latency' 'store_size' 'compaction_cpu']

The recommendation: bloom-filter-bits-per-key = 20, block-size = 4K, and do not disable auto compaction. Whether optimize-filters-for-hits is enabled barely matters (which is why its recommended value keeps flip-flopping).

The recommendations largely match expectations. Regarding optimize-filters-for-hits: the block cache in this experiment is probably large enough that the bloom filter size barely affects cache performance; moreover, we set the option on the default CF, and for TiKV a lookup in the default CF only happens after we already know the key exists, so the filter makes no difference there. In the next experiment we instead set optimize-filters-for-hits on the write CF (for the default CF it already defaults to 0), and set bloom-filter-bits-per-key separately on the default CF and the write CF, treating them as two knobs.

workload=pntlookup80    knobs={rocksdb.writecf.bloom-filter-bits-per-key,  rocksdb.defaultcf.bloom-filter-bits-per-key,  rocksdb.writecf.optimize-filters-for-hits,  rocksdb.defaultcf.block-size,  rocksdb.defaultcf.disable-auto-compactions}    metric=get_throughput

To measure the effect of the bloom filter as clearly as possible, we also modified the workload in addition to the changes above: the recordcount in the run phase is set to twice that of the load phase, forcing half of the lookups to target keys that do not exist. This should show that optimize-filters-for-hits on the write CF must stay off. The modified workload:


Yahoo! Cloud System Benchmark

Workload C: Read only

Application example: user profile cache, where profiles are constructed elsewhere (e.g., Hadoop)

Read/update ratio: 100/0

Default data size: 1 KB records (10 fields, 100 bytes each, plus key)

Request distribution: zipfian

80GB

2min each run

recordcount=80000000
operationcount=5000000
workload=core

readallfields=true

readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0

requestdistribution=zipfian

Load phase; DB size is 80GB


Yahoo! Cloud System Benchmark

Workload C: Read only

Application example: user profile cache, where profiles are constructed elsewhere (e.g., Hadoop)

Read/update ratio: 100/0

Default data size: 1 KB records (10 fields, 100 bytes each, plus key)

Request distribution: zipfian

80GB

2min each run

recordcount=160000000
operationcount=5000000
workload=core

readallfields=true

readproportion=1
updateproportion=0
scanproportion=0
insertproportion=0

requestdistribution=zipfian

Run phase; DB size is 160GB

The results this time are as follows (note a rather unexpected finding):

################## data ##################
------------------------------previous:------------------------------
rowlabels, finish_time, knobs, metrics
1 , 2019-08-21 20:11:14 , [3. 0. 0. 1. 0.] , [4.17141000e+04 1.22700000e+04 8.63951981e+10 3.22600000e+01]
2 , 2019-08-21 21:01:50 , [1. 2. 1. 0. 1.] , [1.84393000e+04 2.77530000e+04 8.76023557e+10 0.00000000e+00]
3 , 2019-08-21 21:52:36 , [3. 2. 1. 2. 1.] , [1.7525800e+04 2.9193000e+04 8.6489484e+10 0.0000000e+00]
4 , 2019-08-21 22:57:46 , [3. 2. 1. 4. 0.] , [3.1377400e+04 1.6311000e+04 8.6011209e+10 3.3480000e+01]
5 , 2019-08-22 00:03:06 , [2. 2. 1. 3. 0.] , [3.57383000e+04 1.43210000e+04 8.60386882e+10 3.97000000e+01]
6 , 2019-08-22 00:53:29 , [2. 3. 1. 3. 1.] , [1.78001000e+04 2.87420000e+04 8.64059038e+10 0.00000000e+00]
7 , 2019-08-22 01:58:09 , [2. 3. 1. 1. 0.] , [4.29348000e+04 1.19210000e+04 8.64341341e+10 3.46900000e+01]
8 , 2019-08-22 03:03:35 , [1. 1. 1. 2. 0.] , [3.66121000e+04 1.39810000e+04 8.61325773e+10 4.39700000e+01]
9 , 2019-08-22 04:09:05 , [2. 2. 0. 3. 0.] , [3.59254000e+04 1.42450000e+04 8.60441991e+10 4.09100000e+01]
10 , 2019-08-22 05:14:04 , [2. 2. 1. 0. 0.] , [4.3455900e+04 1.1779000e+04 8.6827261e+10 4.7180000e+01]
11 , 2019-08-22 06:19:28 , [2. 1. 0. 1. 0.] , [4.32743000e+04 1.18270000e+04 8.64087651e+10 2.76900000e+01]
12 , 2019-08-22 07:25:29 , [2. 2. 1. 0. 0.] , [4.37505000e+04 1.16980000e+04 8.68377817e+10 3.66300000e+01]
13 , 2019-08-22 08:30:47 , [2. 0. 0. 1. 0.] , [4.09163000e+04 1.25110000e+04 8.63941229e+10 4.32500000e+01]
14 , 2019-08-22 09:36:14 , [2. 2. 0. 0. 0.] , [4.36414000e+04 1.17270000e+04 8.68246281e+10 4.70600000e+01]
15 , 2019-08-22 10:41:49 , [2. 1. 1. 0. 0.] , [4.29599000e+04 1.19140000e+04 8.68228424e+10 4.91600000e+01]
16 , 2019-08-22 11:47:07 , [2. 1. 0. 0. 0.] , [4.28121000e+04 1.19560000e+04 8.68404965e+10 4.76900000e+01]
17 , 2019-08-22 12:53:03 , [2. 2. 1. 0. 0.] , [4.33225000e+04 1.18150000e+04 8.68531672e+10 4.67000000e+01]
18 , 2019-08-22 13:58:13 , [3. 1. 0. 0. 0.] , [4.42762000e+04 1.15600000e+04 8.68428438e+10 3.25600000e+01]
19 , 2019-08-22 15:03:52 , [2. 2. 1. 0. 0.] , [4.43796000e+04 1.15330000e+04 8.68332426e+10 3.37300000e+01]
20 , 2019-08-22 16:09:26 , [3. 1. 0. 0. 0.] , [4.24397000e+04 1.20590000e+04 8.68403016e+10 5.07300000e+01]
21 , 2019-08-22 17:14:34 , [2. 2. 1. 0. 0.] , [4.35737000e+04 1.17460000e+04 8.68471932e+10 4.73400000e+01]
22 , 2019-08-22 18:19:47 , [3. 1. 0. 0. 0.] , [4.28986000e+04 1.19310000e+04 8.68300705e+10 4.80600000e+01]
23 , 2019-08-22 19:25:22 , [2. 2. 1. 0. 0.] , [4.34617000e+04 1.17780000e+04 8.68395239e+10 4.80400000e+01]
24 , 2019-08-22 20:31:11 , [2. 1. 0. 0. 0.] , [4.32535000e+04 1.18330000e+04 8.68426298e+10 4.46100000e+01]
25 , 2019-08-22 21:36:29 , [3. 2. 1. 0. 0.] , [4.30494000e+04 1.18900000e+04 8.68364294e+10 4.78600000e+01]
26 , 2019-08-22 22:42:20 , [2. 1. 0. 0. 0.] , [4.27872000e+04 1.19630000e+04 8.68309331e+10 4.76100000e+01]
27 , 2019-08-22 23:47:42 , [3. 2. 0. 0. 0.] , [4.32865000e+04 1.18250000e+04 8.68361102e+10 4.83400000e+01]
28 , 2019-08-23 00:53:08 , [2. 1. 1. 0. 0.] , [4.29929000e+04 1.19080000e+04 8.68338814e+10 5.06200000e+01]
29 , 2019-08-23 01:58:37 , [2. 2. 0. 0. 0.] , [4.36637000e+04 1.17220000e+04 8.67981041e+10 4.49300000e+01]
30 , 2019-08-23 03:03:42 , [3. 1. 1. 0. 0.] , [4.30542000e+04 1.18890000e+04 8.68628124e+10 5.10200000e+01]
31 , 2019-08-23 04:09:01 , [2. 2. 0. 0. 0.] , [4.31552000e+04 1.18600000e+04 8.68568929e+10 5.26200000e+01]
32 , 2019-08-23 05:13:59 , [3. 1. 1. 0. 0.] , [4.29512000e+04 1.19180000e+04 8.68360587e+10 5.17800000e+01]
33 , 2019-08-23 06:19:15 , [2. 2. 0. 0. 0.] , [4.34998000e+04 1.17670000e+04 8.68505644e+10 4.75000000e+01]
34 , 2019-08-23 07:24:36 , [3. 1. 1. 0. 0.] , [4.29066000e+04 1.19310000e+04 8.68417278e+10 4.94600000e+01]
35 , 2019-08-23 08:30:13 , [2. 2. 0. 0. 0.] , [4.37385000e+04 1.17030000e+04 8.68307716e+10 4.26100000e+01]
36 , 2019-08-23 09:34:58 , [3. 1. 1. 0. 0.] , [4.29117000e+04 1.19300000e+04 8.68479672e+10 4.71600000e+01]
37 , 2019-08-23 10:40:21 , [2. 2. 0. 0. 0.] , [4.30777000e+04 1.18810000e+04 8.68356132e+10 4.95800000e+01]
38 , 2019-08-23 11:45:43 , [3. 1. 1. 0. 0.] , [4.36291000e+04 1.17310000e+04 8.68428416e+10 4.08700000e+01]
39 , 2019-08-23 12:51:25 , [2. 2. 0. 0. 0.] , [4.36237000e+04 1.17360000e+04 8.68353864e+10 4.00500000e+01]
40 , 2019-08-23 13:57:10 , [3. 1. 1. 0. 0.] , [4.39189000e+04 1.16570000e+04 8.68385229e+10 3.60400000e+01]
------------------------------new:------------------------------
knobs: [[2. 2. 0. 0. 0.]]
metrics: [[4.36609000e+04 1.17230000e+04 8.68364011e+10 4.77100000e+01]]
rowlabels: [1]
timestamp: 2019-08-23 15:02:11
------------------------------TARGET:------------------------------
knob: ['rocksdb.writecf.bloom-filter-bits-per-key'
'rocksdb.defaultcf.bloom-filter-bits-per-key'
'rocksdb.writecf.optimize-filters-for-hits'
'rocksdb.defaultcf.block-size'
'rocksdb.defaultcf.disable-auto-compactions']
metric: get_throughput

metric_lessisbetter: 0

num of knobs == 5
knobs: ['rocksdb.writecf.bloom-filter-bits-per-key'
'rocksdb.defaultcf.bloom-filter-bits-per-key'
'rocksdb.writecf.optimize-filters-for-hits'
'rocksdb.defaultcf.block-size'
'rocksdb.defaultcf.disable-auto-compactions']
num of metrics == 4

metrics: ['get_throughput' 'get_latency' 'store_size' 'compaction_cpu']

################## data ##################

The recommended configurations mostly converged to the following two:

  • {3, 1, 1, 0, 0}

    • rocksdb.writecf.bloom-filter-bits-per-key = 20
      rocksdb.defaultcf.bloom-filter-bits-per-key = 10
      rocksdb.writecf.optimize-filters-for-hits = True
      rocksdb.defaultcf.block-size = 4KB
      rocksdb.defaultcf.disable-auto-compactions = False
  • {2, 2, 0, 0, 0}

    • rocksdb.writecf.bloom-filter-bits-per-key = 15
      rocksdb.defaultcf.bloom-filter-bits-per-key = 15
      rocksdb.writecf.optimize-filters-for-hits = False
      rocksdb.defaultcf.block-size = 4KB
      rocksdb.defaultcf.disable-auto-compactions = False

Our reading is that the write CF is relatively small, so when the block cache is large enough, the bloom filter makes little observable difference.

Looking closely at the results and comparing the following two samples reveals something quite surprising:

30 , 2019-08-23 03:03:42 , [3. 1. 1. 0. 0.] , [4.30542000e+04 1.18890000e+04 8.68628124e+10 5.10200000e+01]
20 , 2019-08-22 16:09:26 , [3. 1. 0. 0. 0.] , [4.24397000e+04 1.20590000e+04 8.68403016e+10 5.07300000e+01]

The only knob difference between them is that sample 30 disables the bottom-level bloom filter (optimize-filters-for-hits == True) while sample 20 enables it (optimize-filters-for-hits == False). Yet sample 20's throughput is slightly lower than sample 30's, the exact opposite of what we expected. So we opened Grafana and examined the charts for the time windows of these two samples:

(block-cache-size was 12.8GB during the run phase in both cases; we omit those charts for brevity)

In the charts, the pink vertical line separates the load phase (left) from the run phase (right). The cache hit rates in the two cases are actually close, with sample 20 even slightly lower. This happens because the bloom filter itself takes space: if the block cache was already adequate and the bloom filter is relatively large, the filter crowds out cache space and lowers the hit rate. We genuinely did not anticipate this, and it is actually good news: it shows the ML model can surface effects that humans would not think of.

Next we experimented with short range scans. This time the metric to optimize is scan latency:

workload=shortscan    knobs={'bloom-filter-bits-per-key', 'optimize-filters-for-hits', 'block-size', 'disable-auto-compactions'}    metric=scan_latency


Yahoo! Cloud System Benchmark

Workload E: Short ranges

Application example: threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

Scan/insert ratio: 95/5

Default data size: 1 KB records (10 fields, 100 bytes each, plus key)

Request distribution: zipfian

The insert order is hashed, not ordered. Although the scans are ordered, it does not necessarily

follow that the data is inserted in order. For example, posts for thread 342 may not be inserted contiguously, but

instead interspersed with posts from lots of other threads. The way the YCSB client works is that it will pick a start

key, and then request a number of records; this works fine even for hashed insertion.

recordcount=80000000
operationcount=200000
workload=core

readallfields=true

readproportion=0
updateproportion=0
scanproportion=1
insertproportion=0

requestdistribution=uniform

minscanlength=100
maxscanlength=100

scanlengthdistribution=uniform

shortscan workload definition

The experiment results are as follows:

################## data ##################
------------------------------previous:------------------------------
rowlabels, finish_time, knobs, metrics
1 , 2019-08-24 18:29:05 , [1. 1. 1. 2. 1.] , [6.72800000e+02 7.53744000e+05 8.64420017e+10 2.40000000e-01]
2 , 2019-08-24 19:20:03 , [0. 3. 1. 3. 1.] , [6.1490000e+02 8.2401700e+05 8.6410917e+10 0.0000000e+00]
3 , 2019-08-24 20:10:54 , [3. 0. 0. 0. 1.] , [6.64200000e+02 7.62370000e+05 8.74716093e+10 0.00000000e+00]
4 , 2019-08-24 21:14:30 , [0. 1. 0. 1. 0.] , [4.05440000e+03 1.25855000e+05 8.64184132e+10 2.80100000e+01]
5 , 2019-08-24 22:18:31 , [2. 1. 0. 3. 0.] , [4.23970000e+03 1.20196000e+05 8.60256954e+10 3.74100000e+01]
6 , 2019-08-24 23:08:55 , [0. 0. 0. 1. 1.] , [7.07000000e+02 7.16597000e+05 8.68539722e+10 0.00000000e+00]
7 , 2019-08-25 00:12:24 , [2. 1. 0. 3. 0.] , [4.5478000e+03 1.1218900e+05 8.6033236e+10 2.5120000e+01]
8 , 2019-08-25 01:04:55 , [1. 3. 1. 4. 1.] , [4.96200000e+02 1.02048400e+06 8.63227618e+10 0.00000000e+00]
9 , 2019-08-25 01:56:06 , [3. 3. 1. 0. 1.] , [6.6310000e+02 7.6451400e+05 8.7654137e+10 0.0000000e+00]
10 , 2019-08-25 02:47:01 , [3. 3. 1. 2. 1.] , [6.66900000e+02 7.60646000e+05 8.65341307e+10 0.00000000e+00]
11 , 2019-08-25 03:51:18 , [1. 1. 0. 2. 0.] , [4.19610000e+03 1.21614000e+05 8.60931486e+10 2.51200000e+01]
12 , 2019-08-25 04:55:47 , [2. 0. 0. 3. 0.] , [4.3978000e+03 1.1592900e+05 8.6036505e+10 3.6290000e+01]
13 , 2019-08-25 05:59:51 , [1. 1. 0. 3. 0.] , [4.35150000e+03 1.17180000e+05 8.60368063e+10 3.63800000e+01]
14 , 2019-08-25 07:03:58 , [2. 0. 0. 2. 0.] , [3.77810000e+03 1.35018000e+05 8.60859856e+10 3.57900000e+01]
15 , 2019-08-25 08:07:51 , [1. 0. 0. 3. 0.] , [4.66590000e+03 1.09339000e+05 8.60241768e+10 2.76200000e+01]
16 , 2019-08-25 09:11:58 , [2. 1. 0. 2. 0.] , [4.09160000e+03 1.24662000e+05 8.60801061e+10 2.85700000e+01]
17 , 2019-08-25 10:16:10 , [0. 0. 0. 2. 0.] , [4.05350000e+03 1.25774000e+05 8.60802488e+10 2.62900000e+01]
18 , 2019-08-25 11:20:09 , [1. 0. 0. 3. 0.] , [4.68850000e+03 1.08877000e+05 8.59966196e+10 2.37400000e+01]
19 , 2019-08-25 12:24:28 , [0. 2. 0. 2. 0.] , [4.25840000e+03 1.19757000e+05 8.60873241e+10 2.42100000e+01]
20 , 2019-08-25 13:29:06 , [1. 0. 0. 2. 0.] , [3.77300000e+03 1.35303000e+05 8.60943509e+10 3.77800000e+01]
21 , 2019-08-25 14:33:43 , [0. 1. 0. 3. 0.] , [4.67830000e+03 1.09096000e+05 8.60373353e+10 2.58500000e+01]
22 , 2019-08-25 15:37:49 , [1. 0. 0. 3. 0.] , [4.72760000e+03 1.07929000e+05 8.60229122e+10 2.41700000e+01]
23 , 2019-08-25 16:42:13 , [0. 1. 0. 2. 0.] , [3.83190000e+03 1.33200000e+05 8.61015852e+10 3.75200000e+01]
24 , 2019-08-25 17:46:31 , [0. 0. 0. 4. 0.] , [4.80830000e+03 1.06059000e+05 8.59515848e+10 3.18500000e+01]
25 , 2019-08-25 18:50:39 , [1. 0. 0. 3. 0.] , [4.51200000e+03 1.13177000e+05 8.60177759e+10 3.22500000e+01]
26 , 2019-08-25 19:54:26 , [0. 2. 0. 4. 0.] , [4.86770000e+03 1.04802000e+05 8.59837067e+10 3.25800000e+01]
27 , 2019-08-25 20:58:22 , [1. 0. 0. 4. 0.] , [4.9614000e+03 1.0285500e+05 8.5950186e+10 3.1870000e+01]
28 , 2019-08-25 22:02:31 , [0. 0. 0. 3. 0.] , [4.37540000e+03 1.16648000e+05 8.60301063e+10 3.36500000e+01]
29 , 2019-08-25 23:06:31 , [1. 2. 0. 4. 0.] , [4.95800000e+03 1.03017000e+05 8.60147679e+10 3.06400000e+01]
30 , 2019-08-26 00:10:15 , [1. 0. 0. 4. 0.] , [5.20820000e+03 9.80490000e+04 8.59992036e+10 3.10200000e+01]
31 , 2019-08-26 01:09:36 , [1. 3. 0. 3. 0.] , [4.63750000e+03 1.10141000e+05 8.60371023e+10 3.01500000e+01]
32 , 2019-08-26 02:10:54 , [1. 1. 0. 4. 0.] , [4.89860000e+03 1.04158000e+05 8.59848252e+10 3.12700000e+01]
33 , 2019-08-26 03:12:48 , [1. 0. 0. 3. 0.] , [4.54700000e+03 1.12233000e+05 8.60197859e+10 3.15300000e+01]
34 , 2019-08-26 04:15:28 , [2. 2. 0. 4. 0.] , [4.95670000e+03 1.02892000e+05 8.60205523e+10 3.21900000e+01]
35 , 2019-08-26 05:18:03 , [1. 0. 0. 4. 0.] , [4.82490000e+03 1.05684000e+05 8.59840325e+10 3.27900000e+01]
36 , 2019-08-26 06:20:38 , [3. 1. 0. 4. 0.] , [4.98140000e+03 1.02350000e+05 8.59992772e+10 3.16700000e+01]
37 , 2019-08-26 07:23:21 , [1. 0. 0. 4. 0.] , [4.97320000e+03 1.02554000e+05 8.59940724e+10 3.17100000e+01]
38 , 2019-08-26 08:26:04 , [3. 3. 0. 3. 0.] , [4.59460000e+03 1.11100000e+05 8.60488145e+10 2.85000000e+01]
39 , 2019-08-26 09:28:30 , [2. 0. 0. 4. 0.] , [4.85840000e+03 1.05104000e+05 8.59982211e+10 3.17800000e+01]
40 , 2019-08-26 10:31:31 , [2. 3. 0. 2. 0.] , [4.13200000e+03 1.23462000e+05 8.61029034e+10 2.78400000e+01]
41 , 2019-08-26 11:35:06 , [1. 0. 0. 4. 0.] , [5.00720000e+03 1.01956000e+05 8.60064623e+10 3.17800000e+01]
42 , 2019-08-26 12:38:18 , [3. 0. 0. 4. 0.] , [4.87100000e+03 1.04930000e+05 8.59962461e+10 3.14800000e+01]
43 , 2019-08-26 13:41:29 , [1. 0. 0. 4. 0.] , [4.9381000e+03 1.0334100e+05 8.6066299e+10 3.2380000e+01]
44 , 2019-08-26 14:44:25 , [2. 1. 0. 4. 0.] , [5.01210000e+03 1.01852000e+05 8.59967147e+10 3.18600000e+01]
45 , 2019-08-26 15:47:21 , [1. 0. 0. 4. 0.] , [4.86200000e+03 1.04912000e+05 8.60001832e+10 3.25000000e+01]
------------------------------new:------------------------------
knobs: [[3. 0. 1. 4. 0.]]
metrics: [[5.02470000e+03 1.01642000e+05 8.59832276e+10 3.08800000e+01]]
rowlabels: [1]
timestamp: 2019-08-26 16:50:32
------------------------------TARGET:------------------------------
knob: ['rocksdb.writecf.bloom-filter-bits-per-key'
'rocksdb.defaultcf.bloom-filter-bits-per-key'
'rocksdb.writecf.optimize-filters-for-hits'
'rocksdb.defaultcf.block-size'
'rocksdb.defaultcf.disable-auto-compactions']
metric: scan_latency

metric_lessisbetter: 1

num of knobs == 5
knobs: ['rocksdb.writecf.bloom-filter-bits-per-key'
'rocksdb.defaultcf.bloom-filter-bits-per-key'
'rocksdb.writecf.optimize-filters-for-hits'
'rocksdb.defaultcf.block-size'
'rocksdb.defaultcf.disable-auto-compactions']
num of metrics == 4

metrics: ['scan_throughput' 'scan_latency' 'store_size' 'compaction_cpu']

################## data ##################

由于时间有限,我们先看前 45 轮的结果。推荐结果尚未完全收敛,但基本满足 optimize-filters-for-hits==False、block-size==32KB 或 64KB、disable-auto-compactions==False,这三个也是对结果影响最明显的参数。根据 Intel 的 SSD 白皮书,SSD 对 32KB 和 64KB 大小的随机读性能其实相差不大;bloom filter 的位数对 scan 操作的影响也不大。这个实验结果符合预期。
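按照上述推荐结果,对应到 tikv.yml 的配置片段大致如下(仅为示意,具体字段名和取值格式请以 TiKV 配置文档为准):

```yaml
# 对应前 45 轮推荐结果的 tikv.yml 片段(示意)
rocksdb:
  defaultcf:
    block-size: "32KB"               # 32KB 或 64KB,SSD 上随机读性能接近
    disable-auto-compactions: false  # 保持自动 compaction 开启
  writecf:
    optimize-filters-for-hits: false
```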

之后我们还会测试 long scan 场景下的结果。


与OtterTune的不同点

我们的试验场景和 OtterTune 还是有一些区别的,主要集中在以下几点:

  • AutoTikv 直接和 DB 运行在同一台机器上,而不是像 OtterTune 一样设置一个集中式的训练服务器。但其实这样并不会占用很多资源,还避免了不同机器配置不一样造成数据不一致的问题。
  • 省去了 workload mapping(OtterTune 加了这一步来从 repository 中挑出和当前 workload 最像的训练样本,而我们目前默认 workload 类型只有一种)
  • 要调的 knobs 比较少,省去了 identify important knobs(OtterTune 是通过 Lasso Regression 选出 10 个最重要的 knob 进行调优)
  • 另外我们重构了 OtterTune 的架构,减少了对具体数据库系统的耦合度。更方便将整个模型和 pipeline 移植到其他系统上(只需修改 controller.py 中具体操作数据库系统的语句即可,其它都不用修改),也更适合比起 SQL 更加轻量的 KV 数据库。
  • 最后我们顺手解决了 OtterTune 中只能调整 global knob,无法调节不同 session 中同名 knob 的问题。
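结合前文的 knob 声明格式,不同 session 中的同名 knob 只需用点分隔的全名加以区分即可。以下是一个示意(knob 的取值范围均为举例假设,并非实际推荐值):

```python
# controller.py 中 knob 声明的示意:knob 名用点分隔 session 名和 knob 名,
# 因此 defaultcf 和 writecf 两个 session 下的同名 knob 可以分别声明、分别调节。
knob_dict = {
    "rocksdb.defaultcf.block-size": {
        "changebyyml": True, "set_func": None,
        "minval": 4, "maxval": 64, "enumval": [],
        "type": "int", "default": 64,
    },
    "rocksdb.writecf.block-size": {
        "changebyyml": True, "set_func": None,
        "minval": 4, "maxval": 64, "enumval": [],
        "type": "int", "default": 64,
    },
}

# 从完整 knob 名解析出各级名字,便于写回 tikv.yml 对应的 section
for name in knob_dict:
    db, session, knob = name.split(".", 2)
    print(db, session, knob)
```

这样在把模型移植到其它系统时,只需按该系统的配置层级约定命名 knob,模型本身无需感知 session 概念。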

Ref: 正式开始编码之前对 OtterTune 的一些详细解析

一些扩展思路

由于时间有限,这里只实现了一些很基础的功能。围绕这个模型还有很多可以扩展的地方。这里记录几个扩展思路:

  • Q:如何动态适应不断变化的 workload?(比如一会读一会写)
  • A:可以根据训练样本的 timestamp 设置一个阈值,很久远的就丢弃掉
  • Q:有时候 ML 模型有可能陷入局部最优(尝试的 knob 组合不全,局限在若干个当前效果还不错的组合里循环推荐)
  • A:前面讲过在贝叶斯优化中,exploration 和 exploitation 之间有一个 trade-off,由采集函数中的一个系数 k 决定。后面会尝试调节这个系数。
  • 目前对于 enum 类型的 knob,在 ML model 里是以离散化后的数值形式存储的(比如 0, 1, 2, 3)。如果后面出现了没有明确大小关系的 enum 类型 knob,需要改成 one-hot 编码。
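上面提到的系数 k 的作用,可以用一个简化的 UCB(Upper Confidence Bound)采集函数来说明(以下仅为示意代码,并非 AutoTikv 的实际实现;其中的均值、标准差数值均为假设):

```python
import numpy as np

def ucb_acquisition(mu, sigma, k):
    """UCB 采集函数:mu 代表 exploitation(利用已知收益),
    sigma 代表 exploration(探索不确定区域),k 控制二者的权重。"""
    return mu + k * sigma

# 假设 GP 对 3 个候选 knob 组合估计出的均值和标准差(数值为示意)
mu = np.array([0.8, 0.6, 0.5])      # 预估收益
sigma = np.array([0.05, 0.3, 0.5])  # 不确定性(数据越少越大)

# k 较小时偏向 exploitation:推荐当前预估收益最高的点
print(np.argmax(ucb_acquisition(mu, sigma, k=0.1)))  # -> 0
# k 较大时偏向 exploration:推荐不确定性大的未知区域
print(np.argmax(ucb_acquisition(mu, sigma, k=2.0)))  # -> 2
```

可以看到,调大 k 会让推荐从"当前最优"转向"最未知"的区域,这正是前文所说的避免局部最优的手段。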

总结

一个复杂的系统需要在很多环节做取舍和平衡,才能使总体运行效果达到最好,这需要对整个系统的各个环节都有很深入的理解。调试 AutoTikv 的时候也发现过很多参数设置的结果并不符合预期的情况,后来仔细分析了 Grafana 中的图表才发现其中的一些门道:

  • 有些参数对结果的影响并没有很大。比如这个参数起作用的场景根本没有触发,或者说和它相关的硬件并没有出现性能瓶颈
  • 有些参数直接动态调整是达不到效果的,或者需要跑足够长时间的 workload 才能看出效果。例如 block cache size 刚从小改大的一小段时间肯定是装不满的,必须要等 workload 足够把它填满之后,才能看出大缓存对总体 cache hit 的提升效果
  • 有些参数的效果和预期相反,分析了发现该参数其实是有副作用的,在某些场景下就不大行了(比如上面的 bloom filter 那个例子)
  • 有些 workload 并不是完全的读或者写,还会掺杂一些别的操作。而人工判断预期效果的时候很可能忽略这一点(比如上面的 writeheavy)。特别是在实际生产环境中,DBA 并不能提前知道会遇到什么样的 workload。这大概也就是自动调优的作用吧

Ref:

贝叶斯优化

https://blog.csdn.net/Leon_winter/article/details/86604553

https://blog.csdn.net/a769096214/article/details/80920304

https://docs.google.com/document/d/1raibF5LLmmYvfYo8rMK_TP4EJPDj2RzlSZFp1a3ligU/edit?ts=5ce5c60a#heading=h.losu3j60zo6r