8. Ceph Basics - Common Day-to-Day Operations

This article is reposted from: https://mp.weixin.qq.com/s?__biz=MzI1MDgwNzQ1MQ==&mid=2247485300&idx=1&sn=aacff9f7be24a68e0dacdebb03809828&chksm=e9fdd280de8a5b961c994219006f73b94b4b754852f30aee0b233507133d8abaa9112b58ef21&scene=178&cur_album_id=1600845417376776197#rd

Starting and Stopping Ceph Component Services

If the Ceph storage cluster is deployed on CentOS 7, you can start and stop it following RUNNING CEPH WITH SYSTEMD; on Ubuntu you can use RUNNING CEPH WITH UPSTART; and RUNNING CEPH WITH SYSVINIT is another way to manage the service processes of the various cluster components. Each method lets you manage services at different granularities. Official docs: https://docs.ceph.com/en/latest/rados/operations/operating/. Since CentOS 7 is used here, only the CentOS 7 operations are shown below.

RUNNING CEPH WITH SYSTEMD

For all distributions that support systemd (CentOS 7, Fedora, Debian Jessie 8 and later, SUSE), ceph daemons are now managed using native systemd files instead of the legacy sysvinit scripts. For example:

sudo systemctl start ceph.target       # start all daemons
sudo systemctl status ceph-osd@12      # check status of osd.12

To list the Ceph systemd units on a node, execute:

sudo systemctl status ceph\*.service ceph\*.target

STARTING ALL DAEMONS

To start all daemons on a Ceph Node (irrespective of type), execute the following:

sudo systemctl start ceph.target

STOPPING ALL DAEMONS

To stop all daemons on a Ceph Node (irrespective of type), execute the following:

sudo systemctl stop ceph\*.service ceph\*.target

STARTING ALL DAEMONS BY TYPE

To start all daemons of a particular type on a Ceph Node, execute one of the following:

sudo systemctl start ceph-osd.target
sudo systemctl start ceph-mon.target
sudo systemctl start ceph-mds.target

STOPPING ALL DAEMONS BY TYPE

To stop all daemons of a particular type on a Ceph Node, execute one of the following:

sudo systemctl stop ceph-mon\*.service ceph-mon.target
sudo systemctl stop ceph-osd\*.service ceph-osd.target
sudo systemctl stop ceph-mds\*.service ceph-mds.target

STARTING A DAEMON

To start a specific daemon instance on a Ceph Node, execute one of the following:

sudo systemctl start ceph-osd@{id}
sudo systemctl start ceph-mon@{hostname}
sudo systemctl start ceph-mds@{hostname}

For example:

sudo systemctl start ceph-osd@1
sudo systemctl start ceph-mon@ceph-server
sudo systemctl start ceph-mds@ceph-server

STOPPING A DAEMON

To stop a specific daemon instance on a Ceph Node, execute one of the following:

sudo systemctl stop ceph-osd@{id}
sudo systemctl stop ceph-mon@{hostname}
sudo systemctl stop ceph-mds@{hostname}

For example:

sudo systemctl stop ceph-osd@1
sudo systemctl stop ceph-mon@ceph-server
sudo systemctl stop ceph-mds@ceph-server

Ceph Component Service Logs

When troubleshooting, you will often need to look at logs. Where do the Ceph components write their logs by default? In the /var/log/ceph directory; you can also customize the log path in the configuration file.
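As a minimal sketch (the file names below follow the default $cluster-$name.log pattern, so adjust the OSD id and paths to your own cluster), you can list and follow the logs, or point them at another directory in ceph.conf:

ls /var/log/ceph/                                 # one log file per local daemon
tail -f /var/log/ceph/ceph-osd.0.log              # follow the log of osd.0
# optional ceph.conf override (restart the daemon afterwards):
# [global]
# log file = /data/logs/ceph/$cluster-$name.log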

1. Simulate a failure by stopping an OSD

[root@ceph-node01 ceph]# systemctl stop ceph-osd@0

2. Check the OSD status

[root@ceph-node01 ceph]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.39067 root default
-3 0.09769 host ceph-node01
 0 hdd 0.09769 osd.0 down 1.00000 1.00000
-5 0.09769 host ceph-node02
 1 hdd 0.09769 osd.1 up 1.00000 1.00000
-7 0.19530 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
[root@ceph-node01 ceph]#

You can see that osd.0 is down. Next, let's check the logs for the cause.

3. Check the logs

From the error log you can see that the shutdown was initiated through systemd.
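A hedged sketch of how the logs for osd.0 can be inspected on this node (the unit and file names follow the defaults used above):

journalctl -u ceph-osd@0 -n 50 --no-pager         # last 50 journal entries for the osd.0 unit
tail -n 50 /var/log/ceph/ceph-osd.0.log           # Ceph's own log file for osd.0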

4. Start the OSD

[root@ceph-node01 ceph]# systemctl start ceph-osd@0
[root@ceph-node01 ceph]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.39067 root default
-3 0.09769 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
-5 0.09769 host ceph-node02
 1 hdd 0.09769 osd.1 up 1.00000 1.00000
-7 0.19530 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
[root@ceph-node01 ceph]#

Note that this only simulated a simple error and showed how to look at the logs. Real production problems are far harder than you might imagine. Just remember: when you hit a problem, don't panic. As long as the service is not affected (keeping the service available always comes first), checking the logs is the first thing to do.

OSD Expansion

There are two kinds of expansion. One is horizontal expansion (scale out), i.e. adding nodes; with this approach pay attention to the preparation work: time synchronization, key trust, yum repositories, SELinux, the firewall, package installation, and so on. The other is vertical expansion (scale up), i.e. adding disks to existing nodes. Official docs: https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/
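For scale out, the preparation of a new node might look roughly like the sketch below; ceph-node04 is a hypothetical new node, and the repository, NTP and access details depend on your environment:

# on the new node (hypothetical ceph-node04): firewall, SELinux, time sync
systemctl disable --now firewalld
setenforce 0 && sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
yum install -y chrony && systemctl enable --now chronyd
# on the deploy node: key trust and Ceph package installation
ssh-copy-id root@ceph-node04
ceph-deploy install --no-adjust-repos ceph-node04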

1. List the available disks

[root@ceph-node01 ceph-deploy]# ceph-deploy disk list ceph-node01

2. If the disk has been used before and still has partitions, it is best to wipe it first

[root@ceph-node01 ceph-deploy]# fdisk -l /dev/vdc

Disk /dev/vdc: 107.4 GB, 107374182400 bytes, 209715200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

[root@ceph-node01 ceph-deploy]# ceph-deploy disk zap ceph-node01 /dev/vdc
。。。
[root@ceph-node01 ceph-deploy]#

3. Add an OSD

[root@ceph-node01 ceph-deploy]# ceph-deploy osd create ceph-node01 --data /dev/vdc
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (2.0.1):
。。。
[ceph-node01][INFO ] Running command: /bin/ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host ceph-node01 is now ready for osd use.
[root@ceph-node01 ceph-deploy]#

4. Four disks are added in total

[root@ceph-node01 ceph-deploy]# ceph-deploy osd create ceph-node01 --data /dev/vdc
[root@ceph-node01 ceph-deploy]# ceph-deploy osd create ceph-node02 --data /dev/vdc
[root@ceph-node01 ceph-deploy]# ceph-deploy osd create ceph-node03 --data /dev/vdc
[root@ceph-node01 ceph-deploy]# ceph-deploy osd create ceph-node03 --data /dev/vdd

5. Check the result of the expansion

[root@ceph-node01 ceph-deploy]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78142 root default
-3 0.19537 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
 3 hdd 0.09769 osd.3 up 1.00000 1.00000
-5 0.19537 host ceph-node02
 1 hdd 0.09769 osd.1 up 1.00000 1.00000
 4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.39067 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
 5 hdd 0.09769 osd.5 up 1.00000 1.00000
 6 hdd 0.09769 osd.6 up 1.00000 1.00000
[root@ceph-node01 ceph-deploy]#

Data Rebalancing

1. Data rebalancing

[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_WARN
            Degraded data redundancy: 1366/3003 objects degraded (45.488%), 41 pgs degraded, 10 pgs undersized
            1 pools have too many placement groups

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 25h)
    mgr: ceph-node01(active, since 21h), standbys: ceph-node02, ceph-node03
    osd: 7 osds: 7 up (since 3m), 7 in (since 3m); 39 remapped pgs
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 320 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 18 GiB used, 782 GiB / 800 GiB avail
    pgs: 1366/3003 objects degraded (45.488%)
             475/3003 objects misplaced (15.818%)
             250 active+clean
             30 active+recovery_wait+degraded
             27 active+remapped+backfill_wait
             9 active+recovery_wait+undersized+degraded+remapped
             2 active+recovery_wait+degraded+remapped
             1 active+recovering+undersized+remapped
             1 active+recovery_wait

  io:
    client: 5.4 KiB/s rd, 0 B/s wr, 5 op/s rd, 3 op/s wr
    recovery: 9.4 MiB/s, 8 objects/s

[root@ceph-node01 ceph-deploy]#

Right after an expansion finishes, the data is rebalanced automatically. We originally had 3 OSDs and now have 7. PGs live on OSDs and objects live inside PGs, so after the OSDs are added the cluster automatically migrates some PGs from the original 3 OSDs onto the 4 new ones, keeping the number of PGs per OSD roughly even across the cluster. Note that what moves during the expansion is PGs, not individual objects: moving objects would require an enormous amount of computation, whereas moving PGs is much cheaper, which is why Ceph's rebalancing is designed around PGs rather than objects. Why does the migration happen at all? Because the new OSDs report their state to the monitors, the cluster learns that the OSD map has changed, and this triggers rebalancing to keep the PG count per OSD roughly balanced. In production, expanding OSDs therefore means PG migration, which can hurt performance: if you add many OSDs at once, a large share of the cluster's PGs (and the large amount of data inside them) has to move, and client workloads may suffer while the rebalancing runs. To minimize the impact, avoid adding too many OSDs at once; adding them one at a time is the safer approach.
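To keep an eye on the migration while it runs, a few read-only commands are usually enough (a sketch; the grep simply hides PGs that are already healthy):

watch -n1 'ceph -s'                                           # overall progress and recovery throughput
ceph pg dump pgs_brief 2>/dev/null | grep -v 'active+clean'   # PGs that are still moving
ceph osd df                                                   # check that PG counts and usage even out across OSDs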

2. OSD rebalancing threads

[root@ceph-node01 ceph-deploy]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config show |grep osd_max_backfills
    "osd_max_backfills": "1",
[root@ceph-node01 ceph-deploy]#

The osd_max_backfills value here means each OSD runs at most one concurrent backfill operation for data migration. In earlier releases this value was larger, so migration finished faster but at the cost of some performance; in most cases it is best to just keep the default.
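If you ever do need to trade client performance for faster (or slower) migration, the usual knobs are osd_max_backfills and osd_recovery_max_active. A hedged sketch of adjusting them at runtime; the values are only examples, and the ceph config set form assumes Nautilus or later:

ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 2'   # runtime only, reverts on restart
ceph config set osd osd_max_backfills 1                                          # persistent on Nautilus+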

Data rebalancing travels over the cluster_network, so in production the two networks should be separated so that rebalancing does not interfere with the public_network. If you only have one network and a rebalance is affecting users, you can temporarily disable rebalancing (ceph osd set norebalance).
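A minimal ceph.conf sketch of the two networks (the subnets are placeholders); OSD replication, recovery and backfill traffic then uses cluster_network while client and MON traffic stays on public_network:

[global]
public_network  = 192.168.1.0/24      # placeholder subnet for client/MON traffic
cluster_network = 192.168.2.0/24      # placeholder subnet for OSD replication and rebalancing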

3. Disable rebalancing

[root@ceph-node01 ceph-deploy]# ceph osd set norebalance
norebalance is set
[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_WARN
            norebalance flag(s) set

  services:
   。。。
[root@ceph-node01 ceph-deploy]#

4. You also need to set the nobackfill flag to pause data backfilling

[root@ceph-node01 ceph-deploy]# ceph osd set nobackfill
nobackfill is set
[root@ceph-node01 ceph-deploy]#
[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_WARN
            nobackfill,norebalance flag(s) set

  services:
   。。。

[root@ceph-node01 ceph-deploy]#

With these two flags set, Ceph pauses data rebalancing so that your workloads can access the cluster normally again; when traffic reaches its low point, unset the flags and let the rebalancing continue.

5. Resume rebalancing

[root@ceph-node01 ceph-deploy]# ceph osd unset nobackfill
nobackfill is unset
[root@ceph-node01 ceph-deploy]# ceph osd unset norebalance
norebalance is unset
[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_OK

  services:
   。。。

[root@ceph-node01 ceph-deploy]#

That clears the flags. These are the main points to watch when expanding OSDs.

Removing an OSD

Official docs: https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Removing an OSD is a routine operation in day-to-day maintenance. Suppose one of our OSD disks fails (a hardware fault): we need to take the OSD out of the cluster, repair or replace the disk, and add it back when the repair is done. Usually, when a disk goes bad, you will see the fault messages in dmesg.
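A hedged sketch of checking the kernel log and SMART status for a suspect disk; /dev/vdc is simply the device used elsewhere in this article, and smartctl needs the smartmontools package (virtio disks may not expose SMART data at all):

dmesg -T | grep -iE 'vdc|I/O error'               # kernel messages about the device
smartctl -H /dev/vdc                              # overall SMART health, if the disk reports it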

[root@ceph-node01 ceph-deploy]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78142 root default
-3 0.19537 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
 3 hdd 0.09769 osd.3 up 1.00000 1.00000
-5 0.19537 host ceph-node02
 1 hdd 0.09769 osd.1 up 1.00000 1.00000
 4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.39067 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
 5 hdd 0.09769 osd.5 up 1.00000 1.00000
 6 hdd 0.09769 osd.6 up 1.00000 1.00000
[root@ceph-node01 ceph-deploy]# ceph osd perf
osd commit_latency(ms) apply_latency(ms)
  6 0 0
  5 0 0
  4 0 0
  0 0 0
  1 0 0
  2 0 0
  3 0 0
[root@ceph-node01 ceph-deploy]#

Usually, when a disk has completely failed, ceph osd tree shows the OSD as down, and that case is easy to handle. But if the OSD is still up while the disk has developed bad sectors or similar problems, access latency can rise, performance suffers, and in the worst case the whole cluster can be dragged down. You can check OSD latency with ceph osd perf; if one OSD shows a much higher latency than the rest, its disk is probably about to fail even though it has not failed yet. Sustained high read/write latency can hurt the performance of the entire cluster over time, so this is a small but important detail in day-to-day operations.

1. Simulate a disk failure

[root@ceph-node02 ~]# systemctl stop ceph-osd@1

2. Check the OSD map

[root@ceph-node01 ceph-deploy]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78142 root default
-3 0.19537 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
 3 hdd 0.09769 osd.3 up 1.00000 1.00000
-5 0.19537 host ceph-node02
 1 hdd 0.09769 osd.1 down 1.00000 1.00000
 4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.39067 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
 5 hdd 0.09769 osd.5 up 1.00000 1.00000
 6 hdd 0.09769 osd.6 up 1.00000 1.00000
[root@ceph-node01 ceph-deploy]#

3. Check the cluster status

[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_WARN
            1 osds down
            Degraded data redundancy: 486/3003 objects degraded (16.184%), 58 pgs degraded, 103 pgs undersized

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 26h)
    mgr: ceph-node01(active, since 22h), standbys: ceph-node02, ceph-node03
    osd: 7 osds: 6 up (since 96s), 7 in (since 74m)
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 224 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 15 GiB used, 785 GiB / 800 GiB avail
    pgs: 486/3003 objects degraded (16.184%)
             121 active+clean
             58 active+undersized+degraded
             45 active+undersized

[root@ceph-node01 ceph-deploy]#

At this point data rebalancing will be triggered. Note that it does not start immediately; it only starts after a waiting period.
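The waiting period is governed by mon_osd_down_out_interval (600 seconds by default): only after the OSD has stayed down that long is it marked out and does rebalancing begin. A sketch of checking and, if needed, adjusting it; the ceph config set form assumes Nautilus or later:

ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config show | grep mon_osd_down_out_interval
ceph config set mon mon_osd_down_out_interval 600          # example value, persistent on Nautilus+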

1. Mark the OSD out of the OSD map

[root@ceph-node01 ceph-deploy]# ceph osd out osd.1
marked out osd.1.
[root@ceph-node01 ceph-deploy]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.78142 root default
-3 0.19537 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
 3 hdd 0.09769 osd.3 up 1.00000 1.00000
-5 0.19537 host ceph-node02
 1 hdd 0.09769 osd.1 down 0 1.00000
 4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.39067 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
 5 hdd 0.09769 osd.5 up 1.00000 1.00000
 6 hdd 0.09769 osd.6 up 1.00000 1.00000
[root@ceph-node01 ceph-deploy]#

You can see that the REWEIGHT has dropped from 1.00000 to 0. When replacing a disk, it is best to wait until the data rebalancing finishes before actually pulling the disk. You can watch the rebalancing progress with:

[root@ceph-node01 ceph-deploy]# watch -n1 'ceph -s'

After the out above, the OSD is still in the CRUSH map, which you can confirm with the following command:

[root@ceph-node01 ceph-deploy]# ceph osd crush dump

2. Remove the OSD from the CRUSH map

[root@ceph-node01 ceph-deploy]# ceph osd crush rm osd.1
removed item id 1 name 'osd.1' from crush map
[root@ceph-node01 ceph-deploy]#

3. It is still in the OSD map, so remove it from the OSD map as well

[root@ceph-node01 ceph-deploy]# ceph osd rm osd.1
removed osd.1
[root@ceph-node01 ceph-deploy]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.68373 root default
-3 0.19537 host ceph-node01
 0 hdd 0.09769 osd.0 up 1.00000 1.00000
 3 hdd 0.09769 osd.3 up 1.00000 1.00000
-5 0.09769 host ceph-node02
 4 hdd 0.09769 osd.4 up 1.00000 1.00000
-7 0.39067 host ceph-node03
 2 hdd 0.19530 osd.2 up 1.00000 1.00000
 5 hdd 0.09769 osd.5 up 1.00000 1.00000
 6 hdd 0.09769 osd.6 up 1.00000 1.00000
[root@ceph-node01 ceph-deploy]#

4. Verify the removal

[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 26h)
    mgr: ceph-node01(active, since 22h), standbys: ceph-node02, ceph-node03
    osd: 6 osds: 6 up (since 13m), 6 in (since 10m)
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 224 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 14 GiB used, 686 GiB / 700 GiB avail
    pgs: 224 active+clean

[root@ceph-node01 ceph-deploy]#

ceph -s also shows that the OSD has been removed.

5. There is still an auth entry left for it

[root@ceph-node01 ceph-deploy]# ceph auth list
installed auth entries:

osd.0
  key: AQBFDnlfXyOsChAAjKIfUGiDXb6kGs826LkOsA==
  caps: [mgr] allow profile osd
  caps: [mon] allow profile osd
  caps: [osd] allow *
osd.1
  key: AQB7NXNfFEIDChAAbseU3a/rbRgW8UMJiHRikQ==
  caps: [mgr] allow profile osd
  caps: [mon] allow profile osd
  caps: [osd] allow *
。。。
[root@ceph-node01 ceph-deploy]#

6. Delete the auth key

[root@ceph-node01 ceph-deploy]# ceph auth rm osd.1
updated
[root@ceph-node01 ceph-deploy]#

We delete it because when we add a disk later, the new OSD may well be assigned the ID osd.1 again.

Data Consistency

Official docs: https://docs.ceph.com/en/latest/architecture/#data-consistency

DATA CONSISTENCY

As part of maintaining data consistency and cleanliness, Ceph OSDs can also scrub objects within placement groups. That is, Ceph OSDs can compare object metadata in one placement group with its replicas in placement groups stored in other OSDs. Scrubbing (usually performed daily) catches OSD bugs or filesystem errors. OSDs can also perform deeper scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a disk that weren’t apparent in a light scrub.

SCRUBBING

In addition to making multiple copies of objects, Ceph ensures data integrity by scrubbing placement groups. Ceph scrubbing is analogous to fsck on the object storage layer. For each placement group, Ceph generates a catalog of all objects and compares each primary object and its replicas to ensure that no objects are missing or mismatched. Light scrubbing (daily) checks the object size and attributes. Deep scrubbing (weekly) reads the data and uses checksums to ensure data integrity.

Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations.

Official docs: https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#scrubbing

By default, scrubbing is issued at the PG level; you can get a PG ID from ceph pg dump, for example:
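A couple of ways to pick a PG ID (ceph-demo is the pool used earlier in this series; pgs_brief keeps the output short):

ceph pg dump pgs_brief | head                     # PGID, state, up/acting OSD sets
ceph pg ls-by-pool ceph-demo | head               # only the PGs belonging to one pool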

1. Light scrub

[root@ceph-node01 ~]# ceph pg scrub 5.1b
instructing pg 5.1b on osd.2 to scrub
[root@ceph-node01 ~]#

2. Deep scrub

[root@ceph-node01 ~]# ceph pg deep-scrub 5.1b
instructing pg 5.1b on osd.2 to deep-scrub
[root@ceph-node01 ~]#

You can watch the scrub progress with watch -n1 'ceph -s'.

Monitoring the Ceph Cluster

Official docs: https://docs.ceph.com/en/latest/rados/operations/monitoring/

There are two ways to monitor the cluster: enter the interactive shell by running ceph on its own, or run ceph followed by arguments.

[root@ceph-node01 ceph]# ceph
ceph> status
  cluster:
。。。
    pgs: 320 active+clean

ceph> df
RAW STORAGE:
    CLASS SIZE AVAIL USED RAW USED %RAW USED
    hdd 400 GiB 390 GiB 6.7 GiB 9.7 GiB 2.43
    TOTAL 400 GiB 390 GiB 6.7 GiB 9.7 GiB 2.43

POOLS:
    POOL ID STORED OBJECTS USED %USED MAX AVAIL
    ceph-demo 1 1.1 GiB 320 2.3 GiB 0.61 184 GiB
    。。。
    default.rgw.buckets.data 7 1.5 GiB 440 4.4 GiB 1.19 122 GiB

ceph>

1. Check the cluster status: ceph -s

[root@ceph-node01 ceph]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_WARN
            too many PGs per OSD (277 > max 250)

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 3d)
    mgr: ceph-node01(active, since 6d), standbys: ceph-node02, ceph-node03
    osd: 3 osds: 3 up (since 15m), 3 in (since 5d)
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 320 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 9.7 GiB used, 390 GiB / 400 GiB avail
    pgs: 320 active+clean

  io:
    client: 4.0 KiB/s rd, 0 B/s wr, 3 op/s rd, 2 op/s wr

[root@ceph-node01 ceph]#

2. Watch the cluster status continuously with ceph -w, which is very useful during troubleshooting.

3. Cluster utilization: ceph df, ceph osd df, rados df

[root@ceph-node01 ~]# ceph df
RAW STORAGE:
    CLASS SIZE AVAIL USED RAW USED %RAW USED
    hdd 700 GiB 686 GiB 8.3 GiB 14 GiB 2.05
    TOTAL 700 GiB 686 GiB 8.3 GiB 14 GiB 2.05

POOLS:
    POOL ID STORED OBJECTS USED %USED MAX AVAIL
    ceph-demo 1 1.1 GiB 639 3.4 GiB 0.53 213 GiB
    .rgw.root 2 1.2 KiB 4 768 KiB 0 213 GiB
    default.rgw.control 3 0 B 8 0 B 0 213 GiB
    default.rgw.meta 4 4.4 KiB 16 2.8 MiB 0 213 GiB
    default.rgw.log 5 0 B 207 0 B 0 213 GiB
    default.rgw.buckets.index 6 0 B 6 0 B 0 213 GiB
    default.rgw.buckets.data 7 1.5 GiB 440 4.4 GiB 0.69 213 GiB
[root@ceph-node01 ~]# ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
 0 hdd 0.09769 1.00000 100 GiB 2.4 GiB 1.4 GiB 102 KiB 1024 MiB 98 GiB 2.42 1.18 94 up
 3 hdd 0.09769 1.00000 100 GiB 2.4 GiB 1.4 GiB 0 B 1 GiB 98 GiB 2.36 1.15 130 up
 4 hdd 0.09769 1.00000 100 GiB 3.7 GiB 2.7 GiB 0 B 1 GiB 96 GiB 3.69 1.80 224 up
 2 hdd 0.19530 1.00000 200 GiB 2.6 GiB 1.6 GiB 74 KiB 1024 MiB 197 GiB 1.32 0.64 112 up
 5 hdd 0.09769 1.00000 100 GiB 1.5 GiB 555 MiB 0 B 1 GiB 98 GiB 1.54 0.75 57 up
 6 hdd 0.09769 1.00000 100 GiB 1.7 GiB 707 MiB 0 B 1 GiB 98 GiB 1.69 0.83 55 up
                    TOTAL 700 GiB 14 GiB 8.3 GiB 177 KiB 6.0 GiB 686 GiB 2.05
MIN/MAX VAR: 0.64/1.80 STDDEV: 0.80
[root@ceph-node01 ~]# rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.rgw.root 768 KiB 4 0 12 0 0 0 66 66 KiB 4 4 KiB 0 B 0 B
ceph-demo 3.4 GiB 639 0 1917 0 0 0 5498 20 MiB 1159 2.5 GiB 0 B 0 B
default.rgw.buckets.data 4.4 GiB 440 0 1320 0 0 0 9558 7.0 MiB 25280 1.5 GiB 0 B 0 B
default.rgw.buckets.index 0 B 6 0 18 0 0 0 29194 31 MiB 14470 7.1 MiB 0 B 0 B
default.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
default.rgw.log 0 B 207 0 621 0 0 0 344445 336 MiB 229532 2 KiB 0 B 0 B
default.rgw.meta 2.8 MiB 16 0 48 0 0 0 651 549 KiB 226 80 KiB 0 B 0 B

total_objects 1001
total_used 14 GiB
total_avail 686 GiB
total_space 700 GiB
[root@ceph-node01 ~]#


1. ceph osd stat - check OSD status

2. ceph osd tree - view the OSD tree (OSD map layout)

3. ceph osd dump - view detailed OSD map information

4. ceph osd df - view OSD utilization

5. ceph osd pool autoscale-status - view pool capacity ratios and the PG autoscaler status


[root@ceph-node01 ~]# ceph osd pool autoscale-status
Error ENOTSUP: Module 'pg_autoscaler' is not enabled (required by command 'osd pool autoscale-status'): use `ceph mgr module enable pg_autoscaler` to enable it
[root@ceph-node01 ~]# ceph mgr module enable pg_autoscaler
[root@ceph-node01 ~]# ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
ceph-demo 1154M 3.0 399.9G 0.0085 1.0 128 32 warn
default.rgw.meta 4501 3.0 399.9G 0.0000 1.0 32 warn
default.rgw.buckets.index 0 3.0 399.9G 0.0000 1.0 32 warn
default.rgw.control 0 3.0 399.9G 0.0000 1.0 32 warn
default.rgw.buckets.data 1503M 3.0 399.9G 0.0110 1.0 32 warn
.rgw.root 1245 3.0 399.9G 0.0000 1.0 32 warn
default.rgw.log 0 3.0 399.9G 0.0000 1.0 32 warn
[root@ceph-node01 ~]#


1. ceph mon stat - check MON status

2. ceph mon dump - view detailed MON map information

3. ceph quorum_status - view the quorum/election status

    ceph quorum_status --format json-pretty


1. ceph mds stat - check MDS status

2. ceph fs dump - view detailed file system information

1. Query a daemon through its admin socket (the help command lists what is available)

[root@ceph-node01 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok help

2. View the daemon's configuration parameters

[root@ceph-node01 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config show | more
{
    "name": "mon.ceph-node01",
    "cluster": "ceph",
    "admin_socket": "/var/run/ceph/ceph-mon.ceph-node01.asok",
    "admin_socket_mode": "",
    "auth_client_required": "cephx",
    "auth_cluster_required": "cephx",
    "auth_debug": "false",
    "auth_mon_ticket_ttl": "43200.000000",
。。。
[root@ceph-node01 ~]#

Specifying the socket file with --admin-daemon is the most low-level way to inspect a daemon's configuration.

3. Parameters can also be tuned through the socket

[root@ceph-node01 ceph-deploy]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config set mon_max_pg_per_osd 500
{
    "success": ""
}
[root@ceph-node01 ceph-deploy]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config show|grep mon_max_pg_per_osd
    "mon_max_pg_per_osd": "500",
[root@ceph-node01 ceph-deploy]#

Be aware that a change made this way is lost once the daemon restarts.
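If the change should survive restarts, either put it into ceph.conf or, on Nautilus and later, store it in the centralized configuration database; a hedged sketch of both:

# ceph.conf approach (takes effect after the daemon restarts):
# [mon]
# mon_max_pg_per_osd = 500
# centralized config DB (Nautilus+):
ceph config set mon mon_max_pg_per_osd 500
ceph config get mon.ceph-node01 mon_max_pg_per_osd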

Ceph Pool Management

POOLS

Pools are logical partitions for storing objects.

When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data. A pool provides you with:

  • Resilience: You can set how many OSD are allowed to fail without losing data. For replicated pools, it is the desired number of copies/replicas of an object. A typical configuration stores an object and one additional copy (i.e., size = 2), but you can determine the number of copies/replicas. For erasure coded pools, it is the number of coding chunks (i.e. m=2 in the erasure code profile).

  • Placement Groups: You can set the number of placement groups for the pool. A typical configuration uses approximately 100 placement groups per OSD to provide optimal balancing without using up too many computing resources. When setting up multiple pools, be careful to ensure you set a reasonable number of placement groups for both the pool and the cluster as a whole.

  • CRUSH Rules: When you store data in a pool, placement of the object and its replicas (or chunks for erasure coded pools) in your cluster is governed by CRUSH rules. You can create a custom CRUSH rule for your pool if the default rule is not appropriate for your use case.

  • Snapshots: When you create snapshots with ceph osd pool mksnap, you effectively take a snapshot of a particular pool.

1. Create and list pools

[root@ceph-node01 ~]# ceph osd pool create pool-demo 16 16
pool 'pool-demo' created
[root@ceph-node01 ~]# ceph osd lspools
1 ceph-demo
2 .rgw.root
3 default.rgw.control
4 default.rgw.meta
5 default.rgw.log
6 default.rgw.buckets.index
7 default.rgw.buckets.data
8 pool-demo
[root@ceph-node01 ~]#

Note that many other parameters are available; see the official docs: https://docs.ceph.com/en/latest/rados/operations/erasure-code/

2. Common pool operations

[root@ceph-node01 ~]# ceph
ceph> osd pool get pool-demo size
size: 3

ceph> osd pool get pool-demo pg_num
pg_num: 16

ceph> osd pool set pool-demo pg_num 128
set pool 8 pg_num to 128
ceph> osd pool set pool-demo pgp_num 128
set pool 8 pgp_num to 128
ceph>

After a pool is created, it needs to be classified with an application tag; there are three main types: rbd, rgw, and cephfs.

3. Enable an application type on the pool

[root@ceph-node01 ~]# ceph osd pool application enable pool-demo rbd
enabled application 'rbd' on pool 'pool-demo'
[root@ceph-node01 ~]#

4. View the pool's application type

[root@ceph-node01 ~]# ceph osd pool application get pool-demo
{
    "rbd": {}
}
[root@ceph-node01 ~]#

The association is now in place. Note that if you create a pool without associating an application type, you will get an error when you try to use it.

5. Set a quota on the pool

[root@ceph-node01 ~]# ceph osd pool set-quota pool-demo max_objects 100
set-quota max_objects = 100 for pool pool-demo
[root@ceph-node01 ~]#

6. View the pool quota

[root@ceph-node01 ~]# ceph osd pool get-quota pool-demo
quotas for pool 'pool-demo':
  max objects: 100 objects
  max bytes : N/A
[root@ceph-node01 ~]#

7. You can also take snapshots of a pool, for example:
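A short example; the snapshot name is arbitrary, and pool snapshots cannot be mixed with the self-managed snapshots that RBD images use:

ceph osd pool mksnap pool-demo snap-demo          # take a snapshot of the whole pool
rados -p pool-demo lssnap                         # list the pool's snapshots
ceph osd pool rmsnap pool-demo snap-demo          # remove the snapshot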

8. Delete a pool

[root@ceph-node01 ~]# ceph osd pool delete pool-demo pool-demo --yes-i-really-really-mean-it
Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool
[root@ceph-node01 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config show |grep mon_allow_pool_delete
    "mon_allow_pool_delete": "false",
[root@ceph-node01 ~]#

Deleting a pool requires the monitors to allow it (mon_allow_pool_delete), and the pool name has to be typed twice.

[root@ceph-node01 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node01.asok config set mon_allow_pool_delete true
{
    "success": "mon_allow_pool_delete = 'true' "
}
[root@ceph-node01 ~]# ceph osd pool delete pool-demo pool-demo --yes-i-really-really-mean-it
pool 'pool-demo' removed
[root@ceph-node01 ~]#

PG Management

Official docs: https://docs.ceph.com/en/latest/rados/operations/placement-groups/

AUTOSCALING PLACEMENT GROUPS

Placement groups (PGs) are an internal implementation detail of how Ceph distributes data. You can allow the cluster to either make recommendations or automatically tune PGs based on how the cluster is used by enabling pg-autoscaling.

Each pool in the system has a pg_autoscale_mode property that can be set to off, on, or warn.

  • off: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to Choosing the number of Placement Groups for more information.

  • on: Enable automated adjustments of the PG count for the given pool.

  • warn: Raise health alerts when the PG count should be adjusted

To set the autoscaling mode for existing pools,:

ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example to enable autoscaling on pool foo,:

ceph osd pool set foo pg_autoscale_mode on

You can also configure the default pg_autoscale_mode that is applied to any pools that are created in the future with:

ceph config set global osd_pool_default_pg_autoscale_mode <mode>

HOW ARE PLACEMENT GROUPS USED?

A placement group (PG) aggregates objects within a pool because tracking object placement and object metadata on a per-object basis is computationally expensive–i.e., a system with millions of objects cannot realistically track placement on a per-object basis.

The Ceph client will calculate which placement group an object should be in. It does this by hashing the object ID and applying an operation based on the number of PGs in the defined pool and the ID of the pool. See Mapping PGs to OSDs for details.

The object’s contents within a placement group are stored in a set of OSDs. For instance, in a replicated pool of size two, each placement group will store objects on two OSDs, as shown below.

Should OSD #2 fail, another will be assigned to Placement Group #1 and will be filled with copies of all objects in OSD #1. If the pool size is changed from two to three, an additional OSD will be assigned to the placement group and will receive copies of all objects in the placement group.

Placement groups do not own the OSD; they share it with other placement groups from the same pool or even other pools. If OSD #2 fails, the Placement Group #2 will also have to restore copies of objects, using OSD #3.

When the number of placement groups increases, the new placement groups will be assigned OSDs. The result of the CRUSH function will also change and some objects from the former placement groups will be copied over to the new Placement Groups and removed from the old ones.

The more PGs there are, the more widely the data is spread across different OSDs, and the safer the data is.

PGs serve two purposes: (1) they determine how data is distributed, and (2) they make the placement computation efficient. Without PGs, every file would be split into many objects and each object would have to be mapped to an OSD individually; since there are huge numbers of objects, having Ceph compute a location for every one of them would be very expensive. Instead, objects are grouped into PGs, and when Ceph runs the CRUSH algorithm it only has to compute which OSDs a PG lands on, which greatly speeds up the calculation while still determining the data placement.
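You can see this mapping for yourself with ceph osd map, which prints the PG and the OSD set a given object name would land on (the object name below is arbitrary and does not need to exist yet):

ceph osd map ceph-demo test-object                # shows pool id, PG id, and the up/acting OSD set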

CHOOSING THE NUMBER OF PLACEMENT GROUPS

If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution. If you have less than 50 OSDs, choosing among the preselection above is best. For a single pool of objects, you can use the following formula to get a baseline:

Total PGs = OSDs * 100 / pool size

Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).

You should then check if the result makes sense with the way you designed your Ceph cluster to maximize data durability, object distribution and minimize resource usage.

The result should always be rounded up to the nearest power of two.

Only a power of two will evenly balance the number of objects among placement groups. Other values will result in an uneven distribution of data across your OSDs. Their use should be limited to incrementally stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows

200 * 100 / 3 = 6667. Nearest power of 2: 8192

When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.

For instance a cluster of 10 pools each with 512 placement groups on ten OSDs is a total of 5,120 placement groups spread over ten OSDs, that is 512 placement groups per OSD. That does not use too many resources. However, if 1,000 pools were created with 512 placement groups each, the OSDs will handle ~50,000 placement groups each and it would require significantly more resources and time for peering.

You may find the PGCalc tool helpful: https://ceph.com/pgcalc/

A PRESELECTION OF PG_NUM (use the preselected pg_num values when you have fewer than 50 OSDs)

When creating a new pool with:

ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of pg_num. If you do not specify pg_num, the cluster can (by default) automatically tune it for you based on how much data is stored in the pool (see above, Autoscaling placement groups).

Alternatively, pg_num can be explicitly provided. However, whether you specify a pg_num value or not does not affect whether the value is automatically tuned by the cluster after the fact. To enable or disable auto-tuning,:

ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The “rule of thumb” for PGs per OSD has traditionally been 100.

With the addition of the balancer (which is also enabled by default), a value of more like 50 PGs per OSD is probably reasonable. The challenge (which the autoscaler normally does for you) is to:

  • have the PGs per pool proportional to the data in the pool, and

  • end up with 50-100 PGs per OSD, after the replication or erasure-coding fan-out of each PG across OSDs is taken into consideration.

The number of placement groups is set by the administrator when the pool is created, and CRUSH then creates and uses them. (1) In general, the PG count should represent a reasonable granularity of the data; for example, a pool with 256 PGs means each PG holds roughly 1/256 of the pool's data. (2) The PG count affects performance whenever PGs have to be moved from one OSD to another: with too few PGs, Ceph must move a large amount of data at once, and the resulting network load hurts the cluster's normal performance; with too many PGs, Ceph consumes excessive CPU and RAM even when moving tiny amounts of data, which strains the cluster's compute resources. (3) The PG count also plays an important role when the cluster distributes and rebalances data: durable storage and even distribution across all OSDs call for a fair number of PGs, but the count should be kept to the minimum needed for peak performance in order to save CPU and memory. A RADOS cluster usually contains several pools, so the administrator also has to consider how many PGs each OSD ends up mapping once the PGs of all pools are distributed.

Depending on the work a PG is currently doing and the stage its processes are in, it is always in one or more states. The most common state is "active+clean"; Peering is another important one.

Active: the primary OSD and the secondary OSDs are all ready and can serve client I/O requests normally; a PG generally enters the Active state once the Peering process completes.

Clean: the primary OSD and the secondary OSDs are ready, all object replicas match the desired count, and the PG's Acting Set and Up Set are the same group of OSDs. Acting Set: the PG's current primary OSD plus all active secondary OSDs; this set of OSDs performs the I/O for the PG's data objects. Up Set: because of how CRUSH works, changes in the cluster topology may remap or extend a PG to other OSDs; this new OSD set is called the PG's Up Set, and it may partially overlap the original set or be entirely different. The Up Set OSDs copy data objects from the current Acting Set OSDs, and once all objects are synchronized the Up Set becomes the new Acting Set and the PG returns to the Active state.

Peering: all of the OSDs in a PG must agree on the state of the objects they hold; Peering is the process that brings those OSDs from an inconsistent view to a consistent one, an intermediate step of data replication.

Degraded: when an OSD is marked down, all PGs mapped to it enter the degraded state; after that OSD restarts and completes Peering, the PGs return to Clean. Once an OSD has been marked down for longer than the configured interval (mon_osd_down_out_interval, 600 seconds by default), it is marked out of the cluster, and Ceph then starts recovery on the degraded PGs until they are all clean again. A PG is also marked degraded when one of its objects becomes unavailable or is silently corrupted, until the object is correctly restored from an authoritative replica.

Stale: every primary OSD must periodically report the latest statistics for all PGs it is primary for to the monitors. If for any reason a primary OSD fails to send these reports, or other OSDs report that the primary OSD is down, all PGs whose primary it is are immediately marked stale; while a PG is in this state, a new primary OSD is elected for it.

Undersized: a PG enters the Undersized state when it has fewer replicas than the pool's configured size; recovery and backfill operations then start to bring the replica count back to the desired value.

Scrubbing: OSDs also periodically check the integrity of the data objects they hold to make sure the data is consistent across peer OSDs. A PG undergoing such a check is marked Scrubbing; this is commonly called a light scrub, shallow scrub, or simply scrub. In addition, a PG occasionally needs a deep scrub, a bit-for-bit consistency check that makes sure the same object matches bit-for-bit on all of the related OSDs; during that check the PG is in the scrubbing+deep state.

Recovering: when a new OSD is added to the cluster, or an OSD goes down, CRUSH may remap PGs so that they are held by a different set of OSDs; PGs that are synchronizing data internally because of such a remap are marked Recovering.

Backfilling: after a new OSD joins the cluster, Ceph rebalances the data, moving some objects in the background from existing OSDs to the new one; this process is the backfill operation, during which the PGs involved show the Backfilling state.
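To inspect these states on a live cluster, the following read-only commands are handy; the PG ID used here is the one that appears in the repair example below:

ceph pg stat                                      # one-line summary of all PG states
ceph pg dump_stuck unclean                        # PGs stuck in a non-clean state, if any
ceph pg 1.13 query | less                         # full peering/state detail for a single PG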

[root@ceph-node01 ~]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_ERR
            2 scrub errors
            Possible data damage: 2 pgs inconsistent

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 3d)
    mgr: ceph-node01(active, since 3d), standbys: ceph-node02, ceph-node03
    osd: 6 osds: 6 up (since 2d), 6 in (since 2d)
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 224 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 15 GiB used, 685 GiB / 700 GiB avail
    pgs: 222 active+clean
             2 active+clean+inconsistent

[root@ceph-node01 ~]#
[root@ceph-node01 ceph-deploy]# ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
    pg 1.13 is active+clean+inconsistent, acting [3,6,4]
    pg 1.1c is active+clean+inconsistent, acting [6,4,3]
[root@ceph-node01 ceph-deploy]# ceph pg repair 1.13
instructing pg 1.13 on osd.3 to repair
[root@ceph-node01 ceph-deploy]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 1.1c is active+clean+inconsistent, acting [6,4,3]
[root@ceph-node01 ceph-deploy]# ceph pg repair 1.1c
instructing pg 1.1c on osd.6 to repair
[root@ceph-node01 ceph-deploy]# ceph health detail
HEALTH_OK
[root@ceph-node01 ceph-deploy]# ceph -s
  cluster:
    id: cc10b0cb-476f-420c-b1d6-e48c1dc929af
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-node01,ceph-node02,ceph-node03 (age 3d)
    mgr: ceph-node01(active, since 3d), standbys: ceph-node02, ceph-node03
    osd: 6 osds: 6 up (since 2d), 6 in (since 2d)
    rgw: 1 daemon active (ceph-node01)

  task status:

  data:
    pools: 7 pools, 224 pgs
    objects: 1.00k objects, 2.6 GiB
    usage: 15 GiB used, 685 GiB / 700 GiB avail
    pgs: 224 active+clean

[root@ceph-node01 ceph-deploy]#

Summary

This article summarized the common Ceph operations: daemon management, log file locations, viewing cluster status information, pools, placement group management, OSD expansion, data rebalancing, calculating PG counts, and so on.

ceph -s is the most commonly used command for cluster information. It shows the cluster ID, the overall health, the monitor map epoch and the quorum status, the OSD map epoch and the OSD states, the PG map version, the numbers of placement groups and pools, the notional amount of data stored and the number of objects, the total amount of raw data stored, and so on.

ceph pg stat: view PG status information

[root@ceph-node01 ~]# ceph pg stat
288 pgs: 288 active+clean; 9.8 GiB data, 29 GiB used, 764 GiB / 800 GiB avail
[root@ceph-node01 ~]#

ceph osd pool stats [pool name]: view information about a pool

[root@ceph-node01 ~]# ceph osd pool stats ceph-demo
pool ceph-demo id 1
  nothing is going on

[root@ceph-node01 ~]#

ceph df: show cluster-wide space usage

[root@ceph-node01 ~]# ceph df
RAW STORAGE: # overview of raw storage usage
    CLASS SIZE AVAIL USED RAW USED %RAW USED
    hdd 400 GiB 368 GiB 29 GiB 32 GiB 8.01
    ssd 400 GiB 396 GiB 358 MiB 4.3 GiB 1.09
    TOTAL 800 GiB 764 GiB 29 GiB 36 GiB 4.55

POOLS: # list of pools and each pool's notional usage
    POOL ID STORED OBJECTS USED %USED MAX AVAIL
    ceph-demo 1 8.0 GiB 2.13k 8.0 GiB 2.32 112 GiB
    .rgw.root 2 1.2 KiB 4 1.2 KiB 0 112 GiB
    default.rgw.control 3 0 B 8 0 B 0 112 GiB
    default.rgw.meta 4 5.6 KiB 20 5.6 KiB 0 112 GiB
    default.rgw.log 5 0 B 207 0 B 0 112 GiB
    default.rgw.buckets.index 6 18 KiB 8 18 KiB 0 112 GiB
    default.rgw.buckets.data 7 1.5 GiB 440 1.5 GiB 0.43 112 GiB
    cephfs_metadata 9 501 KiB 22 501 KiB 0 112 GiB
    cephfs_data 10 9.1 KiB 100 9.1 KiB 0 112 GiB
    kubernetes 11 315 MiB 138 315 MiB 0.09 112 GiB
[root@ceph-node01 ~]#

ceph df detail: show more detailed usage information

ceph osd stat - show OSD status

ceph osd dump - show the detailed OSD map

ceph osd tree - show the OSDs in the CRUSH map, their host mapping, weights, and so on

ceph mon stat - show the monitor map (summary)

ceph mon dump - show the monitor map (details)

ceph quorum_status - when the cluster has more than one MON, show the quorum/election result

Ceph's admin socket interface is often used to query a daemon. The sockets are kept in /var/run/ceph/ by default. Command format: ceph --admin-daemon /var/run/ceph/${socket-name} help

ceph --admin-daemon /var/run/ceph/${socket-name} version

ceph --admin-daemon /var/run/ceph/${socket-name} status


Before stopping: tell the Ceph cluster not to mark OSDs out (ceph osd set noout)

Stop order: storage clients -- gateway (rgw) -- metadata server (mds) -- osd -- mgr -- mon

Start order: mon -- mgr -- osd -- mds -- rgw -- storage clients

After starting: ceph osd unset noout
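Putting the planned-maintenance flow together, a hedged sketch using the systemd targets introduced at the start of this article (run the stop commands on every node, following the cluster-wide order above):

ceph osd set noout                                # before shutting anything down
systemctl stop ceph-radosgw.target                # per node: rgw
systemctl stop ceph-mds.target                    # mds
systemctl stop ceph-osd.target                    # osd
systemctl stop ceph-mgr.target                    # mgr
systemctl stop ceph-mon.target                    # mon last
# after powering back on, start in the reverse order (mon first), then:
ceph osd unset noout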


1. The configuration file is laid out in .ini format; lines starting with # or ; are comments.

2. There are global settings ([global]), OSD-wide settings, and client-wide settings, and you can also configure individual OSDs or MONs at a finer granularity (see the sketch after this list).

3. The order in which a Ceph client looks for its configuration: the $CEPH_CONF environment variable, a path given with -c, /etc/ceph/ceph.conf, ~/.ceph/config, ./ceph.conf (the current directory)
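A minimal illustrative ceph.conf matching the points above; every value is only an example, and the fsid and mon hosts must match your own cluster:

# /etc/ceph/ceph.conf -- illustrative sketch only
[global]
fsid = cc10b0cb-476f-420c-b1d6-e48c1dc929af       ; this article's cluster id; use your own
mon_host = ceph-node01,ceph-node02,ceph-node03
public_network = 192.168.1.0/24                   ; placeholder subnet
[osd]
osd_max_backfills = 1
[client]
rbd_cache = true
[mon.ceph-node01]
; per-daemon overrides go in sections like this one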