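Ceph Cluster Deployment

This deployment uses, and later modifies, the ceph-ansible-4.0 tool. Because all of our internal machines can reach the Internet, I use the online installation method and pull packages directly from the Alibaba Cloud (Aliyun) Ceph mirror. The goal is a Ceph Nautilus (14) cluster, which the later operations and experiments run on.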

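1. Initialize the cluster nodes

This step is a simple playbook that does the following:

Copy /etc/hosts from the master node to every other host;
Rename each host according to its entry in /etc/hosts;
Create a store user and a store group on every host, then have the jump host copy its key to every host for passwordless login;
If all machines have Internet access, add a cron job on every machine that runs once a minute and syncs the clock against Alibaba Cloud, so that clocks across the cluster stay consistent.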

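2. Introduction to ceph-ansible

ceph-ansible is the official Ceph tool for automating Ceph cluster deployment; see the official documentation. Read the notes there carefully, because there is a pitfall that is easy to fall into:

Note: I once made a mistake during deployment that cost me most of a day. It was mid-2019; the latest ceph-ansible was 4.x and the latest Ceph release was Nautilus. I used the master branch to deploy Ceph Mimic, and it failed every time. After a long struggle I realized that Mimic should be deployed with ceph-ansible 3.2. A painful lesson!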

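In addition, the official documentation describes the following ways to deploy a Ceph cluster:

cephadm (recommended): containerized deployment of a Ceph cluster;
ceph-ansible: an Ansible-based deployment tool, widely used;
ceph-deploy: no longer maintained. It does not support releases after Ceph 14, nor CentOS 8;
deepsea: deploys Ceph with SaltStack, rarely used;
Manual deployment is also supported and teaches you more. Interested in giving it a try?

ceph-ansible is simple to use: following the official documentation, list the machines that will run mon, mgr, rgw and the other services in an inventory file. For example: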

[store@master ceph-ansible-4.0.20-spyinx]$ cat hosts
[mons]
ceph-[1:3]

[mgrs]
ceph-[1:3]

[osds]
ceph-[1:3]

[rgws]
ceph-[1:2]
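The `ceph-[1:3]` entries use Ansible's inclusive numeric range syntax. As a rough illustration of which hosts each group ends up containing, here is a minimal sketch; `expand_host_range` is a hypothetical helper written for this post, not part of Ansible, and it only handles a single numeric range:

```python
import re


def expand_host_range(pattern):
    """Expand one inclusive numeric range, e.g. 'ceph-[1:3]', into host names.

    Hypothetical helper for illustration only; Ansible's own parser
    supports more forms (alphabetic ranges, strides, several ranges).
    """
    m = re.fullmatch(r"(.*)\[(\d+):(\d+)\](.*)", pattern)
    if m is None:
        # No range in the entry: it names a single host.
        return [pattern]
    prefix, start, end, suffix = m.groups()
    # Keep leading zeros if the range was written with them, e.g. [01:03].
    width = len(start) if start.startswith("0") else 0
    return [
        prefix + str(i).zfill(width) + suffix
        for i in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    for group, entry in [("mons", "ceph-[1:3]"), ("rgws", "ceph-[1:2]")]:
        print(group, expand_host_range(entry))
```

With the inventory above, the mons, mgrs and osds groups each contain ceph-1, ceph-2 and ceph-3, while rgws contains only ceph-1 and ceph-2.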

If you only want to use ceph-ansible-4.0 as is, adjusting the relevant settings in group_vars/all.yml is enough:

[store@master ceph-ansible-4.0.20-spyinx]$ cat group_vars/all.yml
---
############ Package repository settings ###############
ceph_origin: repository
ceph_repository: community
ceph_mirror: https://mirrors.aliyun.com/ceph/
ceph_stable_key: "{{ ceph_mirror }}/keys/release.asc"
ceph_stable_release: nautilus
ceph_stable_repo: "{{ ceph_mirror }}/debian-{{ ceph_stable_release }}"
ceph_stable_distro_source: bionic

cephx: "true"

############ Very important parameters ###################
public_network: 192.168.26.0/24
cluster_network: 192.168.26.0/24
mon_host: 192.168.26.120
mon_initial_members: ceph-1,ceph-2,ceph-3
monitor_interface: ens33
#########################################

rbd_cache: "true"
rbd_cache_writethrough_until_flush: "true"
rbd_concurrent_management_ops: 21
rbd_client_directories: true

############ Parameters required to create OSDs ###############
osd_objectstore: bluestore
devices:
  - '/dev/sdb'
  - '/dev/sdc'
  - '/dev/sdd'
osd_scenario: lvm
#########################################

mds_max_mds: 1
radosgw_frontend_type: beast
radosgw_thread_pool_size: 512
radosgw_interface: "{{ monitor_interface }}"
email_address: 2894577759@qq.com
dashboard_enabled: False
dashboard_protocol: http
dashboard_port: 8443
dashboard_admin_user: admin
dashboard_admin_password: admin@123!
grafana_admin_user: admin
grafana_admin_password: admin
grafana_uid: 472
grafana_datasource: Dashboard
grafana_dashboard_version: nautilus
grafana_port: 3000
grafana_allow_embedding: True
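The network settings marked as very important are easy to get wrong by a digit. A quick sanity check, sketched here with Python's standard `ipaddress` module, is to confirm that the monitor address lies inside `public_network`. The helper name `in_network` is an assumption made for this example, and the values are copied from the file above:

```python
import ipaddress


def in_network(host_ip, cidr):
    """Return True if host_ip belongs to the network given in CIDR form."""
    return ipaddress.ip_address(host_ip) in ipaddress.ip_network(cidr)


if __name__ == "__main__":
    public_network = "192.168.26.0/24"   # from group_vars/all.yml
    mon_host = "192.168.26.120"          # from group_vars/all.yml
    if in_network(mon_host, public_network):
        print("mon_host is inside public_network")
    else:
        print("mon_host is NOT inside public_network, check all.yml")
```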

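Then just run the installation; as long as the network is healthy there should be few problems. Note that on an internal network you can turn the firewall off. By default ceph-ansible brings the firewall up on every run, which sometimes gets in the way, so configure_firewall is set to False here: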

[store@master ceph-ansible-4.0.20-spyinx]$ cat roles/ceph-defaults/defaults/main.yml | grep configure_firewall
# If configure_firewall is true, then ansible will try to configure the
configure_firewall: False

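One behavior of ceph-ansible does not fit our business needs, so its code has to be modified. See the next section.

3. Modifying ceph-ansible-4.0

We modify ceph-ansible-4.0 mainly to control the order of OSD IDs, so that the OSDs created on one machine get contiguous IDs. In addition, our storage nodes are numbered 1 to 3, and the OSD IDs must be derived from that node number. For example, with three virtual machines named ceph-1 to ceph-3, each running three OSD daemons on /dev/sdb, /dev/sdc and /dev/sdd, the OSD IDs on the three machines must be: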

ceph-1: 0~2
ceph-2: 3~5
ceph-3: 6~8
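The table above follows a simple rule: node n owns the block of IDs that starts at (n - 1) times the number of OSDs per node. A minimal sketch of that arithmetic, where `osd_id_range` is a name chosen for this example only:

```python
def osd_id_range(node_number, osds_per_node):
    """Return the contiguous OSD IDs owned by a node (nodes count from 1)."""
    first = (node_number - 1) * osds_per_node
    return range(first, first + osds_per_node)


if __name__ == "__main__":
    for n in (1, 2, 3):
        ids = osd_id_range(n, 3)
        print("ceph-%d: %d~%d" % (n, ids[0], ids[-1]))
```

Running it prints the same three lines as the table: ceph-1: 0~2, ceph-2: 3~5, ceph-3: 6~8.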

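3.1 Code analysis

First, here is what the OSD IDs look like in a cluster built without any changes:

The three OSDs on each node do not have contiguous OSD IDs. This is because ceph-ansible creates the OSDs with the following module:

As shown, when the devices variable is set, the disks are formatted and the OSDs created in one batch, so there is no place to set an OSD ID for each individual OSD. The preceding task, however, formats the disks one at a time, and that is our entry point:

Let's first look at the ceph_volume module, located at library/ceph_volume.py: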

# ...
def run_module():
    module_args = dict(
        cluster=dict(type='str', required=False, default='ceph'),
        objectstore=dict(type='str', required=False, choices=['bluestore', 'filestore'], default='bluestore'),
        action=dict(type='str', required=False, choices=['create', 'zap', 'batch', 'prepare', 'activate', 'list', 'inventory'], default='create'),  # noqa 4502
        data=dict(type='str', required=False),
        data_vg=dict(type='str', required=False),
        journal=dict(type='str', required=False),
        journal_vg=dict(type='str', required=False),
        db=dict(type='str', required=False),
        db_vg=dict(type='str', required=False),
        wal=dict(type='str', required=False),
        wal_vg=dict(type='str', required=False),
        crush_device_class=dict(type='str', required=False),
        dmcrypt=dict(type='bool', required=False, default=False),
        batch_devices=dict(type='list', required=False, default=[]),
        osds_per_device=dict(type='int', required=False, default=1),
        journal_size=dict(type='str', required=False, default='5120'),
        block_db_size=dict(type='str', required=False, default='-1'),
        block_db_devices=dict(type='list', required=False, default=[]),
        wal_devices=dict(type='list', required=False, default=[]),
        report=dict(type='bool', required=False, default=False),
        containerized=dict(type='str', required=False, default=False),
        osd_fsid=dict(type='str', required=False),
        destroy=dict(type='bool', required=False, default=True),
    )

    module = AnsibleModule(
        argument_spec=module_args,
        supports_check_mode=True
    )

    result = dict(
        changed=False,
        stdout='',
        stderr='',
        rc='',
        start='',
        end='',
        delta='',
    )

    if module.check_mode:
        return result

    # start execution
    startd = datetime.datetime.now()

    # get the desired action
    action = module.params['action']

    # will return either the image name or None
    container_image = is_containerized()

    # Assume the task's status will be 'changed'
    changed = True

    if action == 'create' or action == 'prepare':
        # First test if the device has Ceph LVM Metadata
        rc, cmd, out, err = exec_command(module, list_osd(module, container_image))

        # list_osd returns a dict, if the dict is empty this means
        # we can not check the return code since it's not consistent
        # with the plain output
        # see: http://tracker.ceph.com/issues/36329
        # FIXME: it's probably less confusing to check for rc

        # convert out to json, ansible returns a string...
        try:
            out_dict = json.loads(out)
        except ValueError:
            fatal("Could not decode json output: {} from the command {}".format(out, cmd), module)  # noqa E501

        if out_dict:
            data = module.params['data']
            result['stdout'] = 'skipped, since {0} is already used for an osd'.format(  # noqa E501
                data)
            result['rc'] = 0
            module.exit_json(**result)

        # Prepare or create the OSD
        rc, cmd, out, err = exec_command(module, prepare_or_create_osd(module, action, container_image))

    elif action == 'activate':
        if container_image:
            fatal("This is not how container's activation happens, nothing to activate", module)  # noqa E501

        # Activate the OSD
        rc, cmd, out, err = exec_command(module, activate_osd())

    elif action == 'zap':
        # Zap the OSD
        rc, cmd, out, err = exec_command(module, zap_devices(module, container_image))

    elif action == 'list':
        # List Ceph LVM Metadata on a device
        rc, cmd, out, err = exec_command(module, list_osd(module, container_image))

    elif action == 'inventory':
        # List storage device inventory.
        rc, cmd, out, err = exec_command(module, list_storage_inventory(module, container_image))

    elif action == 'batch':
        # Batch prepare AND activate OSDs
        report = module.params.get('report', None)

        # Add --report flag for the idempotency test
        report_flags = [
            '--report',
            '--format=json',
        ]

        cmd = batch(module, container_image)
        batch_report_cmd = copy.copy(cmd)
        batch_report_cmd.extend(report_flags)

        # Run batch --report to see what's going to happen
        # Do not run the batch command if there is nothing to do
        rc, cmd, out, err = exec_command(module, batch_report_cmd)
        try:
            report_result = json.loads(out)
        except ValueError:
            strategy_changed_in_out = "strategy changed" in out
            strategy_changed_in_err = "strategy changed" in err
            strategy_changed = strategy_changed_in_out or \
                strategy_changed_in_err
            if strategy_changed:
                if strategy_changed_in_out:
                    out = json.dumps({"changed": False,
                                      "stdout": out.rstrip("\r\n")})
                elif strategy_changed_in_err:
                    out = json.dumps({"changed": False,
                                      "stderr": err.rstrip("\r\n")})
                rc = 0
                changed = False
            else:
                out = out.rstrip("\r\n")
            result = dict(
                cmd=cmd,
                stdout=out.rstrip('\r\n'),
                stderr=err.rstrip('\r\n'),
                rc=rc,
                changed=changed,
            )
            if strategy_changed:
                module.exit_json(**result)
            module.fail_json(msg='non-zero return code', **result)

        if not report:
            # if not asking for a report, let's just run the batch command
            changed = report_result['changed']
            if changed:
                # Batch prepare the OSD
                rc, cmd, out, err = exec_command(module, batch(module, container_image))
        else:
            cmd = batch_report_cmd

    else:
        module.fail_json(msg='State must either be "create" or "prepare" or "activate" or "list" or "zap" or "batch" or "inventory".', changed=False, rc=1)  # noqa E501

    endd = datetime.datetime.now()
    delta = endd - startd

    result = dict(
        cmd=cmd,
        start=str(startd),
        end=str(endd),
        delta=str(delta),
        rc=rc,
        stdout=out.rstrip('\r\n'),
        stderr=err.rstrip('\r\n'),
        changed=changed,
    )

    if rc != 0:
        module.fail_json(msg='non-zero return code', **result)

    module.exit_json(**result)
# ...

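The key part of the ceph_volume module is run_module(). Formatting a single disk and creating an OSD on it corresponds to the action value create, and the command that creates the OSD is built by the following function: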

# Location: library/ceph_volume.py
# ...
def prepare_or_create_osd(module, action, container_image):
    '''
    Prepare or create OSD devices
    '''

    # get module variables
    cluster = module.params['cluster']
    objectstore = module.params['objectstore']
    data = module.params['data']
    data_vg = module.params.get('data_vg', None)
    data = get_data(data, data_vg)
    journal = module.params.get('journal', None)
    journal_vg = module.params.get('journal_vg', None)
    db = module.params.get('db', None)
    db_vg = module.params.get('db_vg', None)
    wal = module.params.get('wal', None)
    wal_vg = module.params.get('wal_vg', None)
    crush_device_class = module.params.get('crush_device_class', None)
    dmcrypt = module.params.get('dmcrypt', None)

    # Build the CLI
    action = ['lvm', action]
    cmd = build_ceph_volume_cmd(action, container_image, cluster)
    cmd.extend(['--%s' % objectstore])
    cmd.append('--data')
    cmd.append(data)

    if journal and objectstore == 'filestore':
        journal = get_journal(journal, journal_vg)
        cmd.extend(['--journal', journal])

    if db and objectstore == 'bluestore':
        db = get_db(db, db_vg)
        cmd.extend(['--block.db', db])

    if wal and objectstore == 'bluestore':
        wal = get_wal(wal, wal_vg)
        cmd.extend(['--block.wal', wal])

    if crush_device_class:
        cmd.extend(['--crush-device-class', crush_device_class])

    if dmcrypt:
        cmd.append('--dmcrypt')

    return cmd
# ...
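To make the shape of the generated command concrete, here is a simplified, standalone sketch for the non-containerized bluestore case. `sketch_create_cmd` is a hypothetical stand-in written for this post, not the module's real function; it ignores journal, db, wal, dmcrypt and container handling, and the optional `osd_id` argument only marks the spot where an extra option could be appended:

```python
def sketch_create_cmd(data, cluster="ceph", objectstore="bluestore", osd_id=None):
    """Build a simplified 'ceph-volume lvm create' argument list.

    Hypothetical illustration only: it covers the bare-metal case and
    a single data device.
    """
    cmd = ["ceph-volume", "--cluster", cluster, "lvm", "create"]
    cmd.append("--%s" % objectstore)
    cmd.extend(["--data", data])
    if osd_id is not None:
        # Command-line arguments must be strings, so convert the integer.
        cmd.extend(["--osd-id", str(osd_id)])
    return cmd


if __name__ == "__main__":
    print(" ".join(sketch_create_cmd("/dev/sdb")))
    print(" ".join(sketch_create_cmd("/dev/sdb", osd_id=0)))
```

The first call yields `ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdb`; the second appends `--osd-id 0` at the end.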

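Once we know where the command is built, the rest is easy: add the --osd-id option and its value to the command assembled in this function. The code change follows.

3.2 Code changes

First, our requirement. Our OSD hosts are named ceph-1, ceph-2 and ceph-3, and each has three raw disks used as OSD data disks. I want the three OSDs on ceph-1 to get OSD IDs 0-2, those on ceph-2 to get 3-5, and those on ceph-3 to get 6-8. Computing the ID needs three inputs: the number encoded in the current hostname (1 for ceph-1, and so on), the number of OSDs per host, and the index of the disk being formatted among all disks on that host, so that osd-id = (host number - 1) * OSDs per host + disk index:

As shown above, I changed the loop and added three parameters. The first two are configured in group_vars/all.yml; the last one is the loop index, starting from 0. Next, we modify the Python code of the ceph_volume module: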

# Location: library/ceph_volume.py
# ...
import re  # used by get_osd_id(); skip if the module already imports it


def prepare_or_create_osd(module, action, container_image):
    # ...
    # Added --osd-id option and value. Passing a custom --osd-id also
    # requires a change in the ceph-volume code itself.
    osd_id = get_osd_id(module)
    if osd_id >= 0:
        # Pitfall: the value must be passed as a string, not an integer,
        # otherwise running the command fails.
        cmd.extend(['--osd-id', str(osd_id)])
    # ...


def get_osd_id(module):
    pattern = module.params.get('osd_id_regex', None)
    osds_per_node = module.params.get('osds_per_node', None)
    current_num = module.params.get('current_num', 0)
    rc, out, _ = module.run_command('hostname')
    # pattern must be set as well, otherwise re.search() raises TypeError
    if rc == 0 and pattern and osds_per_node:
        hostname = out.rstrip("\r\n")
        m = re.search(pattern, hostname, flags=0)
        return (int(m.group(1)) - 1) * osds_per_node + current_num if m else -1
    # Do not return None here: 0 and None are easy to confuse in an if test
    return -1
# ...
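The ID calculation can be tried without Ansible by pulling it into a pure function that takes the hostname as an argument. This is a minimal sketch; `compute_osd_id` and the pattern `ceph-([0-9]+)` are assumptions made for the example, not names from the modified module:

```python
import re


def compute_osd_id(hostname, pattern, osds_per_node, current_num):
    """Return the OSD ID for one disk, or -1 if it cannot be derived.

    pattern must contain one capture group holding the host number;
    current_num is the zero-based index of the disk on that host.
    """
    if not pattern or not osds_per_node:
        return -1
    m = re.search(pattern, hostname)
    if m is None:
        return -1
    return (int(m.group(1)) - 1) * osds_per_node + current_num


if __name__ == "__main__":
    pattern = r"ceph-([0-9]+)"  # assumed pattern for hosts ceph-1..ceph-3
    for host in ("ceph-1", "ceph-2", "ceph-3"):
        ids = [compute_osd_id(host, pattern, 3, k) for k in range(3)]
        print(host, ids)
```

Returning -1 rather than None keeps the `osd_id >= 0` check in the caller unambiguous, since 0 is a valid OSD ID.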

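Here the function now passes --osd-id explicitly for every OSD it creates. Next, the variable changes:

3.3 Testing

Creating the OSDs failed during the run. The cause is in the ceph-volume code: for some reason it refuses to use the OSD ID we pass in:

A look at the code shows why it fails: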

def osd_id_available(osd_id):
    """
    Checks to see if an osd ID exists and if it's available for
    reuse. Returns True if it is, False if it isn't.

    :param osd_id: The osd ID to check
    """
    if osd_id is None:
        return False
    bootstrap_keyring = '/var/lib/ceph/bootstrap-osd/%s.keyring' % conf.cluster
    stdout, stderr, returncode = process.call(
        [
            'ceph',
            '--cluster', conf.cluster,
            '--name', 'client.bootstrap-osd',
            '--keyring', bootstrap_keyring,
            'osd',
            'tree',
            '-f', 'json',
        ],
        show_command=True
    )
    if returncode != 0:
        raise RuntimeError('Unable check if OSD id exists: %s' % osd_id)

    output = json.loads(''.join(stdout).strip())
    osds = output['nodes']
    osd = [osd for osd in osds if str(osd['id']) == str(osd_id)]
    # This check is what makes --osd-id fail for us, so it has to be adjusted
    if osd and osd[0].get('status') == "destroyed":
        return True
    return False
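The effect of that final check is easier to see on sample data. The sketch below reproduces only the filtering logic on an already-parsed `nodes` list; `id_is_reusable` and the sample entries are made up for illustration and are not real cluster output:

```python
def id_is_reusable(nodes, osd_id):
    """Mirror the check above: True only for an existing, destroyed OSD."""
    matches = [n for n in nodes if str(n["id"]) == str(osd_id)]
    return bool(matches) and matches[0].get("status") == "destroyed"


if __name__ == "__main__":
    # Made-up sample in the shape of the 'nodes' list parsed above.
    sample_nodes = [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "destroyed"},
    ]
    for candidate in (0, 1, 3):
        print(candidate, id_is_reusable(sample_nodes, candidate))
```

An ID that does not exist in the tree yet, such as 3 in the sample, produces an empty match list and is therefore rejected, which is exactly the situation on a freshly deployed cluster.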

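Finally, a few tasks patch this piece of code in the ceph-volume tool:

After running the playbook again, the OSD IDs in the cluster are assigned as follows:

The changes were also tested on Huawei Kunpeng servers and worked as expected. Each of those three servers has 36 disks and therefore runs 36 OSD daemons; the only thing that needs care is getting the expression that extracts the host number right:

# Location: group_vars/all.yml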
# ...
host_num_regex: R10-P01-DN-([0-9]+).gd.cn
# ...
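One way to confirm the expression before a real run is to test it offline against the full host list and check that the resulting ID ranges do not overlap. The sketch below is only an illustration: `check_host_regex` is a hypothetical helper, the hostnames are invented to match the pattern, and the dots are escaped here (the unescaped form also matches, but `.` then accepts any character):

```python
import re


def check_host_regex(pattern, hostnames, osds_per_node):
    """Map each hostname to its (first_id, last_id) range.

    Raises ValueError if the pattern misses a host or two hosts would
    be assigned overlapping OSD IDs.
    """
    ranges = {}
    seen = set()
    for name in hostnames:
        m = re.search(pattern, name)
        if m is None:
            raise ValueError("pattern does not match %r" % name)
        first = (int(m.group(1)) - 1) * osds_per_node
        ids = set(range(first, first + osds_per_node))
        if ids & seen:
            raise ValueError("overlapping OSD IDs for %r" % name)
        seen |= ids
        ranges[name] = (first, first + osds_per_node - 1)
    return ranges


if __name__ == "__main__":
    pattern = r"R10-P01-DN-([0-9]+)\.gd\.cn"
    # Hypothetical hostnames, invented for this example.
    hosts = ["R10-P01-DN-001.gd.cn", "R10-P01-DN-002.gd.cn", "R10-P01-DN-003.gd.cn"]
    for host, (first, last) in check_host_regex(pattern, hosts, 36).items():
        print("%s: %d~%d" % (host, first, last))
```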

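4. Summary

With an automated deployment tool, building the cluster is much easier, though special business scenarios may still require custom changes. Overall, ceph-ansible saved us a lot of time. What I like most is that it covers both building and tearing down a cluster: if the first attempt fails, you can wipe the cluster completely and start over.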
