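ClickHouse data migration tool: clickhouse-copier

We needed to migrate ClickHouse from a single node to a replicated cluster, converting every table to a replicated table in the process.

A web search turns up roughly three ways to migrate.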

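I. Copy the data directory

Procedure
1. On the source cluster's disk, archive the data and metadata of the database or table to be migrated
2. Copy them to the corresponding directories on the target cluster
3. Restart clickhouse-server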
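
As a rough illustration of step 1, the sketch below lists the paths that would go into the archive for one table. It assumes the default /var/lib/clickhouse root and the classic data/<database>/<table> plus metadata/<database>/<table>.sql layout; the helper name paths_to_archive is hypothetical, and on servers that use a different storage layout the paths will differ.

```python
from pathlib import PurePosixPath


def paths_to_archive(database, table, root="/var/lib/clickhouse"):
    """Return the data directory and metadata file assumed to belong to one table.

    Assumption: default ClickHouse root and the data/ + metadata/ layout.
    """
    base = PurePosixPath(root)
    return [
        base / "data" / database / table,               # table parts
        base / "metadata" / database / f"{table}.sql",  # table definition
    ]


for path in paths_to_archive("dw", "test"):
    print(path)
```

The resulting list can be handed to tar or any other archiver before copying to the target host.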

II. The remote function

INSERT INTO <local_database>.<local_table>
SELECT * FROM remote('remote_clickhouse_addr', <remote_database>, <remote_table>, '<remote_user>', '<remote_password>')
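
When many tables have to be moved this way, the statement above can be generated rather than typed by hand. The following is a minimal sketch, not part of any ClickHouse client library: build_remote_insert and quote_literal are hypothetical helper names, and the host, database, and credentials in the usage line are sample values.

```python
def quote_literal(value):
    """Quote a Python string as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_remote_insert(local_db, local_table, addr,
                        remote_db, remote_table, user, password):
    """Build the INSERT ... SELECT ... FROM remote(...) statement shown above."""
    return (
        f"INSERT INTO {local_db}.{local_table}\n"
        f"SELECT * FROM remote({quote_literal(addr)}, {remote_db}, {remote_table}, "
        f"{quote_literal(user)}, {quote_literal(password)})"
    )


print(build_remote_insert("dw", "test", "data-hadoop-5:9000",
                          "dw", "test", "default", "test"))
```

The generated text can then be passed to whichever client is used to run queries on the target cluster.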

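III. clickhouse-copier

clickhouse-copier is the data migration tool shipped by the ClickHouse project. It copies tables from one cluster to another (or to the same cluster). clickhouse-copier uses ZooKeeper to coordinate copy tasks, so several clickhouse-copier instances can run at the same time.

Because every migrated table needs a replica set on the cluster, the engine of every migrated table has to be changed to one from the ReplicatedMergeTree family.

Here the migration uses clickhouse-copier together with remote (for tables that have no partitions).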

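clickhouse-copier

Advantages
- Supports concurrent copying: several clickhouse-copier instances can run at once
- Uses ZooKeeper to track write state, so it supports incremental copying (at partition granularity)
- The engine of the destination table can be redefined in the configuration file
- Table and database names do not have to match between source and destination
- The shards and replicas of both source and sink can be specified
Disadvantages
- Configuration is tedious: the configuration file of every clickhouse-copier task has to be uploaded to that task's node in ZooKeeper
- Slow. Compared with copying files directly, clickhouse-copier effectively runs many INSERT INTO operations, so migration is slower
Things to watch out for when using clickhouse-copier
- The source table must define a partition key, otherwise the copy task fails with an error
- ZooKeeper stores which partitions have already been processed. Running clickhouse-copier again copies only the partitions not yet processed. A partition that has already been copied is not copied again, even if new rows have since been added to it on the source.
- clickhouse-copier cannot copy ordinary views but can copy materialized views. To copy a materialized view, prefix the source table name with .inner., otherwise the table is reported as not found
- clickhouse-copier depends on ZooKeeper. To reduce network traffic, run clickhouse-copier on the server that holds the source data.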
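
The partition-level behaviour described in the notes above can be pictured with a toy model. This is not how clickhouse-copier is implemented; it only assumes that the set of processed partitions is the sole state consulted on a rerun, and partitions_to_copy is a hypothetical name.

```python
def partitions_to_copy(source_partitions, processed):
    """Return the partitions a rerun would still copy, in source order."""
    return [p for p in source_partitions if p not in processed]


# Partitions already recorded as done by an earlier run (sample values).
processed = {"20210125", "20210126"}

# The source now has one new partition, plus new rows inside 20210126.
source = ["20210125", "20210126", "20210127"]

# Only the unseen partition is selected; the new rows in 20210126 are missed.
print(partitions_to_copy(source, processed))
```

In this model, picking up late-arriving rows in an already copied partition requires a fresh task path or a separate copy of that partition.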

clickhouse-copier --config test.xml --task-path /clickhouse/copier_task2/download_task

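Common parameters

- --daemon: run the copier in the background; the process is started as a daemon.
- --config: path to zookeeper.xml, which is used to connect to the ZooKeeper cluster.
- --task-path: path of the ZooKeeper node for the task, for example /clickhouse/copier_task/task1. The node stores the task itself and the coordination state shared by the copier processes. Use a different path for each data task, for example /clickhouse/copier_task/task2, task3, task4; for a copy that runs once a day, the path can also be named after the date, for example task-2021-01-27. All copier processes working on the same task must use the same path.
- --task-file: path to the file that defines the task, for example copy-job.xml. Its content is uploaded to the /clickhouse/copier_task/task1/description node in ZooKeeper.
- --task-upload-force: if set to true, the description node under the task path is forcibly overwritten with the content of the task-file.
- --base-dir: directory for logs and auxiliary files. After startup the copier creates a subdirectory named clickhouse-copier_YYYYMMHHSS_<PID> under $base-dir (log files go into that subdirectory, and information about the auxiliary distributed tables goes under its data directory). If the parameter is omitted, the subdirectory is created in the directory where the copier was started.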
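
To keep the parameters consistent across several copier processes, the command line can be assembled in one place. The sketch below is only an illustration: build_copier_command is a hypothetical helper, and passing the value true to --task-upload-force as a separate argument follows the description above and is an assumption about how the option is parsed.

```python
import shlex


def build_copier_command(config, task_path, task_file=None,
                         upload_force=False, base_dir=None, daemon=False):
    """Assemble a clickhouse-copier argument list from the parameters above."""
    argv = ["clickhouse-copier", "--config", config, "--task-path", task_path]
    if daemon:
        argv.append("--daemon")
    if task_file:
        argv += ["--task-file", task_file]
    if upload_force:
        # Assumption: the flag takes the value "true" as its argument.
        argv += ["--task-upload-force", "true"]
    if base_dir:
        argv += ["--base-dir", base_dir]
    return argv


cmd = build_copier_command("zookeeper.xml", "/clickhouse/copier_task/task1",
                           task_file="copy-job.xml", base_dir="/tmp/copier")
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Returning a list rather than a string makes it straightforward to hand the command to subprocess.run without shell quoting problems.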

clickhouse-copier copy workflow

1. Create the zookeeper.xml configuration

<yandex>
    <logger>
        <level>trace</level>
        <log>./log/clickhouse-copier/copier/log.log</log>
        <errorlog>./log/clickhouse-copier/copier/log.err.log</errorlog>
        <size>100M</size>
        <count>3</count>
    </logger>
    <zookeeper>
        <node>
            <host>data-hadoop-1</host>
            <port>2181</port>
        </node>
        <node>
            <host>data-hadoop-2</host>
            <port>2181</port>
        </node>
        <node>
            <host>data-hadoop-3</host>
            <port>2181</port>
        </node>
        <node>
            <host>data-hadoop-4</host>
            <port>2181</port>
        </node>
        <node>
            <host>data-hadoop-5</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>
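
The zookeeper section is repetitive, so for larger ensembles it may be convenient to generate it from a host list. A minimal sketch using only the Python standard library follows; build_zookeeper_config is a hypothetical name, and the logger section from the file above is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_zookeeper_config(hosts, port=2181):
    """Build a <yandex><zookeeper>...</zookeeper></yandex> tree with one node per host."""
    root = ET.Element("yandex")
    zookeeper = ET.SubElement(root, "zookeeper")
    for host in hosts:
        node = ET.SubElement(zookeeper, "node")
        ET.SubElement(node, "host").text = host
        ET.SubElement(node, "port").text = str(port)
    return root


hosts = [f"data-hadoop-{i}" for i in range(1, 6)]
config = build_zookeeper_config(hosts)
# Serialise to text; the logger section would still have to be added by hand.
print(ET.tostring(config, encoding="unicode"))
```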

2. Create the task path in ZooKeeper

zkCli -server hadoop-110:2181 create /clickhouse/copier_task

zkCli -server hadoop-110:2181 create /clickhouse/copier_task/${task_name}
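
ZooKeeper does not create missing parent nodes, which is why the path is built one level at a time above. The sketch below derives the create commands for every level of a task path. The helper names are hypothetical, the command shape simply mirrors the zkCli calls above, and it is assumed that attempting to create a node that already exists only reports an error.

```python
def znode_ancestors(path):
    """Return every prefix of a znode path, shortest first."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def zkcli_create_commands(server, path):
    """Build one zkCli create command per path level."""
    return [f"zkCli -server {server} create {p}" for p in znode_ancestors(path)]


for command in zkcli_create_commands("hadoop-110:2181",
                                     "/clickhouse/copier_task/task1"):
    print(command)
```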

3. Create the corresponding ${task_name}.xml configuration

<yandex>
    <!-- remote_servers is the same as remote_servers in /etc/clickhouse-server/config.xml -->
    <remote_servers>
        <!-- Cluster name of the source cluster to copy from -->
        <test_shard_localhost>
            <shard>
                <internal_replication>false</internal_replication>
                <replica>
                    <host>data-hadoop-5</host>
                    <port>9000</port>
                    <user>default</user>
                    <password>test</password>
                </replica>
            </shard>
        </test_shard_localhost>
        <!-- Target cluster configuration -->
        <bigdata_one_shard_two_replication>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>data-clickhouse-1</host>
                    <port>9000</port>
                    <user>default</user>
                </replica>
                <replica>
                    <host>data-clickhouse-2</host>
                    <port>9000</port>
                    <user>default</user>
                </replica>
            </shard>
        </bigdata_one_shard_two_replication>
    </remote_servers>
    <!-- Maximum number of copier workers -->
    <max_workers>4</max_workers>
    <!-- Tables to copy -->
    <tables>
        <!-- table_test is an arbitrary label, used only to tell the copy tasks of different tables apart -->
        <table_test>
            <!-- Pull side: location of the table being copied -->
            <cluster_pull>test_shard_localhost</cluster_pull>
            <database_pull>dw</database_pull>
            <table_pull>test</table_pull>
            <!-- Push side -->
            <cluster_push>bigdata_one_shard_two_replication</cluster_push>
            <database_push>dw</database_push>
            <table_push>test</table_push>
            <!-- Engine of the destination table; the table is created from this definition during the copy -->
            <engine>
                ENGINE=ReplicatedMergeTree('/clickhouse/tables/{shard}/dw/test', '{replica}')
                PARTITION BY toYYYYMMDD(toDateTime(trace_download_ts))
                ORDER BY tuple()
                SETTINGS index_granularity = 8192
            </engine>
            <!-- Sharding key of the distributed table -->
            <sharding_key>01</sharding_key>
            <!-- Optional filter applied when reading the source data -->
            <!--<where_condition> CounterID != 0 </where_condition>-->
            <!-- Specific partitions to copy; if omitted, all partitions are copied. Partition values are those in the partition column of system.parts for the table -->
            <!-- <enabled_partitions></enabled_partitions>-->
        </table_test>
        <!--<table_2>....</table_2>-->
    </tables>
</yandex>
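
Because the task file is long and a typo only surfaces once the copier starts, a quick local check before uploading can save a round trip. The sketch below is a hypothetical helper, not a clickhouse-copier feature: it verifies that every cluster_pull and cluster_push names a cluster defined under remote_servers and that the destination engine is a Replicated one, which is this migration's own requirement. SAMPLE is a cut-down stand-in for the task file above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<yandex>
  <remote_servers>
    <test_shard_localhost/>
    <bigdata_one_shard_two_replication/>
  </remote_servers>
  <tables>
    <table_test>
      <cluster_pull>test_shard_localhost</cluster_pull>
      <cluster_push>bigdata_one_shard_two_replication</cluster_push>
      <engine>ENGINE=ReplicatedMergeTree('/clickhouse/tables/{shard}/dw/test', '{replica}')</engine>
    </table_test>
  </tables>
</yandex>
"""


def check_task(xml_text):
    """Return a list of human-readable problems found in a copier task file."""
    root = ET.fromstring(xml_text)
    clusters = {cluster.tag for cluster in root.findall("remote_servers/*")}
    problems = []
    for table in root.findall("tables/*"):
        for key in ("cluster_pull", "cluster_push"):
            name = (table.findtext(key) or "").strip()
            if name not in clusters:
                problems.append(f"{table.tag}: {key} '{name}' is not in remote_servers")
        if "Replicated" not in (table.findtext("engine") or ""):
            problems.append(f"{table.tag}: engine is not from the Replicated family")
    return problems


print(check_task(SAMPLE))
```

An empty list means the checks passed; it says nothing about whether the hosts are reachable or the engine clause is valid SQL.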

4. Upload ${task_name}.xml to the ${path}/description node under the matching ZooKeeper path. There are two ways to do this.

One is to write the file content directly to the description node

zkCli -server hadoop-110:2181 create /clickhouse/copier_task/${task_name}/description "`cat ${task_name}.xml`"
The other is to have clickhouse-copier upload the file through the --task-file parameter

clickhouse-copier --config zookeeper.xml --task-path /clickhouse/copier_task/${task_name} --task-file=${task_name}.xml

5. Run clickhouse-copier

clickhouse-copier --config zookeeper.xml --task-path /clickhouse/copier_task/${task_name}