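18_ClickHouse: replica synchronization and high-availability verification, distributed tables and cluster configuration, data replicas and replicated tables, ZooKeeper integration, creating replicated tables, replica synchronization mechanism, atomic writes and deduplication, load-balancing policies, case study (study notes)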

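24. Replica synchronization and high-availability verification
24.1. Distributed tables and cluster configuration
24.2. Data replicas and replicated tables
24.3. ZooKeeper integration
24.4. Creating replicated tables
24.5. Replica synchronization mechanism
24.6. Atomic writes and deduplication
24.7. Load-balancing policies
24.8. Case study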

24. Replica synchronization and high-availability verification

This part continues from: https://blog.csdn.net/tototuzuoquan/article/details/111027342

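24.1. Distributed tables and cluster configuration

A distributed table is created with the Distributed engine and runs distributed queries across multiple shards.
Reads are parallelized automatically and use the indexes on the remote servers, if there are any.
Data is partially processed on the remote servers as far as possible. For a GROUP BY query, for example, data is aggregated on the remote servers, the intermediate states of the aggregate functions are sent to the requesting server, and the data is then aggregated further there.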
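The two-stage GROUP BY described above can be modelled in a few lines of Python. This is only a conceptual sketch, not ClickHouse code: the function names, the sample rows, and the choice of an average with a (sum, count) state are all assumptions made for illustration.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Runs on each shard: group rows and keep an intermediate [sum, count] state."""
    states = defaultdict(lambda: [0, 0])
    for key, value in rows:
        states[key][0] += value
        states[key][1] += 1
    return dict(states)


def merge_states(per_shard_states):
    """Runs on the requesting server: merge the shard states, then finalize the average."""
    merged = defaultdict(lambda: [0, 0])
    for states in per_shard_states:
        for key, (total, count) in states.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical rows (group key, value) held by two shards.
shard1 = [("a", 10), ("b", 4), ("a", 20)]
shard2 = [("a", 30), ("b", 6)]

result = merge_states([partial_aggregate(shard1), partial_aggregate(shard2)])
print(result)  # {'a': 20.0, 'b': 5.0}
```

Only the small per-key states cross the network, not the raw rows, which is why the intermediate state (rather than a finished average) has to be what each shard sends back.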

Creating a distributed table:

ENGINE = Distributed(cluster_name, db_name, table_name[, sharding_key[, policy_name]])

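Parameters:
cluster_name: the cluster name.
db_name: the database name; a constant expression such as currentDatabase() can be used.
table_name: the name of the table on each shard.
sharding_key: (optional) the sharding key; it can be set to rand().
policy_name: (optional) the policy name, used to store temporary files for asynchronous sends.

For example, here is part of /etc/metrika.xml: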

<remote_servers>
    <logs>
        <shard>
            <weight>1</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-01-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-01-2</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <weight>2</weight>
            <internal_replication>false</internal_replication>
            <replica>
                <host>example01-02-1</host>
                <port>9000</port>
            </replica>
            <replica>
                <host>example01-02-2</host>
                <secure>1</secure>
                <port>9440</port>
            </replica>
        </shard>
    </logs>
</remote_servers>

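This defines a cluster named logs that consists of two shards, each containing two replicas.

Shards are servers that hold different parts of the data (to read all of the data, you must access every shard).

Replicas are servers that store replicated data (to read all of a shard's data, it is enough to access any one of its replicas).

1. weight: optional. The weight of the shard when writing data. It is recommended to leave it unset.
2. internal_replication: optional. Whether data is written to only one of the replicas. Default: false (data is written to all replicas). Setting it to true is recommended when the underlying tables are replicated: writing one replica is enough, and it avoids duplicate writes.
3. Replica settings: the connection details of each server. host and port are required; user, password, secure, and compression are optional.
(1) host: the address of the remote server. IPv4 and IPv6 are supported. A domain name can also be used; if its DNS record changes, the server must be restarted.
(2) port: the TCP port for messaging, as set by tcp_port in the configuration file. It is usually 9000.
(3) user: the user name for connecting to the remote server. Default: default. Access rights are configured in the users.xml file.
(4) password: the password for connecting to the remote server. Default: empty string.
(5) secure: use SSL for the connection; usually port=9440 should be defined as well.
(6) compression: use data compression. Default: true.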
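To make the effect of weight concrete, the sketch below models how a row could be routed: the sharding-key value is taken modulo the total weight, and the remainder falls into the half-open interval owned by one shard. The function name pick_shard and the integer test values are assumptions for illustration; the weights 1 and 2 are the ones from the logs cluster above.

```python
def pick_shard(sharding_value, weights):
    """Return the 0-based index of the shard that receives a row.

    The remainder of sharding_value divided by the total weight is mapped to
    the shard whose interval [previous weights, previous weights + weight)
    contains it. All weights are assumed to be positive integers.
    """
    remainder = sharding_value % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise AssertionError("unreachable: the remainder is always below the total weight")


weights = [1, 2]  # <weight> of the first and second shard in the logs cluster
counts = [0, 0]
for value in range(9000):  # stand-in for evenly spread sharding-key values
    counts[pick_shard(value, weights)] += 1

print(counts)  # [3000, 6000]
```

With weights 1 and 2, the second shard receives about twice as many rows as the first, which is the skew the weight setting is meant to express.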

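24.2. Data replicas and replicated tables

Only engines in the MergeTree family support data replication; a replication-capable engine is named by adding the prefix Replicated to the MergeTree engine name.
Replication works at the table level, not the server level, so one server can hold replicated and non-replicated tables at the same time.
Replication does not depend on sharding: each shard has its own independent replicas.

The replicated table engines are: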

ReplicatedMergeTree
ReplicatedSummingMergeTree
ReplicatedReplacingMergeTree
ReplicatedAggregatingMergeTree
ReplicatedCollapsingMergeTree
ReplicatedVersionedCollapsingMergeTree
ReplicatedGraphiteMergeTree

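24.3. ZooKeeper integration

ClickHouse uses Apache ZooKeeper to store replica metadata; the ZooKeeper parameters are set in the configuration file.
The ZooKeeper path is specified when a replicated table is created, and that path is created automatically at table creation.
If ZooKeeper is not set in the ClickHouse configuration file, replicated tables cannot be created and any existing replicated tables are read-only.
Queries on a local replicated table do not use ZooKeeper, so they are as fast as on a non-replicated table.

For inserts into a local replicated table, each data block (a block holds at most max_insert_block_size = 1048576 rows) adds about ten entries to ZooKeeper through several transactions. As a result, INSERT latency is slightly higher for replicated tables than for non-replicated ones.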

<zookeeper>
    <node index="1">
        <host>example1</host>
        <port>2181</port>
    </node>
    <node index="2">
        <host>example2</host>
        <port>2181</port>
    </node>
    <node index="3">
        <host>example3</host>
        <port>2181</port>
    </node>
</zookeeper>

24.4. Creating replicated tables

The engine name of a replicated table starts with Replicated, for example ReplicatedMergeTree.

CREATE TABLE table_name
(
    EventDate DateTime,
    CounterID UInt32,
    UserID UInt32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_name', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID);

The engine parameters contain variables, which are substituted from the "macros" section of the configuration file, for example:

<macros>
    <layer>05</layer>
    <shard>02</shard>
    <replica>clickhouse1</replica>
</macros>

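Replicated*MergeTree engine parameters:
zoo_path: the path to the table in ZooKeeper.
replica_name: the replica name in ZooKeeper.

1. The first parameter, the ZooKeeper path, consists of:
(1) A common prefix: /clickhouse/tables/. It is recommended that all replicated tables use a prefix like this.
(2) A shard identifier: {layer}-{shard}. In this example it has two parts; all that matters is that it uniquely identifies a shard.
(3) The ZooKeeper node name: table_name. It is best to make the node name the same as the table name. The node name does not change after it is defined, even if the table is renamed.

2. The second parameter is the replica name, which distinguishes the replicas of one shard. It only needs to be unique within each shard.
The example above uses variable substitution for the engine parameters. ClickHouse also accepts explicit parameters, but in that case distributed DDL queries (ON CLUSTER) cannot be used. Passing the parameters through variable substitution is recommended because it reduces the chance of mistakes.

Run the CREATE TABLE statement on each replica server. If the table for that shard already exists on other nodes and contains data, the new replica automatically synchronizes the data from the other replicas.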
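The variable substitution can be pictured as a plain placeholder expansion. The sketch below is not how ClickHouse implements it; expand_macros is a hypothetical helper, and the macro values are assumed to be layer=05, shard=02, and replica=clickhouse1, matching the placeholders used in the CREATE TABLE statement.

```python
import re


def expand_macros(template, macros):
    """Replace every {name} in template with macros[name]; fail on unknown names."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"macro {name!r} is not defined")
        return macros[name]

    return re.sub(r"\{(\w+)\}", substitute, template)


# Assumed per-server macro values; every replica server would define its own.
macros = {"layer": "05", "shard": "02", "replica": "clickhouse1"}

zoo_path = expand_macros("/clickhouse/tables/{layer}-{shard}/table_name", macros)
replica_name = expand_macros("{replica}", macros)

print(zoo_path)      # /clickhouse/tables/05-02/table_name
print(replica_name)  # clickhouse1
```

Because each server carries its own values, the same CREATE TABLE text yields a shared ZooKeeper path for the replicas of one shard and a different replica name on each server. A placeholder with no matching macro name fails in this sketch, which is a reminder that the names in the configuration must match the placeholders exactly.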

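24.5. Replica synchronization mechanism

Replication is asynchronous and multi-master.
INSERT statements (as well as ALTER) can be run on any available server. Data is first inserted on the local server (the one running the query) and is then replicated to the other servers.

Because replication is asynchronous, recently inserted data appears on the other replicas with some delay.

If some replicas are unavailable, the data is written to them once they become available.

If a replica is available, the delay is the time it takes to transfer the block of compressed data over the network.

By default, an INSERT waits for confirmation from only one replica before returning. If the data was written successfully to only one replica and that replica's server ceases to exist, the stored data is lost. To require write confirmation from multiple replicas, use the insert_quorum setting.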
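The risk described above can be shown with a toy model. It is a deliberately simplified sketch under stated assumptions: the Replica class, the insert function, and the quorum argument are invented for illustration, asynchronous catch-up is left out entirely, and nothing here reflects how insert_quorum is implemented inside ClickHouse.

```python
class Replica:
    """A toy replica that is either alive or lost and holds a set of blocks."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blocks = set()


def insert(block, replicas, quorum=1):
    """Write block to `quorum` live replicas before reporting success."""
    live = [replica for replica in replicas if replica.alive]
    if len(live) < quorum:
        raise RuntimeError("not enough live replicas to satisfy the quorum")
    for replica in live[:quorum]:
        replica.blocks.add(block)


def surviving_copies(block, replicas):
    """Count the live replicas that still hold the block."""
    return sum(1 for r in replicas if r.alive and block in r.blocks)


# One acknowledgement only: the acknowledging server is lost before it replicates.
a, b = Replica("a"), Replica("b")
insert("block-1", [a, b], quorum=1)
a.alive = False
lost = surviving_copies("block-1", [a, b])

# Two acknowledgements: losing one server still leaves a copy.
a2, b2 = Replica("a"), Replica("b")
insert("block-1", [a2, b2], quorum=2)
a2.alive = False
kept = surviving_copies("block-1", [a2, b2])

print(lost, kept)  # 0 1
```

The price of the larger quorum is that the insert fails outright when too few replicas are reachable, so the setting trades write availability for durability.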

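24.6. Atomic writes and deduplication

An INSERT query writes data in blocks of at most max_insert_block_size rows (default: max_insert_block_size = 1048576). Each block is written atomically, so an INSERT of fewer than 1048576 rows is atomic as a whole.

Data blocks are deduplicated. If the same block is written several times (blocks of the same size containing the same rows in the same order), it is written only once. After a network failure or a similar fault, the client application may not know whether the data reached the database, so it can simply repeat the INSERT query. It does not matter which replica receives the repeated data: INSERT is idempotent. Deduplication is controlled by the insert_deduplicate setting, which defaults to 1 (deduplication enabled).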
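Block-level deduplication can be sketched as a set of block fingerprints. This is an illustrative model only: the DedupTable class and the use of SHA-256 over repr() are assumptions, and the real server keeps its block checksums in ZooKeeper rather than in memory.

```python
import hashlib


class DedupTable:
    """A toy table that skips a block it has already accepted."""

    def __init__(self):
        self.seen = set()  # fingerprints of accepted blocks
        self.rows = []

    def insert_block(self, block):
        """Insert a block (a list of row tuples); return False if it was a duplicate."""
        digest = hashlib.sha256(repr(block).encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        self.rows.extend(block)
        return True


table = DedupTable()
block = [("2020-03-11 12:12:33", 22, 37)]

first = table.insert_block(block)        # accepted
retry = table.insert_block(list(block))  # same rows, same order: skipped

print(first, retry, len(table.rows))  # True False 1
```

A retry after an uncertain failure is therefore safe as long as the client resends exactly the same block. Rows sent in a different order or split into different blocks produce a different fingerprint and would be inserted again.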

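During replication, only the source data being inserted is transferred over the network. Further data transformation (merging) is carried out in the same way on all replicas. This minimizes network bandwidth usage, so replication works well even when the replicas are in different data centers.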

ClickHouse monitors data synchronization across replicas internally and can recover after a failure. Failover is automatic (for small differences in data) or semi-automatic (when the data differs too much, which may indicate a configuration error).

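24.7. Load-balancing policies

When a distributed query runs, the number of errors is first counted for each replica of a shard, and the query is sent to the replica with the fewest errors. If there are no errors or the counts are equal, the replica is chosen by one of the following policies:
1. random (default): send the query to any one of the replicas.
2. nearest_hostname: send the query to the replica whose hostname is most similar to the local server's.
3. in_order: send the query to the replicas in the order in which they are listed in the configuration file.
4. first_or_random: choose the first replica; if it is unavailable, choose a random available replica.

Setting the policy: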

set load_balancing = 'first_or_random';
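The selection logic can be sketched as follows. This is a simplified model, not the server's code: pick_replica and hostname_distance are hypothetical names, hostname similarity is taken to be the number of differing characters at identical positions, and for first_or_random a replica with more errors than the best one stands in for an unavailable replica.

```python
import random


def hostname_distance(local, remote):
    """Count differing characters at identical positions, up to the shorter length."""
    return sum(1 for x, y in zip(local, remote) if x != y)


def pick_replica(replicas, errors, policy, local_host="", rng=random):
    """Choose one replica of a shard: fewest errors first, then the policy."""
    fewest = min(errors[r] for r in replicas)
    candidates = [r for r in replicas if errors[r] == fewest]
    if policy == "random":
        return rng.choice(candidates)
    if policy == "nearest_hostname":
        return min(candidates, key=lambda r: hostname_distance(local_host, r))
    if policy == "in_order":
        return candidates[0]  # replicas are listed in configuration order
    if policy == "first_or_random":
        first = replicas[0]
        return first if first in candidates else rng.choice(candidates)
    raise ValueError(f"unknown policy: {policy}")


replicas = ["example01-01-1", "example01-01-2"]
healthy = {"example01-01-1": 0, "example01-01-2": 0}
degraded = {"example01-01-1": 3, "example01-01-2": 0}

print(pick_replica(replicas, healthy, "in_order"))
print(pick_replica(replicas, healthy, "nearest_hostname", local_host="example01-01-2"))
print(pick_replica(replicas, degraded, "first_or_random"))
```

In this model the error count always wins: with the degraded counts every policy picks example01-01-2, and the policy only breaks ties among equally healthy replicas.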

24.8. Case study

1. Run the following statements on all nodes:
Create the local replicated table:

CREATE TABLE table_local on cluster mycluster
(
    EventDate DateTime,
    CounterID UInt32,
    UserID UInt32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_local', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID);

Execution results:
On the clickhouse1 node the result is:

clickhouse1 :) CREATE TABLE table_local on cluster mycluster
:-] (
:-] EventDate DateTime,
:-] CounterID UInt32,
:-] UserID UInt32
:-] ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_local', '{replica}')
:-] PARTITION BY toYYYYMM(EventDate)
:-] ORDER BY (CounterID, EventDate, intHash32(UserID))
:-] SAMPLE BY intHash32(UserID);

CREATE TABLE table_local ON CLUSTER mycluster
(
    `EventDate` DateTime,
    `CounterID` UInt32,
    `UserID` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_local', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)

┌─host────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 192.168.106.103 │ 9000 │      0 │       │                   3 │                2 │
│ 192.168.106.105 │ 9000 │      0 │       │                   2 │                2 │
└─────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘
┌─host────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 192.168.106.106 │ 9000 │      0 │       │                   1 │                1 │
└─────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘
┌─host────────────┬─port─┬─status─┬─error─┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 192.168.106.104 │ 9000 │      0 │       │                   0 │                0 │
└─────────────────┴──────┴────────┴───────┴─────────────────────┴──────────────────┘

4 rows in set. Elapsed: 0.419 sec.

clickhouse1 :)

On clickhouse2 through clickhouse4 the result is as follows (the table is reported as already existing):

clickhouse2 :) CREATE TABLE table_local on cluster mycluster
:-] (EventDate DateTime,
:-] CounterID UInt32,
:-] UserID UInt32
:-] ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_local', '{replica}')
:-] PARTITION BY toYYYYMM(EventDate)
:-] ORDER BY (CounterID, EventDate, intHash32(UserID))
:-] SAMPLE BY intHash32(UserID);

CREATE TABLE table_local ON CLUSTER mycluster
(
    `EventDate` DateTime,
    `CounterID` UInt32,
    `UserID` UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/table_local', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)

┌─host────────────┬─port─┬─status─┬─error─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 192.168.106.103 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.table_local already exists. (version 20.9.3.45 (official build)) │                   3 │                2 │
└─────────────────┴──────┴────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
┌─host────────────┬─port─┬─status─┬─error─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─num_hosts_remaining─┬─num_hosts_active─┐
│ 192.168.106.104 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.table_local already exists. (version 20.9.3.45 (official build)) │                   2 │                0 │
│ 192.168.106.105 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.table_local already exists. (version 20.9.3.45 (official build)) │                   1 │                0 │
│ 192.168.106.106 │ 9000 │     57 │ Code: 57, e.displayText() = DB::Exception: Table default.table_local already exists. (version 20.9.3.45 (official build)) │                   0 │                0 │
└─────────────────┴──────┴────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┴──────────────────┘
← Progress: 4.00 rows, 720.00 B (26.16 rows/s., 4.71 KB/s.)  99%
Received exception from server (version 20.9.3):
Code: 57. DB::Exception: Received from localhost:9000. DB::Exception: There was an error on [192.168.106.103:9000]: Code: 57, e.displayText() = DB::Exception: Table default.table_local already exists. (version 20.9.3.45 (official build)).

4 rows in set. Elapsed: 0.154 sec.

clickhouse2 :)

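This case shows that once the replicated table has been created with ON CLUSTER on one node, it already exists on the other nodes as well.

Create the distributed table: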

CREATE TABLE table_distributed as table_local ENGINE = Distributed(mycluster, default, table_local, rand());

(1) Verifying replication
On clickhouse1, work with the local table.

clickhouse1 :) insert into table_local values('2020-03-11 12:12:33',22,37);

INSERT INTO table_local VALUES

Ok.

1 rows in set. Elapsed: 0.017 sec.

clickhouse1 :) select * from table_local;

SELECT *
FROM table_local

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:33 │        22 │     37 │
└─────────────────────┴───────────┴────────┘

1 rows in set. Elapsed: 0.002 sec.

clickhouse1 :)

On clickhouse2 (the other replica of clickhouse1's shard), verify that the data has been synchronized:

clickhouse2 :) select * from table_local;

SELECT *
FROM table_local

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:33 │        22 │     37 │
└─────────────────────┴───────────┴────────┘

1 rows in set. Elapsed: 0.019 sec.

clickhouse2 :)

On clickhouse3 and clickhouse4 (that is, shard2), the query returns no rows:

clickhouse3 :) select * from table_local;

SELECT *
FROM table_local

Ok.

0 rows in set. Elapsed: 0.005 sec.

clickhouse3 :)

clickhouse4 :) select * from table_local;

SELECT *
FROM table_local

Ok.

0 rows in set. Elapsed: 0.005 sec.

clickhouse4 :)

(2) Verifying the cluster
Query the distributed table on any node (every node shows the result below).

clickhouse1 :) select * from table_distributed;

SELECT *
FROM table_distributed

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:33 │        22 │     37 │
└─────────────────────┴───────────┴────────┘

1 rows in set. Elapsed: 0.009 sec.

clickhouse1 :)

Insert five rows into the distributed table, from any of the nodes:

insert into table_distributed values('2020-03-11 12:12:31', 21, 1);    -- run on clickhouse4
insert into table_distributed values('2020-03-12 12:12:32', 22, 2);    -- run on clickhouse4
insert into table_distributed values('2020-03-13 12:12:33', 23, 3);    -- run on clickhouse3
insert into table_distributed values('2020-03-14 12:12:34', 24, 4);    -- run on clickhouse3
insert into table_distributed values('2020-03-15 12:12:35', 25, 5);    -- run on clickhouse2

Then run the following on any machine:

select * from table_distributed;

Every node shows:

clickhouse1 :) select * from table_distributed;

SELECT *
FROM table_distributed

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:33 │        22 │     37 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-12 12:12:32 │        22 │      2 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-14 12:12:34 │        24 │      4 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-15 12:12:35 │        25 │      5 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:31 │        21 │      1 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-13 12:12:33 │        23 │      3 │
└─────────────────────┴───────────┴────────┘

6 rows in set. Elapsed: 0.009 sec.

clickhouse1 :)

Next, query the local table on the hosts of each of the two shards.
On clickhouse1 and clickhouse2 (shard1) the result is:

clickhouse1 :) select * from table_local;

SELECT *
FROM table_local

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:33 │        22 │     37 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-12 12:12:32 │        22 │      2 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-14 12:12:34 │        24 │      4 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-15 12:12:35 │        25 │      5 │
└─────────────────────┴───────────┴────────┘

4 rows in set. Elapsed: 0.004 sec.

clickhouse1 :)

On clickhouse3 and clickhouse4 (shard2) the result is:

clickhouse3 :) select * from table_local;

SELECT *
FROM table_local

┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-11 12:12:31 │        21 │      1 │
└─────────────────────┴───────────┴────────┘
┌───────────EventDate─┬─CounterID─┬─UserID─┐
│ 2020-03-13 12:12:33 │        23 │      3 │
└─────────────────────┴───────────┴────────┘

2 rows in set. Elapsed: 0.002 sec.

clickhouse3 :)

As shown, when data is inserted through the distributed table, the rows are spread across the local tables of the different shards.