Importing into a ClickHouse distributed table with Waterdrop 1.x: the default approach

First, here is what the official ClickHouse output plugin documentation says about distributed tables.

Official documentation: https://interestinglab.github.io/seatunnel-docs/#/zh-cn/v1/configuration/output-plugins/Clickhouse

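Distributed table configuration

ClickHouse {
    host = "localhost:8123"
    database = "nginx"
    table = "access_msg"
    cluster = "no_replica_cluster"
    fields = ["date", "datetime", "hostname", "http_code", "data_size", "ua", "request_time"]
}

Based on the cluster name provided, the plugin looks up in the system.clusters table which nodes the table is actually distributed across. The data of a single Spark partition is written to one ClickHouse node, chosen by a random policy.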

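From this description we can tell that Waterdrop actually writes to the local tables, and that the data distribution policy is random.

Let's test it in practice:

Before testing, make sure your ClickHouse distributed configuration is already complete. For reference, see: clickhouse集群模式配置_cakecc2008的专栏-CSDN博客

1. Create the tables: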

-- Create the local table; this must be run on every node
DROP TABLE IF EXISTS dw_local.dist_test;
CREATE TABLE dw_local.dist_test (
    id        String COMMENT 'id',
    user_name String COMMENT 'user name'
)
ENGINE = MergeTree
PRIMARY KEY (id)
ORDER BY (id);

TRUNCATE TABLE dw_local.dist_test;

-- Create the distributed table
DROP TABLE IF EXISTS dw.dist_test;
CREATE TABLE dw.dist_test (
    id        String COMMENT 'id',
    user_name String COMMENT 'user name'
)
ENGINE = Distributed(dw_cluster, dw_local, dist_test);

SELECT * FROM dw_local.dist_test t;
SELECT * FROM dw.dist_test t;

2. Prepare the data:

vi /tmp/dist_test.csv
id,user_name
1,zhangsan
2,lisi
3,wangwu
4,lili
5,lucy
6,poli
7,lilei
8,hanmeimei
9,liudehua
10,wuyanzu

3. Waterdrop configuration:

vi /tmp/dist_test.conf

spark {
    spark.app.name = "Waterdrop"
    spark.executor.instances = 1
    spark.executor.cores = 1
    spark.executor.memory = "1g"
    spark.sql.catalogImplementation = "hive"
}
input {
    file {
        path = "file:///tmp/dist_test.csv"
        format = "csv"
        options.header = "true"
        result_table_name = "dist_test"
    }
}
filter {
    repartition {
        # Repartition the data. The test data set is tiny, so without
        # repartitioning every row would go to the same node. The source
        # code analysis later in this article explains why.
        num_partitions = 5
    }
}
output {
    clickhouse {
        host = "10.1.99.191:8123"
        # Waterdrop writes to the local tables, so database must be the
        # database that holds the local table.
        database = "dw_local"
        table = "dist_test"
        cluster = "dw_cluster"
        username = "user"
        password = "password"
    }
}

4. Run the import:

bin/start-waterdrop.sh --master local[1] --deploy-mode client --config /tmp/dist_test.conf

5. Query the data:

-- Node 1
select * from  dw_local.dist_test t  ;

Query id: ff2dfdb8-1d58-413a-8fe6-f17992630d1a

┌─id─┬─user_name─┐
│ 8  │ hanmeimei │
│ 9  │ liudehua  │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 10 │ wuyanzu   │
│ 5  │ lucy      │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 4  │ lili      │
│ 7  │ lilei     │
└────┴───────────┘
┌─id─┬─user_name─┐
│ 1  │ zhangsan  │
│ 2  │ lisi      │
└────┴───────────┘

-- Node 2
select * from  dw_local.dist_test t  ;

Query id: ed9ca714-d301-4691-b1bd-7fbf7f34be07

┌─id─┬─user_name─┐
│ 3  │ wangwu    │
│ 6  │ poli      │
└────┴───────────┘

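Your test result will quite likely differ from mine, because the policy is random. Next, let's look at the cluster distribution policy of the ClickHouse output plugin at the source code level.

6. Source code analysis:

Version 1.5.1

From the official documentation we know:

The call structure of an Output plugin is similar to that of a Filter plugin. When invoked, it first runs the checkConfig method to verify that the parameters passed to the plugin are correct, then calls the prepare method to set default values for parameters and initialize the class's member variables, and finally calls the process method to write the data, in Dataset[Row] format, to the external data source.

So we only need to focus on two methods, prepare and process. The analysis is written in the code comments.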
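To get a feel for how often the per-partition random choice leaves a node with no data, here is a minimal Python sketch. It assumes each partition picks a shard independently and uniformly, which is how the documentation describes the policy. The function names are illustrative only and are not part of Waterdrop.

```python
from fractions import Fraction


def prob_all_on_one_shard(num_partitions: int, num_shards: int) -> Fraction:
    """Probability that every partition picks the same shard.

    Assumption: each partition chooses a shard independently and uniformly.
    """
    # The first partition may land anywhere; every later one must match it.
    return Fraction(1, num_shards) ** (num_partitions - 1)


def prob_shard_left_empty(num_partitions: int, num_shards: int) -> Fraction:
    """Probability that one particular shard is chosen by no partition."""
    return Fraction(num_shards - 1, num_shards) ** num_partitions


if __name__ == "__main__":
    # The setup of the test above: 5 partitions, 2 shards.
    print(prob_all_on_one_shard(5, 2))
    # A single partition can only ever reach one shard.
    print(prob_all_on_one_shard(1, 2))
```

Under these assumptions, the test above (5 partitions, 2 shards) puts everything on one node with probability 1/16, and a single partition always ends up on a single node, which is why the configuration repartitions the tiny CSV file.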

package io.github.interestinglab.waterdrop.output.batch

class Clickhouse extends BaseOutput {

  override def prepare(spark: SparkSession): Unit = {
    this.jdbcLink = String.format("jdbc:clickhouse://%s/%s", config.getString("host"), config.getString("database"))

    val balanced: BalancedClickhouseDataSource = new BalancedClickhouseDataSource(this.jdbcLink, properties)
    val conn = balanced.getConnection.asInstanceOf[ClickHouseConnectionImpl]

    this.table = config.getString("table")
    this.tableSchema = getClickHouseSchema(conn, table)

    if (this.config.hasPath("fields")) {
      this.fields = config.getStringList("fields")
      val (flag, msg) = acceptedClickHouseSchema()
      if (!flag) {
        throw new ConfigRuntimeException(msg)
      }
    }

    val defaultConfig = ConfigFactory.parseMap(
      Map(
        "bulk_size" -> 20000,
        // "retry_codes" -> util.Arrays.asList(ClickHouseErrorCode.NETWORK_ERROR.code),
        "retry_codes" -> util.Arrays.asList(),
        "retry" -> 1
      )
    )

    // Check whether the config file sets the cluster parameter
    if (config.hasPath("cluster")) {
      this.cluster = config.getString("cluster")
      // Fetch the cluster info from the database; it is used later in process.
      // clusterInfo is in effect an array
      this.clusterInfo = getClickHouseClusterInfo(conn, cluster)
      if (this.clusterInfo.size == 0) {
        val errorInfo = s"cloud not find cluster config in system.clusters, config cluster = $cluster"
        logError(errorInfo)
        throw new RuntimeException(errorInfo)
      }
      logInfo(s"get [$cluster] config from system.clusters, the replica info is [$clusterInfo].")
    }

    config = config.withFallback(defaultConfig)
    retryCodes = config.getIntList("retry_codes")
    super.prepare(spark)
  }

  override def process(df: Dataset[Row]): Unit = {
    val dfFields = df.schema.fieldNames
    val bulkSize = config.getInt("bulk_size")
    val retry = config.getInt("retry")

    if (!config.hasPath("fields")) {
      fields = dfFields.toList
    }

    this.initSQL = initPrepareSQL()
    logInfo(this.initSQL)

    // Dataset.foreachPartition iterates over partitions, so the "random" choice
    // is made once per partition. With a single partition, all data goes to one shard
    df.foreachPartition { iter =>
      var jdbcUrl = this.jdbcLink

      // If clusterInfo contains data, we are in cluster mode
      if (this.clusterInfo != null && this.clusterInfo.size > 0) {
        // using random policy to select shard when insert data
        // Core of the random policy: a random integer in [0, clusterInfo.size)
        val randomShard = (Math.random() * this.clusterInfo.size).asInstanceOf[Int]
        // Pick one shard based on the random number above
        val shardInfo = this.clusterInfo.get(randomShard)
        // Only the host comes from the shard; everything else still comes from the config file
        val host = shardInfo._4
        val port = getJDBCPort(this.jdbcLink)
        // The database name also comes from the config file, so the config must
        // name the database that holds the local table
        val database = config.getString("database")
        // Reassign jdbcUrl; in practice only the host changes
        jdbcUrl = s"jdbc:clickhouse://$host:$port/$database"
        logInfo(s"cluster mode, select shard index [$randomShard] to insert data, the jdbc url is [$jdbcUrl].")
      } else {
        logInfo(s"single mode, the jdbc url is [$jdbcUrl].")
      }

      val executorBalanced = new BalancedClickhouseDataSource(jdbcUrl, this.properties)
      val executorConn = executorBalanced.getConnection.asInstanceOf[ClickHouseConnectionImpl]
      val statement = executorConn.createClickHousePreparedStatement(this.initSQL, ResultSet.TYPE_FORWARD_ONLY)
      var length = 0

      // Add rows to the batch buffer
      while (iter.hasNext) {
        val row = iter.next()
        length += 1
        renderStatement(fields, row, dfFields, statement)
        statement.addBatch()

        // Once the buffer reaches the threshold (20000 by default), write it out
        if (length >= bulkSize) {
          execute(statement, retry)
          length = 0
        }
      }

      execute(statement, retry)
    }
  }
}
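The buffering at the end of process (add rows until bulk_size is reached, write, then write whatever is left) can be sketched in a few lines of Python. This is an illustration of the pattern only, not the plugin's code: flush_in_batches and the execute callback are hypothetical names, and unlike the Scala code above this sketch skips the final write when the buffer is empty.

```python
from typing import Callable, Iterable, List


def flush_in_batches(
    rows: Iterable,
    bulk_size: int,
    execute: Callable[[List], None],
) -> None:
    """Hand rows to `execute` in chunks of at most `bulk_size`.

    `execute` is a hypothetical stand-in for the JDBC batch write.
    """
    batch: List = []
    for row in rows:
        batch.append(row)
        # Same threshold check as `length >= bulkSize` in the Scala code.
        if len(batch) >= bulk_size:
            execute(batch)
            batch = []
    # Final flush for the rows that did not fill a whole batch.
    if batch:
        execute(batch)


if __name__ == "__main__":
    flush_in_batches(range(5), 2, lambda b: print(list(b)))
```

Because the loop runs inside one partition and one connection, every batch produced this way goes to the shard that was picked at the start of that partition.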

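7. Summary

When the ClickHouse output plugin writes to a distributed table, it writes directly to the local tables, so there are no major performance problems.
Shards are assigned randomly. The core code is: (Math.random() * this.clusterInfo.size).asInstanceOf[Int]. More precisely, the choice is random per partition: with N partitions, each partition picks a shard at random once, and all rows of one partition always go to the same shard.
The random policy has two drawbacks:

1) The data is unevenly distributed. In my test with 50 million rows and 2 nodes, the deviation reached 16%.

2) You cannot specify or predict which shard a row will land on, which makes later joins or group-by queries less efficient.

Note that the 2.x version has no distributed-table write feature, possibly for the two reasons above.

8. Further thoughts

Since the random policy does not work well in practice, how can this be solved?

1. Modify the source code to add a hash policy, so that rows can be hashed on a specified field.

2. Without modifying the source code, how can local writes to a distributed table be achieved?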
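The first drawback can be explored with a small simulation. The sketch below is in Python and rests on assumptions that are not from the test above: equal-sized partitions, and a partition count and row count chosen only for illustration. The shard choice mirrors the (Math.random() * size).asInstanceOf[Int] expression.

```python
import random
from typing import List, Optional


def simulate_shard_rows(
    num_partitions: int,
    rows_per_partition: int,
    num_shards: int,
    seed: Optional[int] = None,
) -> List[int]:
    """Return the number of rows each shard receives in one simulated run."""
    rng = random.Random(seed)
    rows = [0] * num_shards
    for _ in range(num_partitions):
        # One random pick per partition; every row of the partition follows it.
        shard = int(rng.random() * num_shards)
        rows[shard] += rows_per_partition
    return rows


def max_deviation(rows: List[int]) -> float:
    """Largest relative deviation of any shard from a perfectly even share."""
    even = sum(rows) / len(rows)
    return max(abs(r - even) for r in rows) / even


if __name__ == "__main__":
    # Illustrative numbers only: 200 partitions of 250,000 rows on 2 shards.
    rows = simulate_shard_rows(200, 250_000, 2)
    print(rows, max_deviation(rows))
```

The fewer partitions there are, the larger the swings tend to be, because each pick moves a whole partition's worth of rows at once.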
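As a rough illustration of the first idea (a hash policy on a chosen field), here is a minimal Python sketch. It is not Waterdrop code, and the hash function is an assumption: crc32 is used only because it is stable across runs. It will not match the sharding expression of a ClickHouse Distributed table unless that table is set up to use the same function.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_for_key(key: str, num_shards: int) -> int:
    """Map a key to a shard index deterministically."""
    # crc32 gives the same value on every run, unlike Python's hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_shards


def group_rows_by_shard(
    rows: Iterable[dict], key_field: str, num_shards: int
) -> Dict[int, List[dict]]:
    """Group rows so that each group can be written to exactly one shard."""
    groups: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        groups[shard_for_key(row[key_field], num_shards)].append(row)
    return dict(groups)


if __name__ == "__main__":
    data = [{"id": str(i), "user_name": f"user{i}"} for i in range(1, 11)]
    for shard, members in sorted(group_rows_by_shard(data, "id", 2).items()):
        print(shard, [r["id"] for r in members])
```

In the plugin itself this would presumably mean repartitioning the Dataset on the key before foreachPartition, so that each partition holds rows for one shard only, and then using the computed index in place of randomShard.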