Throttling Spark writes to Elasticsearch

Contents

1. Bulk writes from Spark to ES
2. Writing to Elasticsearch from Java Spark
3. Extending the es_hadoop source
- 1. MyEsSparkSQL
- 2. MyEsDataFrameWriter

1. Bulk writes from Spark to ES

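When a Spark job needs to write to ES, we normally use ES_Hadoop. Check the official documentation and pick the version that matches your setup. If you use several of the integrations (Hive, Spark, and so on), you can depend on the all-in-one artifact directly: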

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>7.1.1</version>
</dependency>

We only use Spark here (Spark 2.3, Scala 2.11, Elasticsearch 7.1.1), so pulling in the Spark-specific artifact is enough:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>7.1.1</version>
</dependency>

2. Writing to Elasticsearch from Java Spark

The Java code that writes to ES can look like this:

@Data
public class UserProfileRecord {
    public String uid;
    public String want_val;
}

// Connection settings and bulk size for es_hadoop; "uid" becomes the document _id.
SparkConf sparkConf = new SparkConf()
        .setAppName(JOB_NAME)
        .set(ConfigurationOptions.ES_NODES, esHost)
        .set(ConfigurationOptions.ES_PORT, esPort)
        .set(ConfigurationOptions.ES_NET_HTTP_AUTH_USER, esUser)
        .set(ConfigurationOptions.ES_NET_HTTP_AUTH_PASS, esPass)
        .set(ConfigurationOptions.ES_BATCH_SIZE_ENTRIES, "500")
        .set(ConfigurationOptions.ES_MAPPING_ID, "uid");
SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
Dataset<Row> wantedCols = sparkSession.read().parquet(path);

// Convert each partition of rows into UserProfileRecord beans.
Dataset<UserProfileRecord> searchUserProfile = wantedCols.mapPartitions(
        new MapPartitionsFunction<Row, UserProfileRecord>() {
            @Override
            public Iterator<UserProfileRecord> call(Iterator<Row> input) throws Exception {
                List<UserProfileRecord> cleanProfileList = new LinkedList<>();
                while (input.hasNext()) {
                    UserProfileRecord aRecord = new UserProfileRecord();
                    .........
                    cleanProfileList.add(aRecord);
                }
                return cleanProfileList.iterator();
            }
        }, Encoders.bean(UserProfileRecord.class));

// Three partitions means three concurrent writer tasks.
EsSparkSQL.saveToEs(searchUserProfile.repartition(3), this.writeIndex);

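Because the ES cluster currently has only 3 nodes, I used repartition to cut the number of tasks writing to ES down to 3 and ease the pressure on the cluster. In practice, the write rate on the primary shards averaged about 30,000 documents per second. When the job produces a large amount of data, however, the write takes a long time and still has a noticeable impact on the cluster, causing some queries to time out.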
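I went through a lot of the official documentation and found that very little can be tuned: the usual advice is to adjust the number of partitions and ConfigurationOptions.ES_BATCH_SIZE_ENTRIES to throttle the write rate. I tried various combinations, with little effect.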
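To make the two knobs concrete, here is a minimal sketch of how they interact. It assumes that each writer task buffers its own bulk request, so the documents sent in one round of bulk requests are bounded by the number of concurrent tasks times es.batch.size.entries. The class BulkLoadEstimate and its methods are hypothetical helpers made up for illustration, not part of es_hadoop. The setting keys are the string forms of es_hadoop's batch options, and the values are only examples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it does arithmetic on settings
// and does not talk to Spark or Elasticsearch.
class BulkLoadEstimate {

    // Upper bound on documents sent in one round of bulk requests,
    // assuming every task flushes a full batch at the same time.
    static long maxDocsPerBulkRound(int concurrentTasks, int batchSizeEntries) {
        return (long) concurrentTasks * batchSizeEntries;
    }

    // Batch-related settings as plain strings, in the form they could be
    // passed to SparkConf.set(key, value). Values are illustrative.
    static Map<String, String> batchSettings(int batchSizeEntries) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("es.batch.size.entries", String.valueOf(batchSizeEntries));
        settings.put("es.batch.size.bytes", "1mb");        // a flush happens when either limit is hit
        settings.put("es.batch.write.refresh", "false");   // skip the index refresh after bulk writes
        settings.put("es.batch.write.retry.count", "3");   // retries when ES rejects a bulk request
        settings.put("es.batch.write.retry.wait", "10s");  // pause between those retries
        return settings;
    }

    public static void main(String[] args) {
        // 3 partitions and 500 entries per batch, as in the job above.
        System.out.println(maxDocsPerBulkRound(3, 500));
        batchSettings(500).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

These settings bound the size of each request but not how often requests are sent. As long as ES keeps accepting bulk requests, the tasks keep sending them back to back, which would be consistent with tuning them having had so little effect.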
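I originally wanted to send REST requests directly with Elasticsearch's Java client, since that would let me control the write rate myself. But after skimming the es_hadoop source, I saw that it maintains a mapping between Spark partitions and the index's shards and replicas, and I figured a lot of optimization had already gone into it. (At the time I also thought it used the TCP-based transport client that ES uses for internal communication. As far as I know, es_hadoop actually talks to ES over the REST/HTTP interface, so its advantage is the shard-aware routing rather than the protocol.) So I decided to stay with es_hadoop, which left no option but to read the source and modify it.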
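The kind of rate control I had in mind can be sketched as a small pacing throttle: before each bulk request is sent, the writer reserves as many permits as the batch has documents and sleeps if it is running ahead of the target rate. This is only a sketch of the general technique. The class SimpleThrottle and its methods are hypothetical names; it is not es_hadoop API and not necessarily how the extension described later does it.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical pacing throttle for illustration: spaces out batches so that,
// on average, no more than permitsPerSecond documents are sent per second.
class SimpleThrottle {
    private final double permitsPerSecond;
    // Earliest time (in nanoseconds) at which the next batch may start.
    private long nextFreeNanos = Long.MIN_VALUE;

    SimpleThrottle(double permitsPerSecond) {
        if (permitsPerSecond <= 0) {
            throw new IllegalArgumentException("permitsPerSecond must be positive");
        }
        this.permitsPerSecond = permitsPerSecond;
    }

    // Reserves `permits` documents at time `nowNanos` and returns how long
    // the caller should wait before sending them. Kept separate from the
    // sleeping so it can be tested with explicit timestamps.
    synchronized long reserve(int permits, long nowNanos) {
        long start = Math.max(nextFreeNanos, nowNanos);
        long waitNanos = start - nowNanos;
        long costNanos = (long) (permits / permitsPerSecond * 1_000_000_000L);
        nextFreeNanos = start + costNanos;
        return waitNanos;
    }

    // Blocks the calling thread until the reserved batch may be sent.
    void acquire(int permits) throws InterruptedException {
        long waitNanos = reserve(permits, System.nanoTime());
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleThrottle throttle = new SimpleThrottle(1000.0); // 1000 docs per second
        for (int batch = 0; batch < 3; batch++) {
            throttle.acquire(500); // call before sending each 500-document bulk request
            System.out.println("batch " + batch + " may be sent");
        }
    }
}
```

A throttle like this lives inside one JVM, so it limits each writer independently. With 3 writer tasks running in separate executors, the cluster-wide rate is roughly 3 times the per-task rate, and the per-task limit has to be chosen with that in mind.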

3. Extending the es_hadoop source

I added two Scala files (going with Scala regardless