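Dirty Data Handling in FlinkX

When transferring large volumes of data, some records will inevitably fail for one reason or another (a type conversion error, for example). DataX regards such records as dirty data.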

– by DataX

Configuration example

"dirty": {
  "path": "/tmp",
  "hadoopConfig": {
    "fs.default.name": "hdfs://flinkhadoop:8020",
    "dfs.nameservices": "ns1",
    "dfs.ha.namenodes.ns1": "flinkhadoop",
    "dfs.namenode.rpc-address.ns1.nn1": "hdfs://flinkhadoop:8020",
    "dfs.ha.automatic-failover.enabled": "true",
    "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
    "fs.hdfs.impl.disable.cache": "true"
  }
}

My HDFS here is a single-node deployment.

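Implementation logic

FlinkX handles dirty data on the write side. By definition, anything that can be read from the source counts as a valid value as far as reading is concerned. When writing, however, a value that was read successfully may turn out to be flawed by the standards of the target, and that is what makes it dirty.

The dirty data manager plays the central role in this process, so let's first look at how DirtyDataManager is defined and initialized: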

DirtyDataManager

Overview

Creation

public DirtyDataManager(String path, Map<String, Object> configMap, String[] fieldNames, String jobId) {
    this.fieldNames = fieldNames;
    location = path + "/" + UUID.randomUUID() + ".txt";
    this.config = configMap;
    this.jobId = jobId;
}

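As you can see, location is the configured path plus a UUID-named .txt file. In practice this makes the files rather inconvenient to find and inspect, because nothing in the name tells you which job or run a file belongs to.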
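Since a bare UUID makes the files hard to tell apart, one option would be a name that also carries the job id and a timestamp. The following is a minimal sketch of that idea and not FlinkX code: the class DirtyPathSketch, the method buildLocation, the subtaskIndex parameter, and the directory layout are all hypothetical, introduced here purely for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

/** Hypothetical helper (not part of FlinkX): builds a more readable dirty-file location. */
public class DirtyPathSketch {

    private static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /**
     * Assumed layout: <path>/<jobId>/<timestamp>_<subtaskIndex>_<uuid>.txt
     * The UUID is kept so that parallel writers never collide.
     */
    public static String buildLocation(String path, String jobId, int subtaskIndex, LocalDateTime now) {
        // Drop a trailing slash so the result never contains "//".
        String base = path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
        return base + "/" + jobId + "/" + STAMP.format(now) + "_" + subtaskIndex + "_"
                + UUID.randomUUID() + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(buildLocation("/tmp", "job-1", 0, LocalDateTime.now()));
    }
}
```

With a layout like this, all files of one job end up in one directory and sort by creation time, while the UUID suffix still guarantees uniqueness.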

Initialization

public void open() {
    try {
        FileSystem fs = FileSystemUtil.getFileSystem(config, null);
        Path path = new Path(location);
        stream = fs.create(path, true);
    } catch (Exception e) {
        throw new RuntimeException("Open dirty manager error", e);
    }
}

// -------------------- FileSystem ---------------------------
public static FileSystem getFileSystem(Map<String, Object> hadoopConfigMap, String defaultFs) throws Exception {
    if(isOpenKerberos(hadoopConfigMap)){
        return getFsWithKerberos(hadoopConfigMap, defaultFs);
    }

    Configuration conf = getConfiguration(hadoopConfigMap, defaultFs);
    setHadoopUserName(conf);

    return FileSystem.get(getConfiguration(hadoopConfigMap, defaultFs));
}

FlinkX stores its dirty data on Hadoop (HDFS).

Writing dirty data

public String writeData(Row row, WriteRecordException ex) {
    String content = RowUtil.rowToJson(row, fieldNames);
    String errorType = retrieveCategory(ex);
    String line = StringUtils.join(new String[]{content, errorType, gson.toJson(ex.toString()), DateUtil.timestampToString(new Date())}, FIELD_DELIMITER);
    try {
        stream.write(line.getBytes(StandardCharsets.UTF_8));
        stream.write(LINE_DELIMITER.getBytes(StandardCharsets.UTF_8));
        DFSOutputStream dfsOutputStream = (DFSOutputStream) stream.getWrappedStream();
        dfsOutputStream.hsync(syncFlags);
        return errorType;
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private String retrieveCategory(WriteRecordException ex) {
    Throwable cause = ex.getCause();
    if(cause instanceof NullPointerException) {
        return ERR_NULL_POINTER;
    }
    for(String keyword : PRIMARY_CONFLICT_KEYWORDS) {
        if(cause.toString().toLowerCase().contains(keyword)) {
            return ERR_PRIMARY_CONFLICT;
        }
    }
    return ERR_FORMAT_TRANSFORM;
}

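Get the content of the dirty record
Determine the error type: NPE, duplicate primary key, or anything else (collectively treated as a conversion error)
Join the record content and the error cause, using \u0001 as the field delimiter
Write the joined line to HDFS as UTF-8, followed by a \n line terminator
Flush with hsync, using the UPDATE_LENGTH sync flag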
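The classification and line-building steps above can be sketched in plain Java without any Hadoop or Flink dependency. This is a simplified illustration, not the FlinkX implementation: the class name DirtyLineSketch is hypothetical, the keyword list and the "npe" and "duplicate" labels are assumptions (only "conversion" can be seen in the sample output later in this post), the gson quoting of the message is omitted, and the null check on the cause is an extra guard added in this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified, dependency-free sketch of the classify-and-join steps (not FlinkX code). */
public class DirtyLineSketch {

    public static final String FIELD_DELIMITER = "\u0001"; // the U+0001 control character
    public static final String LINE_DELIMITER = "\n";

    // "conversion" matches the sample output; the other two labels are assumed.
    public static final String ERR_NULL_POINTER = "npe";
    public static final String ERR_PRIMARY_CONFLICT = "duplicate";
    public static final String ERR_FORMAT_TRANSFORM = "conversion";

    // Assumed keyword list; the real PRIMARY_CONFLICT_KEYWORDS may differ.
    private static final List<String> PRIMARY_CONFLICT_KEYWORDS =
            Arrays.asList("duplicate entry", "unique constraint");

    /** Classifies an error by inspecting its cause, in the same order as the original method. */
    public static String retrieveCategory(Throwable ex) {
        Throwable cause = ex.getCause();
        if (cause == null) {
            return ERR_FORMAT_TRANSFORM; // guard added here: no cause, fall back to the default
        }
        if (cause instanceof NullPointerException) {
            return ERR_NULL_POINTER;
        }
        String text = cause.toString().toLowerCase();
        for (String keyword : PRIMARY_CONFLICT_KEYWORDS) {
            if (text.contains(keyword)) {
                return ERR_PRIMARY_CONFLICT;
            }
        }
        return ERR_FORMAT_TRANSFORM;
    }

    /** Joins content, error type, message and timestamp into one delimited, terminated line. */
    public static String buildLine(String rowJson, Throwable ex, String timestamp) {
        return String.join(FIELD_DELIMITER, rowJson, retrieveCategory(ex), ex.toString(), timestamp)
                + LINE_DELIMITER;
    }
}
```

Note that the category is decided purely by string matching on the cause, so any error that is neither an NPE nor a recognized key conflict is reported as a conversion error.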

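The semantics of hsync are as follows: all of the client's data has been sent to every DataNode that holds a replica, and each of those DataNodes has completed the equivalent of a POSIX fsync call. In other words, the operating system has flushed the data to disk (although the disk itself may still buffer it). Note that a call to hsync only flushes the current block; to have every block flushed to disk, the stream must be created with the sync flag (CreateFlag.SYNC_BLOCK).

UPDATE_LENGTH: in addition to syncing to the DataNodes, also update the metadata (block length) in the NameNode.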
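For readers less familiar with fsync, the closest local-filesystem analogue in the Java standard library is FileChannel.force. The sketch below is only an analogy and does not touch HDFS: the class LocalSyncSketch and the method appendAndSync are hypothetical names, and force(true) on a local file is merely a loose counterpart of hsync with UPDATE_LENGTH, in the sense that both ask for file metadata such as the length to be made durable along with the content.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Local-file analogy for "write a line, then sync it" (hypothetical helper, not HDFS code). */
public class LocalSyncSketch {

    /** Appends one line as UTF-8, forces it to the storage device, and returns the file size. */
    public static long appendAndSync(Path file, String line) {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buffer = ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8));
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // true = also force metadata updates (e.g. file length), not only the content.
            channel.force(true);
            return channel.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Syncing after every single record, as writeData does, favors durability over throughput; that trade-off is reasonable when dirty records are expected to be rare.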

When dirty data is written

Dirty records are caught while each row is being written, in writeSingleRecord:

protected void writeSingleRecord(Row row) {
    if(errorLimiter != null) {
        errorLimiter.acquire();
    }

    try {
        writeSingleRecordInternal(row);

        if(!restoreConfig.isRestore() || isStreamButNoWriteCheckpoint()){
            numWriteCounter.add(1);
            snapshotWriteCounter.add(1);
        }
    } catch(WriteRecordException e) {
        // record the error with the error limiter
        saveErrorData(row, e);
        // update the metrics and persist the dirty record
        updateStatisticsOfDirtyData(row, e);
        // increment the total record count
        numWriteCounter.add(1);
        snapshotWriteCounter.add(1);

        if(dirtyDataManager == null && errCounter.getLocalValue() % LOG_PRINT_INTERNAL == 0){
            LOG.error(e.getMessage());
        }
        if(DtLogger.isEnableTrace()){
            LOG.trace("write error row, row = {}, e = {}", row.toString(), ExceptionUtil.getErrorMessage(e));
        }
    }
}

private void updateStatisticsOfDirtyData(Row row, WriteRecordException e){
    if(dirtyDataManager != null) {
        String errorType = dirtyDataManager.writeData(row, e);
        if (ERR_NULL_POINTER.equals(errorType)){
            nullErrCounter.add(1);
        } else if(ERR_FORMAT_TRANSFORM.equals(errorType)){
            conversionErrCounter.add(1);
        } else if(ERR_PRIMARY_CONFLICT.equals(errorType)){
            duplicateErrCounter.add(1);
        } else {
            otherErrCounter.add(1);
        }
    }
}

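This differs somewhat from how I would have structured it: why is the dirty record persisted inside a method whose name says it updates statistics?

The two concerns should be separated. Moving dirtyDataManager.writeData(row, e) into the preceding saveErrorData method would probably be a better fit.

See https://github.com/DTStack/flinkx/issues/220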
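To make the suggestion concrete, here is a minimal sketch of what the separation could look like, with persistence in one method and counting in another. It is my own illustration under simplifying assumptions, not a patch against FlinkX: DirtyHandlingSketch and DirtySink are hypothetical names, rows are plain strings, and a map stands in for the individual Flink counters.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of splitting "persist the dirty record" from "update statistics" (not FlinkX code). */
public class DirtyHandlingSketch {

    /** Hypothetical stand-in for DirtyDataManager: persists a record and returns its error type. */
    public interface DirtySink {
        String writeData(String row, Exception e);
    }

    private final DirtySink sink; // may be null when no dirty-data storage is configured
    private final Map<String, Long> counters = new HashMap<>();

    public DirtyHandlingSketch(DirtySink sink) {
        this.sink = sink;
    }

    /** Persistence only: stores the record and hands back the error type, or null if no sink. */
    String saveErrorData(String row, Exception e) {
        return sink == null ? null : sink.writeData(row, e);
    }

    /** Statistics only: bumps the counter for the given error type. */
    void updateStatistics(String errorType) {
        if (errorType != null) {
            counters.merge(errorType, 1L, Long::sum);
        }
    }

    /** What the catch block would call: persist first, then count. */
    public void onWriteError(String row, Exception e) {
        updateStatistics(saveErrorData(row, e));
    }

    public long count(String errorType) {
        return counters.getOrDefault(errorType, 0L);
    }
}
```

With this split, the statistics method no longer has a hidden I/O side effect, and each half can be tested on its own.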

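Dirty data test

Dirty data file example

Based on the dirty configuration, FlinkX initializes the Hadoop connection and creates the corresponding file. Here the configured path is /tmp/flinkx/bond_info_mongodb_to_mysql and the job is configured with four processors, so four files appear under that path on HDFS.

I think the project should do some work on where a job's dirty data files are stored.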

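Simulating dirty data

To simulate dirty data, we change a column of the target MySQL table from string to bigint, which is certain to cause a conversion error.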

{"bond_name":"xxxx","bond_stop_time":"xxx","bond_time_limit":"xx","bond_type":"xxx","plan_issued_quantity":"xx","publish_expire_time":"xxx","publish_time":"xx","publisher_name":"xxx","real_issued_quantity":"14","start_cal_interest_time":"xx","inst_code":"x":"x","city_code":"x","area_code":"x","input_date":x,"input_time":x}conversion"com.dtstack.flinkx.exception.WriteRecordException: Incorrect integer value: 'xxxx' for column 'bond_type' at row 1\njava.sql.SQLException: Incorrect integer value: 'xxxx' for column 'bond_type' at row 1"2020-05-24 17:33:15

As you can see, this is a type conversion error. FlinkX stores both the offending record and the error cause, separated by \u0001.
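Because \u0001 is a non-printing character, the fields run together when the file is viewed as text. Reading a line back is straightforward, though. Below is a minimal sketch of a parser, assuming the four-field order used by writeData (content, error type, message, timestamp); the class DirtyLineParserSketch and its DirtyRecord type are hypothetical names, and the message field is returned as stored, without undoing the gson quoting.

```java
/** Sketch of reading back one dirty-data line (hypothetical helper, not FlinkX code). */
public class DirtyLineParserSketch {

    /** The four fields of a dirty-data line, in the order writeData joins them. */
    public static final class DirtyRecord {
        public final String content;
        public final String errorType;
        public final String message;
        public final String timestamp;

        DirtyRecord(String content, String errorType, String message, String timestamp) {
            this.content = content;
            this.errorType = errorType;
            this.message = message;
            this.timestamp = timestamp;
        }
    }

    /** Splits a line on the U+0001 delimiter after removing a trailing line terminator. */
    public static DirtyRecord parse(String line) {
        String trimmed = line.endsWith("\n") ? line.substring(0, line.length() - 1) : line;
        // Limit -1 keeps empty trailing fields instead of silently dropping them.
        String[] parts = trimmed.split("\u0001", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 fields but found " + parts.length);
        }
        return new DirtyRecord(parts[0], parts[1], parts[2], parts[3]);
    }
}
```

One caveat of this simple approach: it assumes the record content itself never contains the delimiter character.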

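Summary

This post explained what dirty data is, walked through how FlinkX handles it, and showed a test run with sample output. Along the way it also touched on HDFS hsync and basic HDFS configuration.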