1. Flink fails to start

Check the JobManager log:

WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@t-sha1-flk-01:6123/), Path(/user/jobmanager)]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
    at java.lang.Thread.run(Thread.java:748)

Solution: the hostnames in /etc/hosts are all lowercase, but the hostnames in the Flink configuration files (flink-conf.yaml, masters, slaves) were written in uppercase. Change the hostnames in the Flink configuration files to lowercase, or use IP addresses instead.
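For reference, a sketch of the files involved using consistent lowercase names (t-sha1-flk-01 is taken from the log above; the IP address and the worker hostnames are placeholders):

    # /etc/hosts
    192.168.0.11  t-sha1-flk-01

    # conf/flink-conf.yaml
    jobmanager.rpc.address: t-sha1-flk-01

    # conf/masters
    t-sha1-flk-01:8081

    # conf/slaves
    t-sha1-flk-02
    t-sha1-flk-03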

2. Exits after running for a while

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 4 for operator Compute By Event Time (1/12).}
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 4 for operator Compute By Event Time (1/12).
    ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=7061809 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
    ... 5 more
    Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
        at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
        ... 5 more
    Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=7061809 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
        at org.apache.flink.runtime.state.StateUtil.discardStateFuture(StateUtil.java:85)
        at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:88)
        ... 7 more
    Caused by: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=7061809 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.
        at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory.checkSize(MemCheckpointStreamFactory.java:64)
        at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetBytes(MemCheckpointStreamFactory.java:144)
        at org.apache.flink.runtime.state.memory.MemCheckpointStreamFactory$MemoryCheckpointOutputStream.closeAndGetHandle(MemCheckpointStreamFactory.java:125)
        at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:351)
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend$1.performOperation(HeapKeyedStateBackend.java:329)
        at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.flink.runtime.state.heap.HeapKeyedStateBackend.snapshot(HeapKeyedStateBackend.java:372)
        at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:397)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1162)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1094)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:654)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:590)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:543)
        at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:378)
        at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:281)
        at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:183)
        at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:213)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
        ... 1 more
[CIRCULAR REFERENCE: java.io.IOException: Size of the state is larger than the maximum permitted memory-backed state. Size=7061809 , maxSize=5242880 . Consider using a different state backend, like the File System State backend.]

Solution:

State is stored in memory by default and its size is limited there; change checkpoint storage to HDFS:

state.backend.fs.checkpointdir: hdfs://t-sha1-flk-01:9000/flink-checkpoints
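The directory setting works together with the backend type: if state.backend is still the default memory backend, the directory alone should not be enough, so a sketch of the relevant flink-conf.yaml lines (reusing the HDFS path above) would be:

    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs://t-sha1-flk-01:9000/flink-checkpoints

The backend can also be chosen per job, e.g. env.setStateBackend(new FsStateBackend("hdfs://t-sha1-flk-01:9000/flink-checkpoints")).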

3. Repeated restarts after running for a long time

AsynchronousException{java.lang.Exception: Could not materialize checkpoint 1488 for operator Compute By Event Time -> (MonitorData, MonitorDataMapping, MonitorSamplingData) (6/6).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:948)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 1488 for operator Compute By Event Time -> (MonitorData, MonitorDataMapping, MonitorSamplingData) (6/6).
... 6 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /flink-checkpoints/8c274785f1ab027e6146a59364be645f/chk-1488/2c612f30-c57d-4ede-9025-9554ca11fd12 could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1628)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3121)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3045)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:493)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)

Check the HDFS log:

WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 2 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK], removed=[DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})

This Flink setup uses HDFS as the checkpoint storage. When Flink was restarted, the old checkpoints were no longer needed, so I deleted them by hand; I am not sure whether that is related to this error. To stop the exceptions I restarted Flink and HDFS, and after the restart the errors no longer appeared.

However, while checking the HDFS logs I noticed the following warning (a path not specified as a URI):

WARN org.apache.hadoop.hdfs.server.common.Util: Path /mnt/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration

It turned out to be a misconfiguration: in hdfs-site.xml,

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/mnt/hadoop/dfs/name</value>
    </property>

should be changed to:

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/mnt/hadoop/dfs/name</value>
    </property>

With that, the problem was resolved; the root cause appears to have been this incorrect hdfs-site.xml setting.

4. Unable to load native-hadoop library for your platform

When Flink starts, the following warning sometimes appears:

WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Reference 1: http://blog.csdn.net/jack85986370/article/details/51902871

Suggested fix: edit /etc/profile and add

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

This did not solve the problem.

5. hadoop checknative -a

WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop:  true /usr/hadoop-2.7.3/lib/native/libhadoop.so.1.0.0
zlib:    true /lib64/libz.so.1
snappy:  false
lz4:     true revision:99
bzip2:   false
openssl: false Cannot load libcrypto.so (libcrypto.so: cannot open shared object file: No such file or directory)!
INFO util.ExitUtil: Exiting with status 1

Reference: http://blog.csdn.net/zhangzhaokun/article/details/50951238

Solution:

    cd /usr/lib64/
    ln -s libcrypto.so.1.0.1e libcrypto.so

6. TaskManager exits

After Flink had been running for a while, a TaskManager exited. I captured a heap dump of the TaskManager with jvisualvm and analyzed it with MAT; the result was:

One instance of "org.apache.flink.runtime.io.network.buffer.NetworkBufferPool" loaded by "sun.misc.Launcher$AppClassLoader @ 0x6c01de310" occupies 403,429,704 (76.24%) bytes. The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

Keywords:
sun.misc.Launcher$AppClassLoader @ 0x6c01de310
java.lang.Object[]
org.apache.flink.runtime.io.network.buffer.NetworkBufferPool

So the network buffer pool was running out of memory. I then found this issue:

https://issues.apache.org/jira/browse/FLINK-4536

It was very close to my situation: InfluxDB was also being used as a sink. Closing the InfluxDB client in the sink's close() method (see the sketch below) solved the problem.
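A minimal sketch of that fix, assuming the influxdb-java client; the connection details, database name, and class names are illustrative rather than the exact code of the original job, and close() on the client requires a client version that provides it:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.influxdb.InfluxDB;
    import org.influxdb.InfluxDBFactory;
    import org.influxdb.dto.Point;

    public class InfluxDbSink extends RichSinkFunction<Point> {

        private transient InfluxDB influxDB;

        @Override
        public void open(Configuration parameters) throws Exception {
            // One client per parallel sink instance.
            influxDB = InfluxDBFactory.connect("http://influxdb-host:8086", "user", "password");
        }

        @Override
        public void invoke(Point point) throws Exception {
            influxDB.write("mydb", "autogen", point);
        }

        @Override
        public void close() throws Exception {
            // The actual fix: release the client when the task shuts down or restarts,
            // so resources are not leaked across restarts.
            if (influxDB != null) {
                influxDB.close();
            }
            super.close();
        }
    }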

In addition, $FLINK_HOME/conf/flink-conf.yaml has a setting for the TaskManager network stack, which I have not adjusted yet:

# The number of buffers for the network stack.
#
# taskmanager.network.numberOfBuffers: 2048
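For sizing reference: each network buffer is one memory segment (taskmanager.memory.segment-size, 32 KB by default on this Flink version), so the default of 2048 buffers corresponds to roughly 2048 × 32 KB ≈ 64 MB of TaskManager memory, and raising numberOfBuffers increases that footprint proportionally.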

7. Kafka partition leader switch causes Flink to restart

Symptoms:

7.1 Flink restarts; the log shows:

java.lang.Exception: Failed to send data to Kafka: This server is not the leader for that topic-partition.
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.checkErroneous(FlinkKafkaProducerBase.java:373)
    at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.invoke(FlinkKafkaProducerBase.java:280)
    at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:41)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:206)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.

7.2 The Kafka controller log shows:

INFO [SessionExpirationListener on 10], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)

7.3 Set the retries parameter

References: http://colabug.com/122248.html and the Kafka documentation on producer configs (http://kafka.apache.org/082/documentation.html#producerconfigs).

With retries set, Flink no longer restarts when a Kafka partition leader switch happens; the producer retries up to 3 times instead:

    kafkaProducerConfig {
        "bootstrap.servers": "192.169.2.20:9093,192.169.2.21:9093,192.169.2.22:9093"
        "retries": 3
    }
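Passed to the Flink Kafka sink, the same settings might look like the sketch below; the topic name and the String schema are assumptions, FlinkKafkaProducer08 is used only as an example (pick the connector matching your Kafka version), and the broker list is the one above:

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer08;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaSinkExample {

        public static void addKafkaSink(DataStream<String> stream) {
            Properties producerConfig = new Properties();
            producerConfig.setProperty("bootstrap.servers",
                    "192.169.2.20:9093,192.169.2.21:9093,192.169.2.22:9093");
            // Retry instead of failing the job when a partition leader switches.
            producerConfig.setProperty("retries", "3");

            stream.addSink(new FlinkKafkaProducer08<>(
                    "monitor-data",            // hypothetical topic name
                    new SimpleStringSchema(),  // serialize String records as UTF-8 bytes
                    producerConfig));
        }
    }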

Reposted from: https://www.cnblogs.com/liugh/p/7533194.html
