The Spark "bytes consumed error" problem
The full error output is as follows:
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.9 sec].
Adding annotator lemma
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: bytes consumed error!
	at edu.umd.cloud9.collection.XMLInputFormat$XMLRecordReader.nextKeyValue(XMLInputFormat.java:170)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
	at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:176)
	at AssembleDocumentTermMatrix.documentTermMatrix(AssembleDocumentTermMatrix.scala:133)
	at RunLSA$.main(RunLSA.scala:48)
	at RunLSA.main(RunLSA.scala)
Caused by: java.lang.RuntimeException: bytes consumed error!
	at edu.umd.cloud9.collection.XMLInputFormat$XMLRecordReader.nextKeyValue(XMLInputFormat.java:170)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator lemma
Process finished with exit code 1
The error above came up while running the LSA example in Chapter 6 of Advanced Analytics with Spark (the `RunLSA` program in the stack trace).
Solution:
The root cause is that the tags in the XML input file are unbalanced.
For example, if the file contains six `<page>` tags but seven `</page>` tags, XMLInputFormat throws the error above; once the tags are balanced, the job runs fine.
The mismatch crept in because, to make debugging faster, we cut a small sample out of the original large dataset but forgot to carry over the closing portion of the original XML. The resulting file was therefore structurally incomplete: not every opening tag had a matching closing tag.
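Before rerunning the job, it is worth checking the truncated dump for exactly this kind of imbalance. A minimal sketch (the function name and the inline sample are hypothetical, not from the book's code) that counts opening versus closing occurrences of a tag:

```python
import re

def tag_balance(xml_text, tag="page"):
    """Return (opening, closing) tag counts for `tag` in an XML string."""
    # Match "<page>" or "<page attr=...>", but not "</page>".
    opens = len(re.findall(r"<%s[\s>]" % re.escape(tag), xml_text))
    closes = xml_text.count("</%s>" % tag)
    return opens, closes

# Demo on a deliberately truncated snippet: one <page> was cut off,
# leaving a stray closing tag -- the same shape that triggers the error.
sample = "<page><title>a</title></page></page>"
print(tag_balance(sample))  # -> (1, 2)
```

Against the real dump you would read the whole file first (e.g. `tag_balance(open(path, encoding="utf-8").read())`); when the two counts differ, trim or pad the sample so every `<page>` has its `</page>` before handing the file to XMLInputFormat.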