The full error output is as follows:
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
edu.stanford.nlp.pipeline.AnnotatorImplementations:
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.9 sec].
Adding annotator lemma
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.RuntimeException: bytes consumed error!
    at edu.umd.cloud9.collection.XMLInputFormat$XMLRecordReader.nextKeyValue(XMLInputFormat.java:170)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1162)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:176)
    at AssembleDocumentTermMatrix.documentTermMatrix(AssembleDocumentTermMatrix.scala:133)
    at RunLSA$.main(RunLSA.scala:48)
    at RunLSA.main(RunLSA.scala)
Caused by: java.lang.RuntimeException: bytes consumed error!
    at edu.umd.cloud9.collection.XMLInputFormat$XMLRecordReader.nextKeyValue(XMLInputFormat.java:170)
    ... (executor-side frames identical to the trace above)
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Adding annotator lemma

Process finished with exit code 1

The error above appeared while running the LSA example (RunLSA) from chapter 6 of Advanced Analytics with Spark (《Spark高级数据分析》).
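
For context: the chapter loads the Wikipedia XML dump through Cloud9's XMLInputFormat, which splits the file on <page>...</page> boundaries, and that reader is what throws the error. A minimal sketch of such a setup, assuming a local file named minidump.xml (the file name and SparkSession wiring are placeholders, not the book's exact code):

    import edu.umd.cloud9.collection.XMLInputFormat
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RunLSA").master("local[*]").getOrCreate()

    // One record per <page>...</page> block; an unpaired tag makes
    // XMLRecordReader's bytes-consumed sanity check fail with the error above.
    val conf = new Configuration()
    conf.set(XMLInputFormat.START_TAG_KEY, "<page>")
    conf.set(XMLInputFormat.END_TAG_KEY, "</page>")

    val pages = spark.sparkContext.newAPIHadoopFile(
        "minidump.xml",        // placeholder path to the truncated dump
        classOf[XMLInputFormat],
        classOf[LongWritable],
        classOf[Text],
        conf)
      .map(_._2.toString)      // the raw XML of one page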

Solution:

The cause is simply that the tags in the XML file we fed in are unbalanced. For example, six <page> tags against seven </page> tags is enough to make the record reader's bytes-consumed check fail with the error above; making every tag pair up fixes it.
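
Before rerunning the Spark job, it is quick to confirm the diagnosis by counting opening and closing tags. A small sketch in Scala (the file name is a placeholder):

    import scala.io.Source

    // Count occurrences of `tag` in one line; tags like <page> cannot
    // overlap themselves, so a sliding window is enough.
    def countTag(line: String, tag: String): Int =
      line.sliding(tag.length).count(_ == tag)

    val src = Source.fromFile("minidump.xml")   // the truncated dump
    val (opens, closes) = src.getLines().foldLeft((0, 0)) {
      case ((o, c), line) =>
        (o + countTag(line, "<page>"), c + countTag(line, "</page>"))
    }
    src.close()
    println(s"<page>: $opens  </page>: $closes") // unequal counts reproduce the error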

The mismatch arose because, for debugging, we had cut a small slice out of the original large dataset but forgot to carry over the tail of the original XML, so the new file was structurally incomplete (that is, its tags no longer all came in matching pairs). A safer way to cut such a sample is sketched below.
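
To slice off a debug-sized sample without reintroducing the problem, copy only whole <page> blocks and then close the root element by hand. A sketch assuming a Wikipedia-style dump whose root tag is <mediawiki> and whose <page>/</page> tags sit on their own lines; the file names and page count are placeholders:

    import java.io.PrintWriter
    import scala.io.Source

    val maxPages = 1000                         // how many pages to keep
    val in  = Source.fromFile("enwiki-dump.xml") // the full dump
    val out = new PrintWriter("minidump.xml")    // the sample to create

    // Copy the header and the first maxPages complete <page>...</page> blocks.
    var closed = 0
    val lines = in.getLines()
    while (lines.hasNext && closed < maxPages) {
      val line = lines.next()
      out.println(line)
      if (line.contains("</page>")) closed += 1
    }
    out.println("</mediawiki>")  // the tail we forgot the first time
    out.close()
    in.close()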
