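I had used crawlers before, for example running Nutch against a set of seeds and building search on top of the crawled data, and I had skimmed some of its source code. Nutch, of course, handles crawling thoroughly and in great detail. Every time I watched fetched pages and processing logs scroll past on the screen, it felt like black magic. While working through Spring MVC, I decided to build a small crawler of my own. It could be simple, and a few small bugs would not matter; all I needed was something that could crawl the information I wanted from a given seed site. Whenever an exception showed up, I fixed it: sometimes it was API misuse, sometimes an abnormal HTTP status, sometimes a database read/write problem. Through this cycle of hitting exceptions and resolving them, JewelCrawler (named after my son's nickname) became able to crawl data on its own, and it picked up one extra skill: sentiment analysis based on the Word2Vec algorithm.

  There may be more unknown exceptions waiting, and some performance work remains, such as database interaction and data reads and writes. I probably won't have much time for this for the rest of the year, so today I'm writing a short summary. The previous two posts focused on features and results; this one covers how JewelCrawler came to be. The code is now on Github (the link is at the end of the post), so follow it if you're interested. It is for learning and discussion only; please don't use it for anything else, and spare a thought for Douban. A little more sincerity, a little less harm.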

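Environment

  Development tool: IntelliJ IDEA 14

  Database: MySQL 5.5, plus the database management tool Navicat (for connecting to and querying the database)

  Language: Java

  Dependency management: Maven

  Version control: Git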

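Directory structure

  In this layout:

  com.ansj.vec is a Java implementation of the Word2Vec algorithm

  com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages

  Some of these packages are empty because the corresponding modules are not used yet. Of the rest:

    the constants package holds constant classes

    the crawl package holds the crawler entry point

    the entity package holds the entity classes mapped to database tables

    the test package holds test classes

    the utils package holds utility classes

  The resource module holds configuration and resource files, for example:

    beans.xml: the Spring context configuration file

    seed.properties: the seed file

    stopwords.dic: the stop-word dictionary

    comment12031715.txt: the crawled short reviews

    tokenizerResult.txt: the output of tokenizing with IKAnalyzer

    vector.mod: the model data trained with the Word2Vec algorithm

  The test module is for writing unit tests.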

Database configuration

  1. Add the required dependencies

  JewelCrawler is managed with Maven, so it is enough to add the corresponding dependencies to pom.xml

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>commons-pool</groupId>
    <artifactId>commons-pool</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>

  

  2. Declare the data source bean

  The data source bean needs to be declared in beans.xml

<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="${jdbc.driver}"/>
    <property name="url" value="${jdbc.url}"/>
    <property name="username" value="${jdbc.username}"/>
    <property name="password" value="${jdbc.password}"/>
</bean>

Note: this binds the external configuration file jdbc.properties; the actual data source parameters are read from that file.
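  The four ${...} placeholders above only resolve if jdbc.properties defines the matching keys. The following is a minimal sketch, not code from JewelCrawler, that checks a properties text for those four keys using only the JDK. The class name JdbcPropertiesCheck and the sample values (including the database name jewelcrawler) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class JdbcPropertiesCheck {
    // The four keys referenced by the ${...} placeholders in beans.xml.
    public static final List<String> REQUIRED_KEYS =
            Arrays.asList("jdbc.driver", "jdbc.url", "jdbc.username", "jdbc.password");

    /** Parses properties-file text and returns the required keys that are missing or blank. */
    public static List<String> missingKeys(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical values; replace them with your own connection settings.
        String sample = String.join("\n",
                "jdbc.driver=com.mysql.jdbc.Driver",
                "jdbc.url=jdbc:mysql://localhost:3306/jewelcrawler",
                "jdbc.username=root",
                "jdbc.password=secret");
        System.out.println("Missing keys: " + missingKeys(sample));
    }
}
```

  A check like this can run in a unit test before the Spring context starts, so a missing key shows up as a clear message rather than a placeholder resolution failure.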

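  If you run into the error “SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;”, the INSERT is omitting a NOT NULL column that has no default value. Either supply a value for that column, or give it a default (or allow NULL) in the table definition. For a numeric key column, marking it as auto-increment also works, which is how I resolved it.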

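Problems encountered while parsing pages

  Crawled pages have to be parsed as a DOM to extract the data I want. Along the way I hit the following errors

  org.htmlparser.Node cannot be resolved

  Fix: add the jar dependency

<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>

  

  org.apache.http.HttpEntity cannot be resolved

  Fix: add the jar dependency

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

  These were problems met along the way; in the end, page parsing is done with Jsoup.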

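Slow downloads from the Maven repository

  I started with the default Maven Central repository, and jar downloads were very slow; I don't know whether that was my network or something else. I later found the Aliyun Maven mirror online, and after switching, downloads became almost instant by comparison. Highly recommended.

<mirrors>
    <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>

  Locate Maven's settings.xml file and add this mirror to it.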

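One way to read files under the resource module

  For example, reading the seed.properties file

@Test
public void testFile() {
    // Resolve seed.properties from the classpath root and wrap it in a File.
    File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
    System.out.print("===========" + seedFile.length() + "===========");
}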
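  One pitfall with the approach above: URL.getPath() returns the path still percent-encoded, so a project directory containing a space (encoded as %20) yields a File that does not exist. The sketch below, which is not part of JewelCrawler, shows the difference by going through URL.toURI() instead. The class ResourcePathDemo and the temporary file it creates are hypothetical and exist only for this demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResourcePathDemo {
    /** Converts a file: URL into a Path, decoding escapes such as %20. */
    public static Path toPath(URL url) {
        try {
            return Paths.get(url.toURI());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + url, e);
        }
    }

    /** Creates a small file in a temp directory whose name contains a space and returns its URL. */
    public static URL createSampleResource() {
        try {
            Path dir = Files.createTempDirectory("jewel crawler"); // note the space
            Path file = dir.resolve("seed.properties");
            Files.write(file, "seed=placeholder".getBytes(StandardCharsets.UTF_8));
            return file.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL url = createSampleResource();
        System.out.println("getPath(): " + url.getPath());   // still percent-encoded
        System.out.println("toPath():  " + toPath(url));     // decoded, usable as a file
        System.out.println("length:    " + toPath(url).toFile().length());
    }
}
```

  For resources packaged inside a jar there is no file on disk at all, so getResourceAsStream is usually the safer choice there.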

  

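About regular expressions

  When using Java regular expressions (java.util.regex), you must call the matcher's find method before using group to retrieve the matched substring, even when the input does match the Pattern you defined. Calling group directly will not give you the result you want.

  I took a look at the source code of the Matcher class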

package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are stateless,
     * so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * No default constructor.
     */
    Matcher() {
    }

    /**
     * All matchers have the state used by Pattern during a match.
     */
    Matcher(Pattern parent, CharSequence text) {
        this.parentPattern = parent;
        this.text = text;

        // Allocate state storage
        int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
        groups = new int[parentGroupCount * 2];
        locals = new int[parent.localCount];

        // Put fields into initial states
        reset();
    }
....
    /**
     * Returns the input subsequence matched by the previous match.
     *
     * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
     * the expressions <i>m.</i><tt>group()</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
     * are equivalent.  </p>
     *
     * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
     * string.  This method will return the empty string when the pattern
     * successfully matches the empty string in the input.  </p>
     *
     * @return The (possibly empty) subsequence matched by the previous match,
     *         in string form
     *
     * @throws  IllegalStateException
     *          If no match has yet been attempted,
     *          or if the previous match operation failed
     */
    public String group() {
        return group(0);
    }

    /**
     * Returns the input subsequence captured by the given group during the
     * previous match operation.
     *
     * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
     * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
     * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
     * are equivalent.  </p>
     *
     * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
     * to right, starting at one.  Group zero denotes the entire pattern, so
     * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
     * </p>
     *
     * <p> If the match was successful but the group specified failed to match
     * any part of the input sequence, then <tt>null</tt> is returned. Note
     * that some groups, for example <tt>(a*)</tt>, match the empty string.
     * This method will return the empty string when such a group successfully
     * matches the empty string in the input.  </p>
     *
     * @param  group
     *         The index of a capturing group in this matcher's pattern
     *
     * @return  The (possibly empty) subsequence captured by the group
     *          during the previous match, or <tt>null</tt> if the group
     *          failed to match part of the input
     *
     * @throws  IllegalStateException
     *          If no match has yet been attempted,
     *          or if the previous match operation failed
     *
     * @throws  IndexOutOfBoundsException
     *          If there is no capturing group in the pattern
     *          with the given index
     */
    public String group(int group) {
        if (first < 0)
            throw new IllegalStateException("No match found");
        if (group < 0 || group > groupCount())
            throw new IndexOutOfBoundsException("No group " + group);
        if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
            return null;
        return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
    }
    /**
     * Attempts to find the next subsequence of the input sequence that matches
     * the pattern.
     *
     * <p> This method starts at the beginning of this matcher's region, or, if
     * a previous invocation of the method was successful and the matcher has
     * not since been reset, at the first character not matched by the previous
     * match.
     *
     * <p> If the match succeeds then more information can be obtained via the
     * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods.  </p>
     *
     * @return  <tt>true</tt> if, and only if, a subsequence of the input
     *          sequence matches this matcher's pattern
     */
    public boolean find() {
        int nextSearchIndex = last;
        if (nextSearchIndex == first)
            nextSearchIndex++;

        // If next search starts before region, start it at region
        if (nextSearchIndex < from)
            nextSearchIndex = from;

        // If next search starts beyond region then it fails
        if (nextSearchIndex > to) {
            for (int i = 0; i < groups.length; i++)
                groups[i] = -1;
            return false;
        }
        return search(nextSearchIndex);
    }

    /**
     * Initiates a search to find a Pattern within the given bounds.
     * The groups are filled with default values and the match of the root
     * of the state machine is called. The state machine will hold the state
     * of the match as it proceeds in this matcher.
     *
     * Matcher.from is not set here, because it is the "hard" boundary
     * of the start of the search which anchors will set to. The from param
     * is the "soft" boundary of the start of the search, meaning that the
     * regex tries to match at that index but ^ won't match there. Subsequent
     * calls to the search methods start at a new "soft" boundary which is
     * the end of the previous match.
     */
    boolean search(int from) {
        this.hitEnd = false;
        this.requireEnd = false;
        from        = from < 0 ? 0 : from;
        this.first  = from;
        this.oldLast = oldLast < 0 ? from : oldLast;
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        acceptMode = NOANCHOR;
        boolean result = parentPattern.root.match(this, from, text);
        if (!result)
            this.first = -1;
        this.oldLast = this.last;
        return result;
    }
...
}

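  Here is why. If you call group() directly without calling find() first, group() delegates to group(int group), whose body begins with the check if (first < 0). That condition holds, because first is initialized to -1, so an IllegalStateException is thrown. If you call find() first, it ends up calling search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose initial value is 0. Control then moves into the search method

boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from        = from < 0 ? 0 : from;
    this.first  = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}

  nextSearchIndex is passed in as from, and inside the method from is assigned to first. So after a successful find(), first is no longer -1 and group() no longer throws. If the match fails, search resets first to -1, which is why group() should only be called when find() has returned true.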
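  The behaviour described above can be reproduced in a few lines. This is a minimal sketch rather than code from JewelCrawler; the class name FindBeforeGroup, the /subject/<digits> URL shape, and the sample URL are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindBeforeGroup {
    // Assumed URL shape, for illustration only.
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)");

    /** Returns the captured id, or null when the URL does not match. */
    public static String extractSubjectId(String url) {
        Matcher matcher = SUBJECT_ID.matcher(url);
        // find() runs the match and sets the internal state that group() reads.
        if (matcher.find()) {
            return matcher.group(1);
        }
        return null;
    }

    /** Returns true if calling group() without a prior find() throws IllegalStateException. */
    public static boolean groupWithoutFindThrows(String url) {
        Matcher matcher = SUBJECT_ID.matcher(url);
        try {
            matcher.group(1); // no match has been attempted yet, so first is still -1
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/12031715/"; // hypothetical sample
        System.out.println("with find():    " + extractSubjectId(url));
        System.out.println("without find(): throws = " + groupWithoutFindThrows(url));
    }
}
```

  Guarding group() with the boolean result of find() covers both cases: no match attempted yet, and a match that was attempted but failed.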

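  The source code has been uploaded to Github: https://github.com/DMinerJackie/JewelCrawler

  The points above are fairly scattered; they are notes taken while hitting and solving problems. You will likely run into other issues in practice, so questions and suggestions are welcome ^^.

  Finally, a few snapshots of the data crawled so far

  Record table

  It holds 79,032 records, of which 48,471 pages have been crawled

  movie table

  2,964 films and TV titles have been crawled so far

  comments table

  29,711 records have been crawled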


 

Reposted from: https://www.cnblogs.com/bigdataZJ/p/doubanmovie3.html
