Java classical poems: a single-threaded Java crawler for scraping classical Chinese poetry

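Prerequisites

1. HTML, CSS, and the HTML DOM tree

See http://www.w3school.com.cn/htmldom/

2. Using Jsoup: traversing a Document object with DOM methods, selecting elements with selector syntax, and extracting data from elements.

See www.open-open.com/jsoup/example-list-links.htm

3. Java regular expressions and their syntax

See http://www.cnblogs.com/chuiyuan/p/5187359.html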
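As a quick refresher on the regular-expression prerequisite, here is a minimal sketch that pulls href values out of an HTML fragment with java.util.regex. The class name HrefRegexDemo, the method extractHrefs, and the sample HTML are made up for illustration. The crawler itself uses Jsoup selectors for this job, which is the more robust choice for real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexDemo {
    // Matches href="..." and captures the quoted value in group 1.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    /** Returns every double-quoted href value found in the given HTML text. */
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical fragment, not taken from the real site.
        String html = "<span><a href=\"/view_1.aspx\">A</a></span>"
                + "<span><a href=\"/view_2.aspx\">B</a></span>";
        System.out.println(extractHrefs(html));
    }
}
```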

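We start with a single-threaded crawler.

Overall steps:

1. Define Poem, the object that holds the crawled content.

2. Implement HttpService, the module that fetches a Document object from the web.

3. Parse the href of every Tang poem from the Document object and save them in a List.

4. Use the hrefs from step 3 to crawl the content of each poem.

The code is given below.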

1. The Poem object (only its properties are listed)
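The Poem listing itself is not shown in this post. A minimal sketch, assuming plain String fields inferred from the accessors that the crawler code calls later (setCategory, setHref, setTitle, setDynasty, setAuthor, setContent, getHref, getTitle, getCategory, toString). The field types and the toString format are assumptions.

```java
/** Sketch of the Poem data object; fields are inferred from the accessors used by the crawler. */
class Poem {
    private String title;
    private String author;
    private String dynasty;
    private String category;
    private String href;     // absolute URL of the poem's detail page
    private String content;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public String getDynasty() { return dynasty; }
    public void setDynasty(String dynasty) { this.dynasty = dynasty; }

    public String getCategory() { return category; }
    public void setCategory(String category) { this.category = category; }

    public String getHref() { return href; }
    public void setHref(String href) { this.href = href; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    @Override
    public String toString() {
        // Assumed format, only meant for console debugging.
        return "Poem{title=" + title + ", author=" + author + ", dynasty=" + dynasty
                + ", category=" + category + ", href=" + href + ", content=" + content + "}";
    }
}
```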

2. HttpService, the module that fetches the Document object

First, define a Rule that encapsulates every request, whether GET or POST.
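The Rule source is not shown in this post either. A minimal sketch, assuming the constructor argument order matches the calls made later (`new Rule(url, null, null, null, -1, Rule.GET)`) and the getters used by HttpService. The numeric values of GET and POST are assumptions. RuleException is sketched as an unchecked exception because validateRule throws it without declaring it.

```java
/** Sketch of the request descriptor; argument order inferred from how Rule is constructed later. */
class Rule {
    public static final int GET = 0;   // assumed value
    public static final int POST = 1;  // assumed value

    private final String url;
    private final String[] params;       // request parameter names, may be null
    private final String[] values;       // request parameter values, same length as params
    private final String resultTagName;  // not used by HttpService
    private final int type;
    private final int requestMethod;     // Rule.GET or Rule.POST

    public Rule(String url, String[] params, String[] values,
                String resultTagName, int type, int requestMethod) {
        this.url = url;
        this.params = params;
        this.values = values;
        this.resultTagName = resultTagName;
        this.type = type;
        this.requestMethod = requestMethod;
    }

    public String getUrl() { return url; }
    public String[] getParams() { return params; }
    public String[] getValues() { return values; }
    public String getResultTagName() { return resultTagName; }
    public int getType() { return type; }
    public int getRequestMethod() { return requestMethod; }
}

/** Unchecked exception raised when a Rule fails validation. */
class RuleException extends RuntimeException {
    public RuleException(String message) {
        super(message);
    }
}
```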

HttpService then uses a Rule to fetch the Document object.

import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * Created by chuiyuan on 2/11/16.
 */
public class HttpService {

    /**
     * Fetches the page described by the rule.
     *
     * @param rule request description (URL, parameters, request method)
     * @return doc the parsed Document, or null if the request failed
     */
    public Document extrace(Rule rule) {
        validateRule(rule);
        String url = rule.getUrl();
        String[] params = rule.getParams();
        String[] values = rule.getValues();
        String resultTagName = rule.getResultTagName();
        int type = rule.getType();
        int requestMethod = rule.getRequestMethod();
        Document doc = null;
        try {
            Connection conn = Jsoup.connect(url);
            conn.userAgent("Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.5.0");
            // attach the request parameters, if any
            if (params != null) {
                for (int i = 0; i < params.length; i++) {
                    conn.data(params[i], values[i]);
                }
            }
            //Document doc = null ;
            switch (requestMethod) {
                case Rule.GET:
                    doc = conn.timeout(10000).get();
                    break;
                case Rule.POST:
                    doc = conn.timeout(10000).post();
                    break;
            }
        } catch (IOException e) {
            System.out.println("No network");
            //e.printStackTrace();
        }
        return doc;
    }

    /**
     * validate input params
     */
    private static void validateRule(Rule rule) {
        String url = rule.getUrl();
        if (url == null || url.length() == 0) {
            throw new RuleException("url can't be null");
        }
        if (!url.startsWith("http://")) {
            throw new RuleException("url not in correct format");
        }
        /**
         * not consider total right
         */
        if (rule.getParams() != null && rule.getValues() != null) {
            if (rule.getParams().length != rule.getValues().length) {
                throw new RuleException("params length!= values length");
            }
        }
    }
}

As the code shows, the rule is validated first. The resultTagName field of the rule is never used and can be removed.
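One thing to watch in validateRule: it accepts only URLs that start with "http://", so an https:// address is rejected. If that is not intended, one possible alternative is to check the scheme with java.net.URI. The following is only a sketch; the class UrlCheckDemo and the helper isHttpUrl are hypothetical names and not part of the original code.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlCheckDemo {
    /** Returns true if url parses as a URI with a host and an http or https scheme. */
    static boolean isHttpUrl(String url) {
        if (url == null || url.isEmpty()) {
            return false;
        }
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            return uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // Not a syntactically valid URI at all.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("http://so.gushiwen.org/gushi/tangshi.aspx/"));
        System.out.println(isHttpUrl("ftp://example.com/file"));
    }
}
```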

3. Parse the href of every Tang poem from the Document

4. Extract the detailed content from each poem's href

import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ProcessDoc {

    /**
     * Collects category, title and href for every poem on the index page.
     *
     * @param url
     * @param httpService
     * @return List of poems, or null if the page could not be fetched
     */
    public List<Poem> processGuShiWen(String url, HttpService httpService) {
        Rule rule = new Rule(url,
                null, null,
                null,
                -1,
                Rule.GET);
        Document doc = httpService.extrace(rule);
        if (doc == null) {
            System.out.println("doc null");
            return null;
        }
        //left panel
        Elements mainEles = doc.select("div.leftlei");
        //String dynasty = mainEles.select("div.son1").first().text();
        Elements poemsEles = mainEles.select("div.son2");//size=7 category
        //System.out.println(poemsEles.size());
        List<Poem> poemList = new ArrayList<Poem>();
        //Pattern p =Pattern.compile("\\(|\\)");
        //some times there is no author
        for (Element poemsEle : poemsEles) {
            Elements poemEles = poemsEle.select("span");
            String category1 = poemEles.get(0).text(); //category with ":"
            String category = category1.substring(0, category1.length() - 1);
            // span 0 is the category label; the remaining spans are poems
            for (int i = 1; i < poemEles.size(); i++) {
                Poem poem = new Poem();
                //poem.setDynasty(dynasty);
                poem.setCategory(category);
                poem.setHref(poemEles.get(i).select("a").attr("abs:href") + "/");//ref
                poem.setTitle(poemEles.get(i).select("a").text());
                poemList.add(poem);
            }
        }
        /*for (Poem poem : poemList){
            System.out.println(poem.getCategory()+
                    " "+poem.getTitle()+
                    " "+poem.getHref());
        }*/
        return poemList;
    }

    /**
     * get details of poem
     *
     * @param poem
     * @param httpService
     */
    public void processDetails(Poem poem, HttpService httpService) {
        String url = poem.getHref();
        //System.out.println(url);
        Rule rule = new Rule(url,
                null, null,
                null,
                -1,
                Rule.GET);
        Document doc = httpService.extrace(rule);
        if (doc == null) {
            System.out.println("doc=null");
            return;
        }
        Elements mainEles = doc.select("div.son2");
        //title already ok
        Elements poemDetailEles = mainEles.select("p");
        String dynasty = poemDetailEles.get(0).text().split("：")[1];//note: full-width Chinese colon
        //String dynasty = poemDetailEles.get(0).getElementsByTag("span").text();
        //System.out.println(dynasty);
        String author = poemDetailEles.get(1).getElementsByTag("a").text();
        //System.out.println(author);
        String content = mainEles.text().split("原文：")[1];//note: text after the "原文：" (original text) label
        //System.out.println(content);
        //do not consider translation
        //Element translationEle = doc.select("#")
        poem.setDynasty(dynasty);
        poem.setAuthor(author);
        poem.setContent(content);
        //System.out.println(poem.toString());
    }
}
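The detail parsing above indexes `split(...)[1]` directly, which throws ArrayIndexOutOfBoundsException when a page lacks the separator. A more defensive variant could look like the sketch below. SplitDemo and afterSeparator are hypothetical names, and unlike `split(sep)[1]` the helper returns everything after the first separator rather than only the text up to the next one.

```java
class SplitDemo {
    /** Returns the trimmed text after the first occurrence of sep, or fallback if sep is absent. */
    static String afterSeparator(String text, String sep, String fallback) {
        int idx = text.indexOf(sep);
        if (idx < 0) {
            return fallback;
        }
        return text.substring(idx + sep.length()).trim();
    }

    public static void main(String[] args) {
        // "\uFF1A" is the full-width Chinese colon used on the poem pages.
        System.out.println(afterSeparator("Dynasty\uFF1ATang", "\uFF1A", "unknown"));
        System.out.println(afterSeparator("no separator here", "\uFF1A", "unknown"));
    }
}
```

With a helper like this, a page with an unexpected layout yields a fallback value instead of aborting the whole crawl.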

The overall call looks like this:

/**
 * single thread model,
 * save to mysql
 */
public void getGuShiWenSingleThread() {
    String url = "http://so.gushiwen.org/gushi/tangshi.aspx/";
    HttpService httpService = new HttpService();
    ProcessDoc processDoc = new ProcessDoc();
    List<Poem> poemList = processDoc.processGuShiWen(url, httpService);
    if (poemList == null) {
        // the index page could not be fetched, so there is nothing to process
        return;
    }
    //get poem content details
    for (Poem poem : poemList) {
        processDoc.processDetails(poem, httpService);
        System.out.println(poem.toString());
    }
    //store to mysql
    PoemDao poemDao = new PoemDaoImpl();//declared as the PoemDao interface, not PoemDaoImpl
    for (Poem poem : poemList) {
        try {
            poemDao.add(poem);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

The results end up in MySQL; the database part will be covered in the next article.
