Scraping jokes from Pengfu (捧腹网) with Java

First, a look at the result:

Preparation:

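import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Opens an HTTP connection and returns the response body as a string.
 */
public static String Connect(String address) {
    HttpURLConnection conn = null;
    StringBuilder body = new StringBuilder();
    try {
        URL url = new URL(address);
        conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        conn.setDoInput(true);
        conn.connect();
        // try-with-resources closes the reader and the underlying stream,
        // even when reading fails part-way through
        try (InputStream in = conn.getInputStream();
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // conn is still null if the URL was malformed, so guard against an NPE
        if (conn != null) {
            conn.disconnect();
        }
    }
    // On failure this returns whatever was read so far (possibly an empty string)
    return body.toString();
}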
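The reading loop inside Connect can be tried without touching the network. The following is only a sketch: the class name BodyReaderSketch and the method readBody are made up for illustration, and the explicit charset parameter is an addition that the original code does not have (it relies on the platform default). It also shows that readLine() drops line terminators, so the returned HTML is one long line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
class BodyReaderSketch {

    /** Reads the whole stream line by line, as Connect does, using an explicit charset. */
    static String readBody(InputStream in, Charset charset) {
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line); // readLine() strips the line terminator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // A fake two-line response body stands in for a real page
        byte[] fakeResponse = "<h1>title</h1>\n<p>body</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readBody(new ByteArrayInputStream(fakeResponse), StandardCharsets.UTF_8));
    }
}
```

Losing the line breaks does not matter for parsing HTML, but passing the charset explicitly is probably worth it when the page contains Chinese text and the JVM default encoding is not UTF-8.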

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/**
 * Appends the given text to a file on disk.
 * @param allText the text to write
 */
private static void writeToFile(String allText) {
    System.out.println("Writing...");
    File targetFile = new File("/Users/shibo/tmp/pengfu.txt");
    File fileDir = targetFile.getParentFile();
    if (!fileDir.exists()) {
        fileDir.mkdirs();
    }
    // FileOutputStream in append mode creates the file if it does not exist yet;
    // try-with-resources flushes and closes the stream
    try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(targetFile, true))) {
        bos.write(allText.getBytes());
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println("Done writing.");
}
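The same "create directories, then append" behaviour can also be written with java.nio.file. This is a rough sketch rather than a replacement for the method above: the class AppendFileSketch and its methods appendText and readText are hypothetical names, and writing the bytes as UTF-8 is an assumption (the original uses the platform default encoding).

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original post.
class AppendFileSketch {

    /** Appends text to target as UTF-8, creating the file and its parent directories if needed. */
    static void appendText(Path target, String text) {
        try {
            Path parent = target.getParent();
            if (parent != null) {
                Files.createDirectories(parent); // no error if it already exists
            }
            Files.write(target, text.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file back as UTF-8 text (used here only to check the result). */
    static String readText(Path target) {
        try {
            return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Write into a fresh folder under the system temp directory
        Path target = new File(System.getProperty("java.io.tmpdir"),
                "pengfu-sketch-" + System.nanoTime()).toPath().resolve("pengfu.txt");
        appendText(target, "first page\r\n");
        appendText(target, "second page\r\n");
        System.out.print(readText(target));
    }
}
```

Appending per page like this would also mean the crawler does not have to hold all 50 pages in memory before writing.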

Add the jsoup dependency (used to parse the DOM):

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

Analyzing the site:

Pengfu jokes (捧腹网段子)
First, find the content we need (author, title, and body).

Inspect the page elements. Here I am looking at the title tag:

Once the structure is known, we can extract the content we want:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void main(String[] args) {
        StringBuilder allText = new StringBuilder();
        for (int i = 1; i <= 50; i++) {
            System.out.println("Scraping page " + i + "...");
            // Open a connection and fetch the page HTML
            String html = ConnectionUtil.Connect("https://www.pengfu.com/xiaohua_" + i + ".html");
            // Parse the HTML into a DOM so it is easy to query
            Document doc = Jsoup.parse(html);
            // Select every title node on the page
            Elements titles = doc.select("h1.dp-b");
            for (Element titleEle : titles) {
                Element parent = titleEle.parent();
                // Title text
                String title = titleEle.getElementsByTag("a").text();
                // Author of this joke
                String author = parent.select("p.user_name_list > a").text();
                // Body of this joke
                String content = parent.select("div.content-img").text();
                // Format the entry
                allText.append(title)
                        .append("\r\nAuthor: ").append(author)
                        .append("\r\n").append(content)
                        .append("\r\n").append("\r\n");
            }
            allText.append("-------------Page ").append(i).append("-------------").append("\r\n");
            System.out.println("Finished page " + i + ".");
        }
        // Write everything to disk
        Test.writeToFile(allText.toString());
    }
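The loop in main does two small things that involve neither the network nor jsoup: it builds the URL for each page and it formats each joke as text. Pulling them out makes them easy to check in isolation. The sketch below assumes nothing beyond the URL pattern shown above; the class CrawlHelpersSketch and the methods pageUrl and formatEntry are hypothetical names, and the author label is written in English here.

```java
// Hypothetical helpers, not part of the original post.
class CrawlHelpersSketch {

    /** Builds the list-page URL for a page number, following the xiaohua_N.html pattern. */
    static String pageUrl(int page) {
        return "https://www.pengfu.com/xiaohua_" + page + ".html";
    }

    /** Formats one joke as title, author line, and body, followed by a blank line. */
    static String formatEntry(String title, String author, String content) {
        return new StringBuilder()
                .append(title)
                .append("\r\nAuthor: ").append(author)
                .append("\r\n").append(content)
                .append("\r\n").append("\r\n")
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(pageUrl(1));
        System.out.print(formatEntry("A title", "someone", "The joke text."));
    }
}
```

With helpers like these, the body of the inner loop in main shrinks to the three jsoup lookups plus one call to formatEntry.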

Reference: Python 爬虫入门(一)——爬取糗百 (Python crawler basics, part 1: scraping Qiushibaike)
