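Day 2 & Day 3: Reading content from a URL

1. Crawlers

Given a URL, send a request and receive the response.
Parse the response, extract the data you want, and collect any new URLs to follow.
Reference: https://www.cnblogs.com/sanmubird/p/7857474.html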

2. HttpClient (Apache HttpComponents)

1) Importing the library

Maven

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.2</version>
    </dependency>

Reference: https://blog.csdn.net/mao_xiaoxi/article/details/89206313

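Importing the jar manually
- Download
  - Go to http://hc.apache.org/downloads.cgi
  - Download the binary package https://mirror-hk.koddos.net/apache//httpcomponents/httpclient/binary/httpcomponents-client-5.1-bin.zip
  - Note that this is the 5.1 release, whose classes live under org.apache.hc.client5, while the Maven snippet above pulls in 4.5.2 (packages under org.apache.http). Pick one version and keep the imports consistent with it.
- Import
  - Add a lib folder to the project and put the jar in it
  - Add it to the project
    - Select the jar, right-click, Build Path, Add to Build Path
    - A Referenced Libraries node appears in the project
- Check
  - Right-click the project name, Build Path, Configure Build Path, open the Libraries tab, and confirm that the jar is listed
  - Reference: https://www.cnblogs.com/zhxdxf/p/7598371.html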

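2) Basic steps

- Create an HttpClient object.
- Create an instance of the request method and give it the URL: an HttpGet or HttpPost object, depending on the request.
- Set the request parameters by calling HttpGet.setParams() or HttpPost.setEntity().
- Send the request: execute(request) on the HttpClient object sends it and returns an HttpResponse.
- Read the server's response headers with HttpResponse.getAllHeaders(), and get the HttpEntity object with HttpResponse.getEntity(); the entity wraps the response body.
- Release the connection.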

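3) Difference between HttpClient and CloseableHttpClient

CloseableHttpClient implements the HttpClient interface.
A plain HttpClient does not close connections on its own, so they stay open for a while. If other concurrent requests come in during that time, the process can run out of handles and throw an exception. The TCP connection may also enter the CLOSE_WAIT state, and the socket stays in that state until the local side closes it.
- Workaround (Commons HttpClient 3.x API): HttpClient client = new HttpClient(new HttpClientParams(), new SimpleHttpConnectionManager(true));
- Drawback: this closes the connection after every use, so the repeated new/close cycles cost a lot of JVM memory and hurt performance.
CloseableHttpClient works with a connection pool.
Reference: https://blog.csdn.net/qq_31868149/article/details/103402184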

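4) Experiment code

Steps for reading a web page

1. Open the browser (create the HttpClient object): CloseableHttpClient client = HttpClientBuilder.create().build();
2. Type in the address (create the GET request): create an HttpGet object with the URL, HttpGet httpget = new HttpGet("");
3. Press Enter (execute the GET request): the HttpClient executes the request and returns the result as a response, CloseableHttpResponse res = client.execute(httpget);
4. Show the result (read the response content): HttpEntity entity = res.getEntity(); result = EntityUtils.toString(entity); Act on the response status code (100, 200, 300, ...): 200 is a successful response, 404 means the page does not exist or the resource was not found. On a successful response (status 200), get the HttpEntity object and read the response content from it.
5. Close the browser: res.close();
Reference: https://blog.csdn.net/u014429653/article/details/106985970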
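The status-code check in step 4 can be sketched without any HTTP library. The helper below is only an illustration, not part of the original code: the class name StatusCodes and its two methods are hypothetical, and the category names follow the standard HTTP status classes. With HttpClient 4.x the code passed in would come from res.getStatusLine().getStatusCode().

```java
class StatusCodes {
    /** Maps an HTTP status code to its standard class (hypothetical helper). */
    static String describe(int code) {
        if (code < 100 || code > 599) {
            return "unknown";
        }
        switch (code / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirection";
            case 4: return "client error";
            default: return "server error";
        }
    }

    /** In this sketch only a 2xx response is treated as a page worth saving. */
    static boolean shouldReadBody(int code) {
        return code / 100 == 2;
    }

    public static void main(String[] args) {
        System.out.println(200 + " -> " + describe(200));
        System.out.println(404 + " -> " + describe(404));
    }
}
```

A crawler could call shouldReadBody before EntityUtils.toString, so that an error page is never written to disk as if it were the novel.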

Steps for writing a file

1. Open the file: File file = new File(path);
2. Create it if it does not exist: file.createNewFile();
3. Set up the writers: FileWriter fw = new FileWriter(path); BufferedWriter bw = new BufferedWriter(fw);
4. Write the string: bw.write(str);
5. Close the writer: bw.close();
Reference: https://www.cnblogs.com/x_wukong/p/4679116.html

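Implementation

Reading the web page

    // Imports for HttpClient 4.5.x (the Maven dependency above)
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public static void catchContent() throws Exception {
        String str = "https://www.jjwxc.net/onebook.php?novelid=4889825";
        // Open the browser: create the HttpClient.
        // try-with-resources closes the client and the response when done.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Type in the address: create the GET request
            HttpGet httpget = new HttpGet(str);
            httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36");
            // Press Enter: execute the GET request
            try (CloseableHttpResponse res = client.execute(httpget)) {
                // Show the result: read the response body as a string
                HttpEntity entity = res.getEntity();
                if (entity != null) {
                    // The page is decoded as GBK
                    String result = EntityUtils.toString(entity, "gbk");
                    String path = "H:\\daily\\2021\\07\\04\\a.txt";
                    writeHTML(result, path);
                }
            }
        }
    }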

Writing the file

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;

    public static void writeHTML(String str, String path) throws Exception {
        File file = new File(path);
        if (!file.exists()) {
            file.createNewFile();
        }
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file.getAbsoluteFile()))) {
            bw.write(str);
        }
        System.out.println("ok");
    }

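Reading the HTML page

3. Analyzing the HTML content

1) Jsoup

Jsoup parses a URL or a piece of HTML text. Content is extracted and manipulated through DOM methods, CSS selectors, and jQuery-like operations.

2) Steps

Parsing
- Parse the page into a document holding all of its elements: Document document = Jsoup.parse(str);
- Get elements by class: document.getElementsByClass("readtd")
- Get an element by id: e.getElementById("novelintro")
- Note: a page can contain several elements with the same class, so the return type is Elements. Ids are unique within a page, so the return type is Element.
- Reference: https://www.cnblogs.com/sam-uncle/p/10922366.html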

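Code

Reading the blurb information

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public static void readInfo(String str) throws Exception {
        // Read all of the blurb information: title, author, and blurb
        Document document = Jsoup.parse(str);
        // Title and author name
        Element titles = document.getElementById("oneboolt").getElementsByClass("sptd").first();
        String title = titles.text();
        // The first element with class="readtd" contains all of the blurb information
        Element e = document.getElementsByClass("readtd").first();
        String info = title;
        // Ids are unique, so novelintro could also be looked up on the whole
        // document, although that would be somewhat slower.
        // novelintro holds the blurb text with only <br> tags for line breaks,
        // so take its HTML and strip the tags (assumed literal "<br>").
        String str2 = e.getElementById("novelintro").html();
        str2 = str2.replace("<br>", "");
        info = info + "\n\n" + str2;
        // Write to file
        String path = "H:\\daily\\2021\\07\\05\\b.txt";
        writeHTML(info, path);
    }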

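Reading each chapter's title and link

    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void readChapterUrl(String str) throws Exception {
        Document document = Jsoup.parse(str);
        Element e = document.getElementById("oneboolt");
        Elements chapters = e.getElementsByAttributeValue("itemprop", "headline");
        int i = 1;
        List<String> title = new ArrayList<>();
        List<String> link = new ArrayList<>();
        for (Element chapter : chapters) {
            String url = chapter.getElementsByAttributeValue("itemprop", "url").attr("href");
            // Compare string contents, not references (== would not work here);
            // stop at the first chapter without a link
            if (url.isEmpty()) break;
            link.add(url);
            title.add("第" + (i++) + "章\t " + chapter.text());
            System.out.println(title.get(i - 2) + "\n" + link.get(i - 2));
        }
    }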

Reading the chapter text

Replacing content with a regular expression: https://www.cnblogs.com/xiaoshen666/articles/10641002.html

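    public static String readContent() throws Exception {
        String str = "http://www.jjwxc.net/onebook.php?novelid=5392905&chapterid=1";
        // surfNet fetches the page and returns its HTML (not shown in this post)
        String result = surfNet(str);
        Document document = Jsoup.parse(result);
        Element e = document.getElementsByClass("noveltext").first();
        String novel = e.ownText();
        // ownText() renders <br> tags as spaces, so turning whitespace into
        // newlines restores the original layout. The regex needs a double
        // backslash in Java source.
        novel = novel.replaceAll("\\s", "\n");
        return novel;
    }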
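As an alternative to going through ownText(), the line-break tags can be turned into newlines directly on the raw HTML with String.replaceAll. This is a minimal sketch, assuming the chapter HTML only uses the forms <br>, <br/> and <br />; the class and method names are hypothetical and the sample string is made up for illustration.

```java
class BrToNewline {
    // (?i) makes the match case-insensitive; \s*/? accepts <br>, <br/> and <br />
    static String brToNewline(String html) {
        return html.replaceAll("(?i)<br\\s*/?>", "\n");
    }

    public static void main(String[] args) {
        String sample = "first line<br>second line<BR/>third line<br />end";
        System.out.println(brToNewline(sample));
    }
}
```

Unlike replacing every whitespace character, this leaves ordinary spaces inside a sentence untouched.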

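Writing to txt

Writing with FileWriter and PrintWriter (as written, this overwrites the file and cannot add content at the end; new FileWriter(file, true) would open it in append mode)

    public static void writeHTML(String str, String path) throws Exception {
        File file = new File(path);
        if (!file.exists()) {
            file.createNewFile();
        }
        FileWriter fw = new FileWriter(file.getAbsoluteFile());
        PrintWriter pw = new PrintWriter(fw);
        pw.println(str);
        pw.flush();
        // Closing the PrintWriter also closes the wrapped FileWriter
        pw.close();
    }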

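Writing with a random-access file stream

    public static void writeHTML(String str, String path) throws Exception {
        // A random-access file can be read and written at any position
        RandomAccessFile randomfile = new RandomAccessFile(path, "rw");
        // Move to the end of the file so the new text is appended
        long length = randomfile.length();
        randomfile.seek(length);
        // Written as bytes. getBytes() uses the platform default charset,
        // so Chinese text can come out garbled when read with another encoding.
        str = str + "\n";
        randomfile.write(str.getBytes());
        randomfile.close();
    }

Reference: https://www.cnblogs.com/lshan/p/11988789.html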
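Both problems noted in this section (the file being overwritten instead of appended to, and garbled Chinese text when writing bytes) can be avoided with java.nio.file.Files and an explicit charset. This is a sketch, not the original author's method: the class name AppendUtf8, the method appendLine, and the temporary file used in main are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class AppendUtf8 {
    /** Appends str plus a newline to the file, creating the file if needed. */
    static void appendLine(String str, Path path) throws IOException {
        Files.write(path,
                (str + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("novel", ".txt");
        appendLine("第1章", tmp);
        appendLine("第2章", tmp);
        // Read back with the same charset that was used for writing
        System.out.println(new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8));
        Files.delete(tmp);
    }
}
```

Fixing the charset on both the writing and the reading side is what keeps the text intact; UTF-8 is chosen here only as a common default.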

