Implementing a Web Crawler in Java


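Contents

  • Implementing a Web Crawler in Java
  • Preface
  • 1. What is selenium-java?
  • 2. Steps
    • Crawler directory structure
    • Adding the dependencies
    • Main method code
    • Data entity class (Fa)
    • Data entity class (FaDetail)
    • Utility class (Config)
    • Utility class (MyHttpUtil)
    • MySqlStrategy
    • Utility class (serialization and deserialization)
    • Utility class (Utils)
  • Summary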

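Preface

1 This article uses selenium-java together with httpclient to crawl pages, then batch-inserts the results into MySQL through JDBC.
2 The approach covers cases such as enabling request monitoring to capture the token automatically, and sites whose AJAX responses come back encrypted so the data cannot be read directly.
3 Look up how to set up chromedriver yourself (if every step is correct and it still fails, run your IDE with administrator privileges).
4 Note: the code below is a demo; adapt it to your own business requirements.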


1. What is selenium-java?

selenium-java is the Java binding for Selenium WebDriver, a library that drives a real browser from code.

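2. Steps

Crawler directory structure

Judging from the package declarations in the code below, the project has three packages: test (CrawlerTest), entity (Fa, FaDetail), and util (Config, MyHttpUtil, MySqlStrategy, SerializableUtil, Utils).

Adding the dependencies

Maven dependencies: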

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.5.3</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.13</version>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <version>1.18.22</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.11.1</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.11.1</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-annotations</artifactId>
    <version>2.11.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.23</version>
</dependency>

Main method code

The code is as follows (example):

package test;

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.devtools.DevTools;
import org.openqa.selenium.devtools.v106.network.Network;
import org.openqa.selenium.devtools.v106.network.model.Headers;
import org.openqa.selenium.devtools.v106.network.model.ResourceType;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import entity.Fa;
import util.MyHttpUtil;
import util.MySqlStrategy;
import util.SerializableUtil;
import util.Utils;

/**
 * @author admin
 */
public class CrawlerTest {

    // Written by the DevTools listener thread, read by the main thread
    private static volatile String token = "xxxx";
    final static String driverAddr = "C:\\Users\\admin\\AppData\\Local\\Google\\Chrome\\Application\\chromedriver.exe";
    // Login page
    final static String url1 = "https://xxxxx/system/login?";
    // Detail API
    final static String querySaasUrlTemplate = "https://xxxxxx?id=#{id}";
    final static String url2 = "https://xxxxxx?";
    final static String userName = "uername";
    final static String passWord = "password";
    // File that persists the ids already crawled
    final static File idCacheFile = new File("id.bat");
    // File that persists the search names already processed
    final static File searchNameFile = new File("searchName.bat");
    final static Set<String> idSet = getCacheSet(idCacheFile);
    final static Set<String> searchNameSet = getCacheSet(searchNameFile);

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", driverAddr);
        // Configure the browser options
        ChromeOptions options = new ChromeOptions();
        // Hide the "Chrome is being controlled by automated test software" banner
        options.setExperimentalOption("excludeSwitches", new String[] { "enable-automation" });
        ChromeDriver driver = new ChromeDriver(options);
        Map<String, Object> command = new HashMap<>();
        // Make window.navigator.webdriver report undefined
        command.put("source", "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})");
        driver.executeCdpCommand("Page.addScriptToEvaluateOnNewDocument", command);
        // driver.executeScript("https://raw.githubusercontent.com/wendux/Ajax-hook/master/dist/ajaxhook.min.js");
        // driver.get("htps://www.baidu.com");
        // Log in first
        driver.get(url1);
        driver.manage().window().maximize();
        Utils.sleep(5000);
        // Enter the user name
        driver.findElement(By.xpath("//*[@id=\"phone_number\"]")).sendKeys(userName);
        Utils.sleep(1000);
        // Enter the password
        driver.findElement(By.xpath("//*[@id=\"password\"]")).sendKeys(passWord);
        Utils.sleep(1000);
        // Tick the agreement checkbox
        driver.findElement(By.xpath("//*[@id=\"agreement\"]")).click();
        Utils.sleep(1000);
        // Click the login button
        driver.findElement(By.xpath("//*[@id=\"root\"]/div/div[2]/div[1]/div[2]/div/div/form/div[4]/div/div/div/button")).click();
        // Get the handle of the current window
        String handel1 = driver.getWindowHandle();
        Utils.sleep(1000);
        System.out.println("Login succeeded");
        Utils.sleep(3000);
        driver.get(url1);
        Utils.sleep(3000);
        // Open a new window
        String js = "window.open(\"" + url2 + "\");";
        ((JavascriptExecutor) driver).executeScript(js);
        Utils.sleep(2000);
        // Switch windows
        Object[] obj = driver.getWindowHandles().toArray();
        // Listen for requests (window with index 1)
        createRequestListener(1, driver);
        driver.switchTo().window(obj[1].toString());
        Utils.sleep(1000);
        // Placeholder search term
        String searchName = "search name";
        // Skip search terms that were already crawled
        if (searchNameSet.contains(searchName)) {
            System.out.println(searchName + ": already processed");
            return;
        }
        driver.findElement(By.xpath("//*[@id=\"name\"]")).sendKeys(searchName);
        // Run the query
        driver.findElement(By.xpath("//*[@id=\"root\"]/section/section/div[2]/div/div[1]/div/form/div[6]/button")).click();
        Utils.sleep(2000);
        WebElement webElement = null;
        try {
            // Use the "next page" button to decide whether there is data
            // (findElement throws when the button is missing)
            webElement = driver.findElement(By.xpath("//*[@id=\"root\"]/section/section/div[2]/div/div[3]/div/div/div/div/div/ul/li[5]/button"));
        } catch (Exception exception) {
            System.out.println("No data");
        }
        // Collect every row for one search term, then insert them all into the database
        List<Fa> faList = new ArrayList<>();
        // Cap the loop at 1000 pages to avoid an endless loop
        for (int i = 0; i < 1000; i++) {
            // The first page needs no click
            if (i != 0) {
                // Paging: stop when there is no button or it is disabled (last page reached)
                if (webElement == null || !webElement.isEnabled()) {
                    break;
                }
                webElement.click();
                // Wait 2 seconds after each click before reading the data
                Utils.sleep(2000);
            }
            // Read the rows of the current page
            WebElement tableWebElement = driver.findElement(By.xpath("//*[@id=\"root\"]/section/section/div[2]/div/div[3]/div/div/div/div/div/div/div/div/table/tbody"));
            List<WebElement> trList = tableWebElement.findElements(By.tagName("tr"));
            System.out.println("");
            System.out.println("Current page: " + (i + 1));
            for (WebElement element : trList) {
                Utils.sleep(500);
                // System.out.println(element.getText().replace(" ", ""));
                // Button that opens the detail view:
                // WebElement detailElement=element.findElement(By.xpath("//*[@id=\"root\"]/section/section/div[2]/div/div[3]/div/div/div/div/div/div/div/div/table/tbody/tr[1]/td[9]/div/span[1]"));
                // detailElement.click();
                // Skip rows that were already processed
                String detailId = element.getAttribute("data-row-key");
                if (idSet.contains(detailId)) {
                    continue;
                }
                // Unit ("./" keeps the XPath relative to this row; "//td[5]" would search the whole page)
                String unit = element.findElement(By.xpath("./td[5]")).getText().replace(" ", "");
                // Country name
                String countriesName = element.findElement(By.xpath("./td[7]")).getText().replace(" ", "");
                // Send an HTTP request with the id that was read from the row
                String querySaasUrl = querySaasUrlTemplate.replace("#{id}", detailId);
                String result = MyHttpUtil.getRequest(token, querySaasUrl);
                // Parse the JSON response
                ObjectMapper mapper = new ObjectMapper();
                try {
                    JsonNode jsonNode = mapper.readTree(result);
                    // asText() works whether "code" is a JSON number or a JSON string
                    if ("200".equals(jsonNode.path("code").asText())) {
                        JsonNode dataNode = jsonNode.get("data");
                        System.out.println(dataNode);
                        Fa fa = mapper.readValue(dataNode.toString(), Fa.class);
                        fa.setUnit(unit);
                        fa.setCountriesName(countriesName);
                        // Keep the full payload returned by the AJAX call
                        fa.setRowData(dataNode.toString());
                        faList.add(fa);
                    } else {
                        System.out.println("Failed to fetch the JSON data");
                        System.out.println(jsonNode.toPrettyString());
                        System.exit(0);
                    }
                } catch (Exception e) {
                    System.out.print("Failed to parse the data: ");
                    e.printStackTrace();
                    // Exit
                    System.exit(0);
                }
            }
            // System.out.println(tableWebElement.getText());
        }
        // Insert the data into MySQL
        if (!faList.isEmpty()) {
            MySqlStrategy.insertValue(faList);
        }
        // Cache the parameters of this run
        searchNameSet.add(searchName);
        for (Fa factory : faList) {
            idSet.add(factory.getRowId());
        }
        // Serialize the caches
        SerializableUtil.serialization(searchNameFile, searchNameSet);
        SerializableUtil.serialization(idCacheFile, idSet);
        // Block the main thread so the browser stays open
        try {
            Thread.currentThread().join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    /**
     * Each window that is switched to needs its own listener.
     *
     * @param i      window index (only identifies which window is monitored; it has no other effect)
     * @param driver the Chrome driver
     */
    private static void createRequestListener(int i, ChromeDriver driver) {
        DevTools devTools = driver.getDevTools();
        devTools.createSession();
        devTools.send(Network.enable(java.util.Optional.empty(), java.util.Optional.empty(), java.util.Optional.empty()));
        devTools.addListener(Network.requestWillBeSent(), res -> {
            Utils.sleep(10);
            System.out.println("RequestHeaders:" + res.getRequest().getHeaders());
            System.out.println("RequestUrl:" + res.getRequest().getUrl());
            Headers header = res.getRequest().getHeaders();
            synchronized (CrawlerTest.class) {
                if (header.containsKey("Authorization")) {
                    token = (String) header.get("Authorization");
                    // Close this listener once the token has been captured
                    devTools.close();
                    System.out.println("Captured token: " + token);
                }
            }
        });
    }

    /**
     * Captures AJAX (XHR) responses whose URL matches.
     *
     * @param i        window index, used only in the log output
     * @param devTools the DevTools session to listen on
     */
    public static void interceptResponseXHRByUrl(int i, DevTools devTools) {
        devTools.addListener(Network.responseReceived(), responseReceived -> {
            try {
                // Only handle XHR responses for the target URL
                if (ResourceType.XHR != responseReceived.getType()
                        || !responseReceived.getResponse().getUrl().contains("/xxxxxx")) {
                    return;
                }
                String data = "Monitored data " + i + ":" + responseReceived.getType() + ":"
                        + responseReceived.getResponse().getUrl();
                Utils.sleep(2);
                FileUtils.write(new File("log/re.txt"), data, "UTF-8", true);
                FileUtils.write(new File("log/re.txt"), "\r\n", "UTF-8", true);
                devTools.send(Network.getResponseBody(responseReceived.getRequestId()));
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }

    /**
     * Loads a cached set from a file, or creates an empty set if there is no cache.
     */
    private static Set<String> getCacheSet(File file) {
        Set<String> set = new LinkedHashSet<>();
        // Deserialize the cached values
        Set<String> cacheSet = SerializableUtil.deserialization(file, set);
        if (cacheSet != null) {
            set = cacheSet;
        }
        return set;
    }
}
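The request listener above looks up the Authorization header with an exact-case key. HTTP header names are case-insensitive, so a site whose script sends the header as authorization would never match. The following is a minimal sketch of a case-insensitive lookup, not part of the original code: HeaderLookup and findHeader are hypothetical names, and the plain map stands in for the header map that the listener receives.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: finds a header value regardless of the key's letter case.
final class HeaderLookup {

    static Optional<String> findHeader(Map<String, ?> headers, String name) {
        for (Map.Entry<String, ?> entry : headers.entrySet()) {
            if (entry.getKey().equalsIgnoreCase(name) && entry.getValue() != null) {
                return Optional.of(String.valueOf(entry.getValue()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the headers of a captured request
        Map<String, Object> headers = new LinkedHashMap<>();
        headers.put("accept", "application/json");
        headers.put("authorization", "demo-token");
        System.out.println(findHeader(headers, "Authorization").orElse("<none>"));
    }
}
```

Inside the listener, the result of such a lookup could replace the containsKey/get pair, under the assumption that the header map behaves like a normal java.util.Map.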

Data entity class (Fa)

The code is as follows (example):

package entity;

import java.util.List;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;

import lombok.Data;

@Data
@JsonIgnoreProperties(ignoreUnknown = true)
public class Fa {

    // Full JSON payload returned by the detail API
    private String rowData;

    // Mapped from the "id" field of the JSON payload
    @JsonProperty("id")
    private String rowId;

    private String unit;

    private String countriesName;

    private List<FaDetail> detailData;
}

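Data entity class (FaDetail)

The code is as follows (example):

package entity;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

import lombok.Data;

@Data
@JsonIgnoreProperties(ignoreUnknown = true)
public class FaDetail {

    // Primary key of the parent Fa row, set after the parent is inserted
    private Long faId;

    private String type;
}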

Utility class (Config)

The code is as follows (example):

package util;

public class Config {

    // Driver class; this name is fixed for MySQL Connector/J 8.0
    public static final String JDBC_DRIVER = "com.mysql.cj.jdbc.Driver";
    // Database URL; change the database name as needed
    public static final String DB_URL = "jdbc:mysql://192.168.111.102:3306/crawler?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai";
    // User name
    public static final String USER = "root";
    // Password
    public static final String PASSWORD = "Sailing123`";
}
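Config hard-codes the database address and credentials, which is fine for a demo but awkward once the code is shared. One possible alternative, sketched here under assumptions, reads each value from an environment variable and falls back to a default. EnvConfig, envOrDefault, and the CRAWLER_DB_* variable names are hypothetical and do not appear in the original code.

```java
// Hypothetical alternative to hard-coded constants: environment variables with defaults.
final class EnvConfig {

    // Returns the environment variable's value, or the fallback when it is unset or empty.
    static String envOrDefault(String name, String fallback) {
        String value = System.getenv(name);
        return (value == null || value.isEmpty()) ? fallback : value;
    }

    public static void main(String[] args) {
        // Variable names and default values below are illustrative assumptions
        String url = envOrDefault("CRAWLER_DB_URL", "jdbc:mysql://localhost:3306/crawler");
        String user = envOrDefault("CRAWLER_DB_USER", "root");
        System.out.println("url=" + url + ", user=" + user);
    }
}
```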

Utility class (MyHttpUtil)

The code is as follows (example):

package util;

import java.io.IOException;

import org.apache.http.ParseException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class MyHttpUtil {

    private static final String token = "a7ee88f8-21d6-4b1d-bfa8-ff478a473304+1000001239406480";
    private static final String url = "https://xxxxxx?id=xxxxxx";
    private static final CloseableHttpClient closeableHttpClient = HttpClients.createDefault();

    public static void main(String[] args) {
        getRequest(token, url);
    }

    public static String getRequest(String token, String url) {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("authorization", token);
        // try-with-resources closes the response when the body has been read
        try (CloseableHttpResponse closeableHttpResponse = closeableHttpClient.execute(httpGet)) {
            // UTF-8 is used when the response does not declare a charset
            return EntityUtils.toString(closeableHttpResponse.getEntity(), "UTF-8");
        } catch (ParseException | IOException e) {
            e.printStackTrace();
            System.out.println("The request failed; please investigate");
            System.exit(1);
        } finally {
            // Return the connection to the pool so it can be reused
            httpGet.releaseConnection();
        }
        return null;
    }
}
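The main method builds the detail URL by replacing #{id} in a template with the raw id. If an id ever contains characters that are reserved in a query string, the resulting URL would be malformed. A minimal sketch of an encoding-aware substitution follows; UrlTemplate and fill are hypothetical names, and the example host is a placeholder rather than the real endpoint.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: fills the #{id} placeholder with a URL-encoded value.
final class UrlTemplate {

    static String fill(String template, String id) {
        try {
            return template.replace("#{id}", URLEncoder.encode(id, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder template; the real endpoint is elided in the article
        System.out.println(fill("https://example.invalid/detail?id=#{id}", "a b&c"));
        // prints https://example.invalid/detail?id=a+b%26c
    }
}
```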

MySqlStrategy

The code is as follows (example):

package util;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

import entity.Fa;
import entity.FaDetail;

public class MySqlStrategy {

    private final static String url = Config.DB_URL;
    private final static String user = Config.USER;
    private final static String password = Config.PASSWORD;
    private static Connection conn = getConnection();

    // ALTER TABLE factor AUTO_INCREMENT=1;
    public static void main(String[] args) {
        insertValue(null);
    }

    private static Connection getConnection() {
        try {
            conn = DriverManager.getConnection(url, user, password);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return conn;
    }

    public static void insertValue(List<Fa> datalist) {
        // Nothing to insert
        if (datalist == null || datalist.isEmpty()) {
            return;
        }
        String sql = "insert into fa values(?,?,?,?,?)";
        String gasSql = "insert into fa_detail values(?,?,?)";
        try {
            conn.setAutoCommit(false);
        } catch (SQLException e2) {
            e2.printStackTrace();
        }
        try (PreparedStatement statement = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                PreparedStatement detailStatement = conn.prepareStatement(gasSql, Statement.RETURN_GENERATED_KEYS)) {
            // Batch-insert the parent rows
            for (int i = 0; i < datalist.size(); i++) {
                Fa fa = datalist.get(i);
                creatFaParam(fa, statement);
                statement.addBatch();
            }
            statement.executeBatch();
            ResultSet generatedKeys = statement.getGeneratedKeys();
            List<Long> idList = new ArrayList<>();
            while (generatedKeys.next()) {
                idList.add(generatedKeys.getLong(1));
            }
            // Close the result set
            close(null, null, generatedKeys);
            // Set the parent id on each child row
            for (int i = 0; i < datalist.size(); i++) {
                Fa factory = datalist.get(i);
                List<FaDetail> detailList = factory.getDetailData();
                if (detailList != null) {
                    for (FaDetail gas : detailList) {
                        gas.setFaId(idList.get(i));
                        // Prepare the batch data
                        creatFaDetailParam(detailStatement, gas);
                        detailStatement.addBatch();
                    }
                }
            }
            // Batch-insert the child rows
            detailStatement.executeBatch();
            conn.commit();
        } catch (Exception e1) {
            // Roll back
            try {
                conn.rollback();
            } catch (SQLException e) {
            }
            String message = e1.getMessage();
            if (message != null && message.contains("Duplicate entry")) {
                // A duplicate key means the data already exists; just return
                return;
            } else {
                // Exit the program so the problem can be investigated
                e1.printStackTrace();
                System.exit(1);
            }
        }
    }

    private static void creatFaDetailParam(PreparedStatement statement, FaDetail detail) throws SQLException {
        // Column 1 is the auto-increment primary key
        statement.setString(1, null);
        statement.setLong(2, detail.getFaId());
        statement.setString(3, detail.getType());
    }

    private static void creatFaParam(Fa fa, PreparedStatement statement) throws SQLException {
        // Column 1 is the auto-increment primary key
        statement.setString(1, null);
        statement.setString(2, fa.getRowData());
        statement.setLong(3, Long.valueOf(fa.getRowId()));
        statement.setString(4, fa.getUnit());
        statement.setString(5, fa.getCountriesName());
    }

    public static void close(Connection connection, Statement statement, ResultSet resultSet) {
        try {
            if (connection != null)
                connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            if (statement != null)
                statement.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            if (resultSet != null)
                resultSet.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
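insertValue sends the whole list to the database as a single JDBC batch. When a search returns many rows, it may be preferable to flush the batch every few hundred rows. The sketch below only shows the list-splitting step and is an illustration rather than the original approach: BatchChunks and partition are hypothetical names, and the chunk size is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive chunks of at most `size` items.
final class BatchChunks {

    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += size) {
            int to = Math.min(from + size, items.size());
            // Copy the sub-list so each chunk is independent of the source list
            chunks.add(new ArrayList<>(items.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 1; i <= 7; i++) {
            rows.add(i);
        }
        System.out.println(partition(rows, 3)); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

Each chunk would then go through its own addBatch/executeBatch cycle. Because the child rows are matched to generated keys by list index, the keys would have to be collected per chunk as well.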

Utility class (serialization and deserialization)

The code is as follows (example):

package util;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.LinkedHashSet;
import java.util.Set;

public class SerializableUtil {

    public static void main(String[] args) {
        File file = new File("test.dat");
        Set<String> set = new LinkedHashSet<>();
        set.add("hello");
        SerializableUtil.serialization(file, set);
        Set<String> set1 = SerializableUtil.deserialization(file, new LinkedHashSet<String>());
        System.out.println(set1);
    }

    // Writes the object to the file with Java serialization
    public static <T> void serialization(File file, T t) {
        // try-with-resources closes the stream even if writing fails
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file))) {
            oos.writeObject(t);
            oos.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Reads the object back; returns null if the file is missing or cannot be read.
    // The parameter t only fixes the return type.
    @SuppressWarnings("unchecked")
    public static <T> T deserialization(File file, T t) {
        if (!file.exists()) {
            return null;
        }
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file))) {
            t = (T) ois.readObject();
            return t;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
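The caches of processed ids and search names are plain sets of strings, so Java serialization is not the only way to persist them. As a possible alternative, the sketch below stores one value per line in a UTF-8 text file, which can be inspected or edited by hand. TextSetCache is a hypothetical name, and the sketch assumes that no cached value contains a line break.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical alternative cache: one value per line in a UTF-8 text file.
final class TextSetCache {

    static void save(Path file, Set<String> values) {
        try {
            Files.write(file, values, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns an empty set when the file does not exist yet.
    static Set<String> load(Path file) {
        if (!Files.exists(file)) {
            return new LinkedHashSet<>();
        }
        try {
            return new LinkedHashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("id-cache", ".txt");
        Set<String> ids = new LinkedHashSet<>();
        ids.add("1001");
        ids.add("1002");
        save(file, ids);
        System.out.println(load(file)); // prints [1001, 1002]
        Files.delete(file);
    }
}
```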

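Utility class (Utils)

The code is as follows (example):

package util;

public class Utils {

    // Sleeps for the given number of milliseconds
    public static void sleep(Integer time) {
        try {
            Thread.sleep(time);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still detect the interruption
            Thread.currentThread().interrupt();
        }
    }
}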
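The crawler relies on fixed sleeps, which are either longer than necessary or too short when the page is slow. A common alternative is to poll a condition until it holds or a timeout expires. The sketch below shows the idea with the standard library only; Waits and waitUntil are hypothetical names. Selenium also ships its own explicit-wait support (WebDriverWait), which serves the same purpose for page elements.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a condition instead of sleeping for a fixed time.
final class Waits {

    // Returns true as soon as the condition holds, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true about 200 ms after start
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("ready=" + ready);
    }
}
```

In the crawler, the condition could be something like "the table body contains at least one row", checked inside the supplier.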

Summary

selenium-java combined with httpclient is enough to crawl most websites. That is all for the crawler code.
