Scraping Tmall, Taobao, and JD search pages and product details with a Java crawler

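First identify the product URL, work out which platform it belongs to and extract the product number, then scrape the data from that platform using the product number.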
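The three-step flow can be sketched as a small enum. This is only an illustration: the `Platform` enum and its `detect` and `detailUrl` methods are hypothetical names that do not appear in the original project, while the URL markers and detail-page prefixes are the ones used later in this article.

```java
import java.util.Optional;

/** Hypothetical enum (not part of the original project) tying each platform to its URLs. */
enum Platform {
    TMALL("tmall.com", "https://detail.tmall.com/item.htm?id=", ""),
    TAOBAO("taobao.com", "https://item.taobao.com/item.htm?id=", ""),
    JD("jd.com", "https://item.jd.com/", ".html");

    private final String urlSign;      // substring that identifies the platform
    private final String detailPrefix; // part of the detail URL before the product number
    private final String detailSuffix; // part of the detail URL after the product number

    Platform(String urlSign, String detailPrefix, String detailSuffix) {
        this.urlSign = urlSign;
        this.detailPrefix = detailPrefix;
        this.detailSuffix = detailSuffix;
    }

    /** Step 1: decide which platform a product URL belongs to. */
    static Optional<Platform> detect(String url) {
        if (url == null) {
            return Optional.empty();
        }
        for (Platform platform : values()) {
            if (url.contains(platform.urlSign)) {
                return Optional.of(platform);
            }
        }
        return Optional.empty();
    }

    /** Step 3: rebuild the detail-page URL to scrape from the product number. */
    String detailUrl(long number) {
        return detailPrefix + number + detailSuffix;
    }
}
```

Step 2, extracting the product number, is what the `UrlUtils` class in section 3 does; once the number is known, `Platform.JD.detailUrl(100004250098L)` would give the page to fetch.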

1. Add the dependencies

<!-- Crawler-related JAR dependencies -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.10-FINAL</version>
</dependency>
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <scope>provided</scope>
</dependency>

2. Define the return types and constants

Bring in Lombok and annotate the classes with @Data to avoid writing repetitive code such as getters, setters, and toString.

// UrlData.java
package java1024.xyz.vo;

import lombok.Data;

/**
 * @author xivin
 * @email 1250402127@qq.com
 * @description
 * @date 2020/1/3
 */
@Data
public class UrlData {
    // 1 when a product number was extracted, 0 otherwise
    private int status;
    // URL marker of the recognised platform, e.g. "tmall.com"
    private String platform;
    // Product number extracted from the URL
    private Long number;
}

// UrlConst.java
package java1024.xyz.vo;

/**
 * Constants interface
 */
public interface UrlConst {
    String taobaoUrlSign = "taobao.com";
    String tmallUrlSign = "tmall.com";
    String jingdongUrlSign = "jd.com";
    String TMALL_PRODUCT_DETAIL = "https://detail.tmall.com/item.htm?id=";
    String TAOBAO_PRODUCT_DETAIL = "https://item.taobao.com/item.htm?id=";
    String JD_PRODUCT_DETAIL = "https://item.jd.com/";
}

import com.fasterxml.jackson.annotation.JsonFormat;
import lombok.Data;

import java.io.Serializable;
import java.sql.Timestamp;

/**
 * @author xivin
 * @email 1250402127@qq.com
 * @description Product entity class
 * @date 2020/1/3
 */
@Data
public class Product implements Serializable {
    private Long id;
    private Long number;
    private Float price;
    private Integer userId;
    private String url;
    private Integer platformId;
    private String title;
    private String describe;
    private Integer status;
    @JsonFormat(pattern = "yyyy-MM-dd HH:mm:ss")
    private Timestamp createdAt;
    private Timestamp updatedAt;
}
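To make concrete what @Data saves, here is a hand-written approximation of the UrlData class without Lombok. The class name `UrlDataPlain` is hypothetical, and the code Lombok actually generates differs in detail (its hashCode, for example, is computed differently); treat this as a rough equivalent, not as Lombok's output.

```java
import java.util.Objects;

/** Hand-written approximation of what @Data provides for UrlData (hypothetical class). */
class UrlDataPlain {
    private int status;
    private String platform;
    private Long number;

    public int getStatus() { return status; }
    public void setStatus(int status) { this.status = status; }

    public String getPlatform() { return platform; }
    public void setPlatform(String platform) { this.platform = platform; }

    public Long getNumber() { return number; }
    public void setNumber(Long number) { this.number = number; }

    // Two instances are equal when all three fields are equal.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof UrlDataPlain)) {
            return false;
        }
        UrlDataPlain other = (UrlDataPlain) o;
        return status == other.status
                && Objects.equals(platform, other.platform)
                && Objects.equals(number, other.number);
    }

    @Override
    public int hashCode() {
        return Objects.hash(status, platform, number);
    }

    // Lists every field, which is what makes printing a UrlData readable.
    @Override
    public String toString() {
        return "UrlDataPlain(status=" + status + ", platform=" + platform + ", number=" + number + ")";
    }
}
```

With roughly forty lines per class replaced by one annotation, the saving adds up quickly for an entity such as Product with its eleven fields.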

3. With the groundwork done, wrap the URL recognition logic in a utility class, UrlUtils.java

import java1024.xyz.vo.UrlConst;
import java1024.xyz.vo.UrlData;
// StringUtils.isEmpty: the original does not show this import. Both
// org.apache.commons.lang3.StringUtils and org.springframework.util.StringUtils provide it.

/**
 * @author xivin
 * @email 1250402127@qq.com
 * @description
 * @date 2020/1/3
 */
public class UrlUtils {

    public static UrlData analyseUrl(String url) {
        UrlData urlData = new UrlData();
        // status stays 0 (failure) until a product number has been extracted
        urlData.setStatus(0);
        try {
            // Empty check
            if (StringUtils.isEmpty(url)) {
                return urlData;
            }
            String numberStr;
            if (url.contains(UrlConst.tmallUrlSign)) {
                // Tmall
                urlData.setPlatform(UrlConst.tmallUrlSign);
                numberStr = extractIdParam(url);
            } else if (url.contains(UrlConst.taobaoUrlSign)) {
                // Taobao
                urlData.setPlatform(UrlConst.taobaoUrlSign);
                numberStr = extractIdParam(url);
            } else if (url.contains(UrlConst.jingdongUrlSign)) {
                // JD
                urlData.setPlatform(UrlConst.jingdongUrlSign);
                numberStr = extractJdNumber(url);
            } else {
                // Any other platform is not supported
                return urlData;
            }
            if (StringUtils.isEmpty(numberStr)) {
                return urlData;
            }
            // Long.valueOf replaces the deprecated new Long(String)
            urlData.setNumber(Long.valueOf(numberStr));
            urlData.setStatus(1);
            return urlData;
        } catch (Exception e) {
            e.printStackTrace();
            urlData.setStatus(0);
            return urlData;
        }
    }

    /**
     * Splits the URL into path and query string at the question mark, e.g.
     * https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.8.27832a99AfoD5W&id=604433373792
     * then splits the query string on & and returns the value of the parameter
     * that starts with "id=", which is the product id. Returns "" if there is none.
     */
    private static String extractIdParam(String url) {
        String[] pathAndParams = url.split("\\?");
        if (pathAndParams.length < 2) {
            return "";
        }
        String[] params = pathAndParams[1].split("&");
        for (String param : params) {
            if (param.startsWith("id=")) {
                return param.substring("id=".length());
            }
        }
        return "";
    }

    /**
     * Returns the part between "jd.com/" and ".html", e.g. 100004250098 for
     * https://item.jd.com/100004250098.html. Returns "" if the URL has no path.
     */
    private static String extractJdNumber(String url) {
        String[] hostAndPath = url.split("jd\\.com/");
        if (hostAndPath.length < 2) {
            return "";
        }
        // split() takes a regular expression, so the dot has to be escaped
        return hostAndPath[1].split("\\.html")[0];
    }

    public static void main(String[] args) {
        String tmallUrl = "https://detail.tmall.com/item.htm?spm=a220m.1000858.1000725.8.27832a99AfoD5W&id=604433373792&skuId=4233630160968&user_id=1776477331&cat_id=2&is_b=1&rn=2eff85a6a504024ee62222a0045d9ded";
        UrlData tmall = analyseUrl(tmallUrl);
        System.out.println("tmall = " + tmall);

        // A Taobao search URL: it carries nid= rather than id=, so status stays 0
        String taobaoUrl = "https://s.taobao.com/search?spm=a230r.1.14.7.ade0695abTrJ6k&type=samestyle&app=i2i&rec_type=1&uniqpid=69915374&nid=604733501729";
        UrlData taobao = analyseUrl(taobaoUrl);
        System.out.println("taobao = " + taobao);

        String jdUrl = "https://item.jd.com/100004250098.html#none";
        UrlData jd = analyseUrl(jdUrl);
        System.out.println("jd = " + jd);
    }
}
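As an alternative to splitting strings by hand, the same extraction can be sketched with regular expressions. This is not the article's implementation: `ProductIdExtractor` and its `extract` method are hypothetical names, and the sketch uses only the JDK so it has no dependency on StringUtils or the UrlData class.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based variant of the product-number extraction. */
final class ProductIdExtractor {
    // "id=<digits>" only when it is a whole query parameter, so skuId= and user_id= do not match
    private static final Pattern QUERY_ID = Pattern.compile("[?&]id=(\\d+)");
    // "jd.com/<digits>.html"
    private static final Pattern JD_ID = Pattern.compile("jd\\.com/(\\d+)\\.html");

    private ProductIdExtractor() {
    }

    static OptionalLong extract(String url) {
        if (url == null || url.isEmpty()) {
            return OptionalLong.empty();
        }
        Pattern pattern;
        if (url.contains("tmall.com") || url.contains("taobao.com")) {
            pattern = QUERY_ID;
        } else if (url.contains("jd.com")) {
            pattern = JD_ID;
        } else {
            return OptionalLong.empty();
        }
        Matcher matcher = pattern.matcher(url);
        if (!matcher.find()) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(matcher.group(1)));
        } catch (NumberFormatException e) {
            // More digits than fit into a long
            return OptionalLong.empty();
        }
    }
}
```

Returning an OptionalLong instead of a status flag forces the caller to handle the "no product number" case explicitly; whether that suits a project that already passes UrlData around is a design choice.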

4. Scraping Tmall product details

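How the Java crawler works: create an HttpClient, set the request headers, execute the request, and parse the response. The code is commented step by step.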

// Imports for the scraping methods (HttpClient 4.5 and Jsoup)
import org.apache.http.Consts;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public Product soupTmallDetailById(Long number) {
    // try-with-resources closes the client and the response, which the original version leaked
    try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
        // URL of the product page to scrape
        String url = "https://chaoshi.detail.tmall.com/item.htm?id=" + number;
        HttpGet httpGet = new HttpGet(url);
        // Pretend to be a browser (the user-agent value can be copied from the
        // request headers your own browser sends)
        httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");
        try (CloseableHttpResponse response = httpclient.execute(httpGet)) {
            // Response status code
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Only a 200 response carries the HTML we are after
            if (statusCode == 200) {
                String html = EntityUtils.toString(entity, Consts.UTF_8);
                // Parse the whole page; print doc to see the source that was actually fetched
                Document doc = Jsoup.parse(html);
                // Inspect the page source in a browser to find the div that holds the
                // product information, then drill down step by step. The selectors
                // depend on the current page markup and may need adjusting.
                Element item = doc.select("div[class='tb-wrap']").get(0);
                Product product = new Product();
                product.setNumber(number);
                product.setPlatformId(1);
                String title = item.select("div[class='tb-detail-hd']").select("h1").text();
                product.setTitle(title);
                product.setUrl(UrlConst.TMALL_PRODUCT_DETAIL + number);
                System.out.println("Product title: " + title);
                return product;
            }
        } catch (Exception e) {
            // Reached when the request fails or the expected elements are missing
            e.printStackTrace();
            Product product = new Product();
            product.setId(0L);
            product.setTitle("Product not found");
            return product;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Non-200 response or failure to create the client
    return null;
}
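The request half of that flow (create a client, set the headers, execute, check the status) can be separated from the selector logic. The sketch below deliberately swaps in the JDK's built-in java.net.http client (Java 11 or later) instead of Apache HttpClient, so that it runs without extra dependencies; `PageFetcher`, `buildRequest`, and `fetchHtml` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper showing the fetch steps with the JDK client (Java 11+). */
final class PageFetcher {
    // Same browser user-agent string as in the article's methods
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36";

    private PageFetcher() {
    }

    /** Builds a GET request that identifies itself as a browser. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(10)) // assumption: any reasonable limit works
                .GET()
                .build();
    }

    /** Executes the request and returns the body only for a 200 response. */
    static Optional<String> fetchHtml(HttpClient client, String url)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200
                ? Optional.of(response.body())
                : Optional.empty();
    }
}
```

A caller would pass `HttpClient.newHttpClient()` and hand the returned HTML to `Jsoup.parse`, exactly as the method above does after `EntityUtils.toString`.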

5. Scraping Taobao product details

public Product soupTaobaoDetailById(Long number) {
    // try-with-resources closes the client and the response, which the original version leaked
    try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
        // URL of the product page to scrape
        String url = "https://item.taobao.com/item.htm?id=" + number;
        HttpGet httpGet = new HttpGet(url);
        // Pretend to be a browser (the user-agent value can be copied from the
        // request headers your own browser sends)
        httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");
        try (CloseableHttpResponse response = httpclient.execute(httpGet)) {
            // Response status code
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Only a 200 response carries the HTML we are after
            if (statusCode == 200) {
                String html = EntityUtils.toString(entity, Consts.UTF_8);
                // Parse the whole page; print doc to see the source that was actually fetched
                Document doc = Jsoup.parse(html);
                // Inspect the page source in a browser to find the div that holds the
                // product information, then drill down step by step. The selectors
                // depend on the current page markup and may need adjusting.
                Element item = doc.select("div[class='tb-item-info-r']").get(0);
                Product product = new Product();
                product.setNumber(number);
                product.setPlatformId(2);
                String title = item.select("div[class='tb-title']").select("h3").text();
                product.setTitle(title);
                product.setUrl(UrlConst.TAOBAO_PRODUCT_DETAIL + number);
                System.out.println("Product title: " + title);
                return product;
            }
        } catch (Exception e) {
            // Reached when the request fails or the expected elements are missing
            e.printStackTrace();
            Product product = new Product();
            product.setId(0L);
            product.setTitle("Product not found");
            return product;
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Non-200 response or failure to create the client
    return null;
}

6. Tmall search

// Besides the HttpClient and Jsoup imports, this method needs java.util.List
// and org.jsoup.select.Elements.
public List<Product> soupTaobaoByKeyWord(String keyword) {
    try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
        // Demo keyword ("towel"); in real use, replace input with keyword
        String input = "毛巾";
        // URL of the search page to scrape
        String url = "https://list.tmall.com/search_product.htm?q=" + input;
        HttpGet httpGet = new HttpGet(url);
        // Pretend to be a browser (the user-agent value can be copied from the
        // request headers your own browser sends)
        httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");
        try (CloseableHttpResponse response = httpclient.execute(httpGet)) {
            // Response status code
            int statusCode = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Only a 200 response carries the HTML we are after
            if (statusCode == 200) {
                String html = EntityUtils.toString(entity, Consts.UTF_8);
                // Parse the whole page; print doc to see the source that was actually fetched
                Document doc = Jsoup.parse(html);
                // Inspect the page source in a browser to find the div that holds the
                // result list, then drill down step by step
                Elements ulList = doc.select("div[class='view grid-nosku']");
                Elements liList = ulList.select("div[class='product']");
                // Loop over the results (the exact selectors depend on the current
                // page markup and may need adjusting)
                for (Element item : liList) {
                    // Product ID
                    String id = item.select("div[class='product']").select("p[class='productStatus']").select("span[class='ww-light ww-small m_wangwang J_WangWang']").attr("data-item");
                    System.out.println("Product ID: " + id);
                    // Product name
                    String name = item.select("p[class='productTitle']").select("a").attr("title");
                    System.out.println("Product name: " + name);
                    // Product price
                    String price = item.select("p[class='productPrice']").select("em").attr("title");
                    System.out.println("Product price: " + price);
                    // Product URL
                    String goodsUrl = item.select("p[class='productTitle']").select("a").attr("href");
                    System.out.println("Product URL: " + goodsUrl);
                    // Product image URL
                    String imgUrl = item.select("div[class='productImg-wrap']").select("a").select("img").attr("data-ks-lazyload");
                    System.out.println("Product image URL: " + imgUrl);
                    System.out.println("------------------------------------");
                }
            }
            // Make sure the entity is fully consumed
            EntityUtils.consume(entity);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    // This demo only prints what it finds; it does not build a List<Product>
    return null;
}
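Once the hard-coded demo keyword is replaced by the keyword parameter, the value should be URL-encoded before it is appended to the query string, because keywords may contain spaces, & or non-ASCII characters. A minimal sketch, assuming UTF-8 encoding is what the search endpoint expects; `SearchUrlBuilder` and `tmallSearchUrl` are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds the Tmall search URL for a keyword. */
final class SearchUrlBuilder {
    static final String TMALL_SEARCH = "https://list.tmall.com/search_product.htm?q=";

    private SearchUrlBuilder() {
    }

    static String tmallSearchUrl(String keyword) {
        try {
            // Percent-encodes the keyword's UTF-8 bytes and turns spaces into '+'
            return TMALL_SEARCH + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }
}
```

For the demo keyword, `SearchUrlBuilder.tmallSearchUrl("\u6bdb\u5dfe")` yields a URL ending in `q=%E6%AF%9B%E5%B7%BE`, and a keyword such as "bath towel" becomes `q=bath+towel`.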

7. A project built with this crawling technique: zhi-de-ma (值得吗?, "Is it worth it?"), a website that records historical product prices. GitHub: https://github.com/xivinChen/zhi-de-ma
