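I. Eclipse + WebMagic configuration

An earlier article, "Configuring WebMagic in Eclipse", describes how to set up WebMagic in Eclipse and is offered for reference only. WebMagic can also be set up by adding it as a dependency.

II. Crawler steps

1. Because of anti-crawling mechanisms, the target server may block a crawler that does not disguise itself, so the crawler should present itself as a browser.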

private Site site = Site.me()
        .setRetryTimes(3)
        .setSleepTime(10000)
        .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");

2. Pick a JD.com phone-brand listing page, for example the Apple page.

public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

3. Set up the program entry point: run 6 threads and store the results as JSON files in a specified location.

public static void main(String[] args) {
    JDSpider jDS = new JDSpider();
    Spider spider = Spider.create(jDS);
    spider.addUrl(starts);
    // Directory where the JSON output is stored
    spider.addPipeline(new JsonFilePipeline("C:\\Users\\xxx\\Desktop\\test"));
    spider.thread(6);
    spider.run();
}

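The path passed to addPipeline is a folder you have created yourself to hold the output.

4. Some information on a phone page is dynamic, such as the price and the total number of positive reviews, and keeps changing. This data is therefore served from separate JSON URLs rather than embedded in the static page that is crawled, so a function is needed that fetches the JSON data behind a URL.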

// Fetch the JSON text served at a URL
public String loadJson(String url) {
    StringBuilder json = new StringBuilder();
    try {
        URL urlObject = new URL(url);
        URLConnection uc = urlObject.openConnection();
        // The response is decoded as GBK; the reader is closed even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                json.append(inputLine);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return json.toString();
}
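
The stream-reading half of loadJson can be tried without any network access. The following is a minimal sketch, assuming the response body is already available as an InputStream; the class name JsonText and the method readAll are hypothetical and not part of WebMagic or the original program.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, shown only to illustrate the decoding step of loadJson.
final class JsonText {

    // Reads the whole stream as text in the given charset and joins the lines
    // without separators, the same way loadJson appends each line.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Made-up sample body, encoded as GBK to mirror the charset used in loadJson.
        byte[] body = "[{\"p\":\"1.00\"}]".getBytes(gbk);
        System.out.println(readAll(new ByteArrayInputStream(body), gbk));
    }
}
```

Unlike loadJson, this sketch rethrows read failures instead of printing them, so a caller cannot mistake a failed read for an empty response.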

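5. For Apple phones (the same applies to other brands), a single listing page shows at most a fixed number of phones of that brand, but there are many such pages, so their URLs must follow some pattern. Find the pattern to obtain the URLs of all listing pages (each listing page contains multiple phones).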

for (int j = 0; j < 11; j++) {
    String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page="
            + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
    page.addTargetRequest(item);
}
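
To check the pattern in isolation, the loop can be pulled out into a small standalone program. This is only a sketch: the class name ListPageUrls and the method build are hypothetical, and the URL prefix is copied unchanged from the loop above. For index j the page parameter is 2j + 1 and the s parameter is 60j + 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces the listing-page URL pattern from the loop above.
final class ListPageUrls {

    // Prefix copied from the article, up to and including "page=".
    static final String PREFIX =
            "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"
            + "&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=";

    // Builds the first pageCount listing URLs: page = 2j + 1, s = 60j + 1.
    static List<String> build(int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int j = 0; j < pageCount; j++) {
            int pageParam = 2 * j + 1;
            int startParam = 60 * j + 1;
            urls.add(PREFIX + pageParam + "&s=" + startParam + "&click=0");
        }
        return urls;
    }

    public static void main(String[] args) {
        // Print the 11 URLs the crawler adds as target requests.
        for (String url : build(11)) {
            System.out.println(url);
        }
    }
}
```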

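6. Next, obtain the URLs of all individual phone pages by using XPath to find every phone link on the listing pages. First check that the current URL is a listing page:

if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"))

Then collect all the phones: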

// Collect the links to all phone pages
List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
for (int j = 0; j < items.size(); j++) {
    // The href values lack a scheme, so "https:" is prepended
    String itemsNew = "https:" + items.get(j);
    page.addTargetRequest(itemsNew);
}
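
The loop above prepends "https:" unconditionally, which assumes every href is protocol-relative (starts with "//"). A slightly more defensive variant is sketched below; the class name ItemLinks and the method toAbsolute are hypothetical, and whether the listing page ever returns other href forms is not something the article states.

```java
// Hypothetical helper: turns an href taken from the listing page into an absolute URL.
final class ItemLinks {

    static String toAbsolute(String href) {
        if (href.startsWith("//")) {
            // Protocol-relative link, the case the article's loop handles.
            return "https:" + href;
        }
        // Anything else (for example an href that already has a scheme) is returned unchanged.
        return href;
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("//item.jd.com/100014348492.html"));
        System.out.println(toAbsolute("https://item.jd.com/100014348492.html"));
    }
}
```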

7. Next, extract the phone attributes you need. The following gets the phone name and URL:

// Phone name
String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
page.putField("name", name);
// Phone ID, taken from the page URL and stored under the "url" field
String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
page.putField("url", id);
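
The two replace calls work as long as the URL has exactly the form https://item.jd.com/<id>.html. As an alternative technique (a regular expression rather than the article's string replacement), the sketch below extracts only the numeric ID and reports when a URL does not match. The class name SkuIds and the method extract are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the numeric ID out of an item page URL.
final class SkuIds {

    // Matches URLs that begin with https://item.jd.com/<digits>.html
    private static final Pattern ITEM_URL =
            Pattern.compile("^https://item\\.jd\\.com/(\\d+)\\.html");

    static Optional<String> extract(String url) {
        Matcher m = ITEM_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("https://item.jd.com/10026711061553.html"));
        System.out.println(extract("https://list.jd.com/list.html"));
    }
}
```

Returning an Optional makes the non-matching case explicit, so a listing URL cannot accidentally be turned into a malformed price or review request.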

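For dynamic data such as the price and the positive-review rate, the corresponding JSON data has to be fetched and parsed separately.
(1) Example price URL: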

https://p.3.cn/prices/mgets?skuIds=J_10026711061553

The pattern extracted from it:

String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;

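Here id is the ID obtained in the previous step. Write a JavaBean class that matches the structure of the price response and use it to parse the price. The JavaBean class is as follows: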

public class Price {
    private String op;
    private String m;
    private String cbf;
    private String id;
    private String p;

    public String getOp() { return op; }
    public void setOp(String op) { this.op = op; }

    public String getM() { return m; }
    public void setM(String m) { this.m = m; }

    public String getCbf() { return cbf; }
    public void setCbf(String cbf) { this.cbf = cbf; }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getP() { return p; }
    public void setP(String p) { this.p = p; }
}

Parsing and extracting the price JSON:

String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
page.putField("price", prices.get(0).getP());

(2) Example URL for the total number of positive reviews:

https://club.jd.com/comment/productCommentSummaries.action?referenceIds=10026711061553

JavaBean classes:

public class CommentsCount {
    private long SkuId;
    private long ProductId;
    private int ShowCount;
    private String ShowCountStr;
    private String CommentCountStr;
    private int CommentCount;
    private int AverageScore;
    private String DefaultGoodCountStr;
    private int DefaultGoodCount;
    private String GoodCountStr;
    private int GoodCount;
    private int AfterCount;
    private int OneYear;
    private String AfterCountStr;
    private int VideoCount;
    private String VideoCountStr;
    private double GoodRate;
    private int GoodRateShow;
    private int GoodRateStyle;
    private String GeneralCountStr;
    private int GeneralCount;
    private double GeneralRate;
    private int GeneralRateShow;
    private int GeneralRateStyle;
    private String PoorCountStr;
    private int PoorCount;
    private int SensitiveBook;
    private double PoorRate;
    private int PoorRateShow;
    private int PoorRateStyle;

    public void setSkuId(long SkuId) { this.SkuId = SkuId; }
    public long getSkuId() { return SkuId; }
    public void setProductId(long ProductId) { this.ProductId = ProductId; }
    public long getProductId() { return ProductId; }
    public void setShowCount(int ShowCount) { this.ShowCount = ShowCount; }
    public int getShowCount() { return ShowCount; }
    public void setShowCountStr(String ShowCountStr) { this.ShowCountStr = ShowCountStr; }
    public String getShowCountStr() { return ShowCountStr; }
    public void setCommentCountStr(String CommentCountStr) { this.CommentCountStr = CommentCountStr; }
    public String getCommentCountStr() { return CommentCountStr; }
    public void setCommentCount(int CommentCount) { this.CommentCount = CommentCount; }
    public int getCommentCount() { return CommentCount; }
    public void setAverageScore(int AverageScore) { this.AverageScore = AverageScore; }
    public int getAverageScore() { return AverageScore; }
    public void setDefaultGoodCountStr(String DefaultGoodCountStr) { this.DefaultGoodCountStr = DefaultGoodCountStr; }
    public String getDefaultGoodCountStr() { return DefaultGoodCountStr; }
    public void setDefaultGoodCount(int DefaultGoodCount) { this.DefaultGoodCount = DefaultGoodCount; }
    public int getDefaultGoodCount() { return DefaultGoodCount; }
    public void setGoodCountStr(String GoodCountStr) { this.GoodCountStr = GoodCountStr; }
    public String getGoodCountStr() { return GoodCountStr; }
    public void setGoodCount(int GoodCount) { this.GoodCount = GoodCount; }
    public int getGoodCount() { return GoodCount; }
    public void setAfterCount(int AfterCount) { this.AfterCount = AfterCount; }
    public int getAfterCount() { return AfterCount; }
    public void setOneYear(int OneYear) { this.OneYear = OneYear; }
    public int getOneYear() { return OneYear; }
    public void setAfterCountStr(String AfterCountStr) { this.AfterCountStr = AfterCountStr; }
    public String getAfterCountStr() { return AfterCountStr; }
    public void setVideoCount(int VideoCount) { this.VideoCount = VideoCount; }
    public int getVideoCount() { return VideoCount; }
    public void setVideoCountStr(String VideoCountStr) { this.VideoCountStr = VideoCountStr; }
    public String getVideoCountStr() { return VideoCountStr; }
    public void setGoodRate(double GoodRate) { this.GoodRate = GoodRate; }
    public double getGoodRate() { return GoodRate; }
    public void setGoodRateShow(int GoodRateShow) { this.GoodRateShow = GoodRateShow; }
    public int getGoodRateShow() { return GoodRateShow; }
    public void setGoodRateStyle(int GoodRateStyle) { this.GoodRateStyle = GoodRateStyle; }
    public int getGoodRateStyle() { return GoodRateStyle; }
    public void setGeneralCountStr(String GeneralCountStr) { this.GeneralCountStr = GeneralCountStr; }
    public String getGeneralCountStr() { return GeneralCountStr; }
    public void setGeneralCount(int GeneralCount) { this.GeneralCount = GeneralCount; }
    public int getGeneralCount() { return GeneralCount; }
    public void setGeneralRate(double GeneralRate) { this.GeneralRate = GeneralRate; }
    public double getGeneralRate() { return GeneralRate; }
    public void setGeneralRateShow(int GeneralRateShow) { this.GeneralRateShow = GeneralRateShow; }
    public int getGeneralRateShow() { return GeneralRateShow; }
    public void setGeneralRateStyle(int GeneralRateStyle) { this.GeneralRateStyle = GeneralRateStyle; }
    public int getGeneralRateStyle() { return GeneralRateStyle; }
    public void setPoorCountStr(String PoorCountStr) { this.PoorCountStr = PoorCountStr; }
    public String getPoorCountStr() { return PoorCountStr; }
    public void setPoorCount(int PoorCount) { this.PoorCount = PoorCount; }
    public int getPoorCount() { return PoorCount; }
    public void setSensitiveBook(int SensitiveBook) { this.SensitiveBook = SensitiveBook; }
    public int getSensitiveBook() { return SensitiveBook; }
    public void setPoorRate(double PoorRate) { this.PoorRate = PoorRate; }
    public double getPoorRate() { return PoorRate; }
    public void setPoorRateShow(int PoorRateShow) { this.PoorRateShow = PoorRateShow; }
    public int getPoorRateShow() { return PoorRateShow; }
    public void setPoorRateStyle(int PoorRateStyle) { this.PoorRateStyle = PoorRateStyle; }
    public int getPoorRateStyle() { return PoorRateStyle; }
}

import java.util.List;

public class JsonRootBean {
    private List<CommentsCount> CommentsCount;

    public void setCommentsCount(List<CommentsCount> CommentsCount) { this.CommentsCount = CommentsCount; }
    public List<CommentsCount> getCommentsCount() { return CommentsCount; }
}

Parsing and extracting the positive-review JSON:

String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
List<CommentsCount> count = jrt.getCommentsCount();
page.putField("goodRateShow", count.get(0).getGoodRateShow());
page.putField("comment", count.get(0).getCommentCountStr());

At this point a simple crawler is essentially complete, and all crawled data is stored in the folder you created.

III. Program source code

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class JDSpider implements PageProcessor {
    // Browser disguise
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(10000)
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");

    // Start page: the Apple phone listing
    public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

    public static void main(String[] args) {
        JDSpider jDS = new JDSpider();
        Spider spider = Spider.create(jDS);
        spider.addUrl(starts);
        // Directory where the JSON output is stored; 346 records in total
        spider.addPipeline(new JsonFilePipeline("C:\\Users\\zxf\\Desktop\\testnew"));
        spider.thread(6);
        spider.run();
    }

    // Fetch the JSON text served at a URL
    public String loadJson(String url) {
        StringBuilder json = new StringBuilder();
        try {
            URL urlObject = new URL(url);
            URLConnection uc = urlObject.openConnection();
            // The response is decoded as GBK; the reader is closed even if reading fails
            try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"))) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    json.append(inputLine);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return json.toString();
    }

    @Override
    public void process(Page page) {
        // List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
        // Queue all listing pages
        for (int j = 0; j < 11; j++) {
            String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page="
                    + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
            page.addTargetRequest(item);
        }

        String urlone = page.getRequest().getUrl();
        if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E")) {
            // Collect the links to all phone pages
            List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
            for (int j = 0; j < items.size(); j++) {
                String itemsNew = "https:" + items.get(j);
                page.addTargetRequest(itemsNew);
            }
            // Listing pages produce no output of their own, so skip them in the pipeline
            page.setSkip(true);
        }

        String urltwo = page.getRequest().getUrl();
        // Individual phone pages
        if (urltwo.startsWith("https://item.jd.com")) {
            // Product name
            String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
            page.putField("name", name);
            // Phone ID, taken from the page URL and stored under the "url" field
            String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
            page.putField("url", id);

            // Price URL: https://p.3.cn/prices/mgets?skuIds=J_10026711061553
            // Review URL: https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100014348492&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
            // https://club.jd.com/comment/productCommentSummaries.action?referenceIds=100014348492

            // Review summary URL
            String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
            JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
            List<CommentsCount> count = jrt.getCommentsCount();
            page.putField("goodRateShow", count.get(0).getGoodRateShow());
            page.putField("comment", count.get(0).getCommentCountStr());

            // Price
            String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
            List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
            page.putField("price", prices.get(0).getP());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}

The three JavaBean classes shown earlier are also required; their complete code is given above.
