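Web Crawler, Part 1: Scraping Phone Listings from JD.com
I. Configuring WebMagic in Eclipse
An earlier article, "Configuring WebMagic in Eclipse", explains how to set up WebMagic in Eclipse and is offered for reference only. WebMagic can also be set up by adding it as a project dependency.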
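II. Crawler steps
1. Because of the site's anti-crawling measures, a crawler that does not disguise itself will be blocked by the server, so the requests need to present a browser User-Agent. The configuration below also retries a failed request 3 times and waits 10000 ms between requests.
private Site site = Site.me()
        .setRetryTimes(3)
        .setSleepTime(10000)
        .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");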
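The Site setting covers the pages that WebMagic downloads itself. For a request opened by hand with the JDK's URLConnection, the same header could be attached as in the sketch below. The class name BrowserHeaders and the two timeout values are assumptions made for illustration; only the User-Agent string comes from the configuration above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of WebMagic.
final class BrowserHeaders {
    // Same User-Agent string as in the Site configuration above.
    static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36";

    // Prepares a connection that carries the browser User-Agent.
    // Nothing is sent over the network until the caller reads from it.
    static URLConnection open(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setRequestProperty("User-Agent", USER_AGENT);
            connection.setConnectTimeout(5_000); // assumed value, in milliseconds
            connection.setReadTimeout(10_000);   // assumed value, in milliseconds
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = open("https://p.3.cn/prices/mgets?skuIds=J_10026711061553");
        System.out.println(c.getRequestProperty("User-Agent"));
    }
}
```

Whether the JSON endpoints used later actually check this header has not been verified here; the sketch only shows where it would be set.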
2. Pick the JD list page for one phone brand, for example the Apple page.
public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";
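3. Set up the program entry point. It runs 6 threads and stores the results as JSON files in the specified location.
public static void main(String[] args) {
    JDSpider jDS = new JDSpider();
    Spider spider = Spider.create(jDS);
    spider.addUrl(starts);
    // Directory where the JSON data is written
    spider.addPipeline(new JsonFilePipeline("C:\\Users\\xxx\\Desktop\\test"));
    spider.thread(6);
    spider.run();
}
The path passed to JsonFilePipeline in addPipeline is a folder that you create yourself to hold the output.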
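4. Some of the information on a phone's page is dynamic, such as the price and the total number of positive reviews, and changes constantly. These values are therefore served from separate JSON endpoints rather than embedded in the static page that the crawler downloads, so we need a function that fetches the JSON text behind a given URL.
// Fetch the JSON text behind a URL
public String loadJson(String url) {
    StringBuilder json = new StringBuilder();
    try {
        URL urlObject = new URL(url);
        URLConnection uc = urlObject.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"));
        String inputLine = null;
        while ((inputLine = in.readLine()) != null) {
            json.append(inputLine);
        }
        in.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return json.toString();
}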
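In loadJson the reader is closed only when no exception occurs. As a minimal sketch of one alternative, the reading loop can be separated from the network call and wrapped in try-with-resources, which also makes it testable without a connection. The class and method names here are hypothetical, and the payload in main is made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that isolates the "read everything" part of loadJson.
final class StreamText {
    // Reads every line from the stream and concatenates them (line breaks are
    // dropped, as in loadJson). The reader is closed even if reading fails.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up payload; UTF-8 is used here only to keep the demo portable.
        byte[] body = "[{\"p\":\"1.00\"}]\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8));
    }
}
```

Inside loadJson this would be called as readAll(uc.getInputStream(), Charset.forName("GBK")), keeping the encoding the article uses.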
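5. For Apple phones (other brands work the same way), a single list page shows at most a fixed number of phones of that brand, and there are many such pages, so the page URLs must follow some pattern. Once the pattern is found, the URLs of all list pages (each listing several phones) can be generated. In the loop below, page takes the odd values 1, 3, ..., 21 and s advances by 60 per page.
for (int j = 0; j < 11; j++) {
    String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
    page.addTargetRequest(item);
}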
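The URL pattern in the loop above can be pulled into a small pure function, which makes the arithmetic easy to check in isolation. This is only a sketch: the class name ListPageUrls and its method names are hypothetical, while the URL parts are copied from the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces the list-page URL pattern.
final class ListPageUrls {
    // Everything before the page parameter, copied from the loop above.
    static final String PREFIX =
            "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"
            + "&pvid=17eff9be0146499b85c1e66a0f0e7ea9";

    // URL of the j-th list page (j starts at 0): page = 2j + 1, s = 60j + 1.
    static String build(int j) {
        return PREFIX + "&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
    }

    // URLs of the first pageCount list pages.
    static List<String> buildAll(int pageCount) {
        List<String> urls = new ArrayList<>();
        for (int j = 0; j < pageCount; j++) {
            urls.add(build(j));
        }
        return urls;
    }

    public static void main(String[] args) {
        buildAll(11).forEach(System.out::println);
    }
}
```

In process, each string returned by buildAll(11) would then be passed to page.addTargetRequest.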
6. Next, collect the URLs of all individual phone pages by using XPath to find every phone link on each list page. First check that the URL of the current page (urlone) is one of the list pages:
if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E"))
Then collect all the phones:
// Collect all phone detail pages
List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
for (int j = 0; j < items.size(); j++) {
    String itemsNew = "https:" + items.get(j);
    page.addTargetRequest(itemsNew);
}
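The loop above prepends "https:" to every href, which assumes each one is protocol-relative (starting with "//"). A slightly more defensive version might look like the sketch below. The class name ItemLinks is hypothetical, and the decision to drop anything that is not an absolute or protocol-relative link is an assumption, not something the article specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for turning extracted href values into absolute URLs.
final class ItemLinks {
    static List<String> toAbsolute(List<String> hrefs) {
        List<String> urls = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue; // skip missing or blank attributes
            }
            String h = href.trim();
            if (h.startsWith("//")) {
                urls.add("https:" + h); // protocol-relative link
            } else if (h.startsWith("http://") || h.startsWith("https://")) {
                urls.add(h); // already absolute
            }
            // Anything else (relative paths, javascript: links) is ignored in this sketch.
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("//item.jd.com/10026711061553.html", " ", null);
        System.out.println(toAbsolute(hrefs));
    }
}
```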
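7. Next, extract the phone attributes that are needed. The code below gets the phone name and the id taken from the page URL (urltwo is the URL of the current page):
// Phone name
String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
page.putField("name", name);
// Extract the phone id from its URL (stored under the "url" field)
String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
page.putField("url", id);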
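The two replace calls above work when the URL is exactly https://item.jd.com/<id>.html; a query string or fragment would end up inside the id. As a sketch of a stricter alternative, a regular expression can capture only the digits. The class name SkuIds and the assumption that ids are purely numeric (as in the article's examples) are mine.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that extracts the numeric id from an item page URL.
final class SkuIds {
    // Assumes ids consist of digits only, as in the article's example URLs.
    private static final Pattern ITEM_URL =
            Pattern.compile("^https://item\\.jd\\.com/(\\d+)\\.html");

    // Returns the id, or an empty Optional if the URL is not an item page.
    static Optional<String> fromItemUrl(String url) {
        Matcher m = ITEM_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromItemUrl("https://item.jd.com/10026711061553.html"));
        System.out.println(fromItemUrl("https://list.jd.com/list.html?cat=9987%2C653%2C655"));
    }
}
```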
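For dynamic data such as the price and the positive-review rate, the corresponding JSON has to be fetched and parsed.
(1) Example price URL:
https://p.3.cn/prices/mgets?skuIds=J_10026711061553
The pattern:
String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
Here id is the id obtained in the previous step. A JavaBean class is written to match the structure of the price response and is used to parse it. The JavaBean class is as follows: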
public class Price {
    private String op;
    private String m;
    private String cbf;
    private String id;
    private String p;

    public String getOp() { return op; }
    public void setOp(String op) { this.op = op; }
    public String getM() { return m; }
    public void setM(String m) { this.m = m; }
    public String getCbf() { return cbf; }
    public void setCbf(String cbf) { this.cbf = cbf; }
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getP() { return p; }
    public void setP(String p) { this.p = p; }
}
Fetching and parsing the price JSON:
String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
page.putField("price", prices.get(0).getP());
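To make it clearer which part of the response ends up in the "price" field, here is a JDK-only sketch that pulls the p value out of a payload with a regular expression. It is not a JSON parser and is not a replacement for the fastjson call above. The class name PriceSketch is hypothetical, and the sample payload is made up: only its field names (taken from the Price bean) reflect the article.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of which field the crawler reads from the price response.
final class PriceSketch {
    // Matches the string value of the "p" key. The key "op" does not match,
    // because the pattern requires a quote directly before the letter p.
    private static final Pattern P_FIELD = Pattern.compile("\"p\"\\s*:\\s*\"([^\"]*)\"");

    // Builds the price URL for an id, following the pattern shown above.
    static String priceUrl(String id) {
        return "https://p.3.cn/prices/mgets?skuIds=J_" + id;
    }

    // Returns the first "p" value in the payload, if there is one.
    static Optional<String> firstPrice(String json) {
        Matcher m = P_FIELD.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up values for illustration only; not captured from the real endpoint.
        String sample = "[{\"cbf\":\"0\",\"id\":\"J_10026711061553\","
                + "\"m\":\"3.00\",\"op\":\"2.00\",\"p\":\"1.00\"}]";
        System.out.println(priceUrl("10026711061553"));
        System.out.println(firstPrice(sample).orElse("no price found"));
    }
}
```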
(2) Example URL for the total number of positive reviews:
https://club.jd.com/comment/productCommentSummaries.action?referenceIds=10026711061553
JavaBean classes:
public class CommentsCount {
    private long SkuId;
    private long ProductId;
    private int ShowCount;
    private String ShowCountStr;
    private String CommentCountStr;
    private int CommentCount;
    private int AverageScore;
    private String DefaultGoodCountStr;
    private int DefaultGoodCount;
    private String GoodCountStr;
    private int GoodCount;
    private int AfterCount;
    private int OneYear;
    private String AfterCountStr;
    private int VideoCount;
    private String VideoCountStr;
    private double GoodRate;
    private int GoodRateShow;
    private int GoodRateStyle;
    private String GeneralCountStr;
    private int GeneralCount;
    private double GeneralRate;
    private int GeneralRateShow;
    private int GeneralRateStyle;
    private String PoorCountStr;
    private int PoorCount;
    private int SensitiveBook;
    private double PoorRate;
    private int PoorRateShow;
    private int PoorRateStyle;

    public void setSkuId(long SkuId) { this.SkuId = SkuId; }
    public long getSkuId() { return SkuId; }
    public void setProductId(long ProductId) { this.ProductId = ProductId; }
    public long getProductId() { return ProductId; }
    public void setShowCount(int ShowCount) { this.ShowCount = ShowCount; }
    public int getShowCount() { return ShowCount; }
    public void setShowCountStr(String ShowCountStr) { this.ShowCountStr = ShowCountStr; }
    public String getShowCountStr() { return ShowCountStr; }
    public void setCommentCountStr(String CommentCountStr) { this.CommentCountStr = CommentCountStr; }
    public String getCommentCountStr() { return CommentCountStr; }
    public void setCommentCount(int CommentCount) { this.CommentCount = CommentCount; }
    public int getCommentCount() { return CommentCount; }
    public void setAverageScore(int AverageScore) { this.AverageScore = AverageScore; }
    public int getAverageScore() { return AverageScore; }
    public void setDefaultGoodCountStr(String DefaultGoodCountStr) { this.DefaultGoodCountStr = DefaultGoodCountStr; }
    public String getDefaultGoodCountStr() { return DefaultGoodCountStr; }
    public void setDefaultGoodCount(int DefaultGoodCount) { this.DefaultGoodCount = DefaultGoodCount; }
    public int getDefaultGoodCount() { return DefaultGoodCount; }
    public void setGoodCountStr(String GoodCountStr) { this.GoodCountStr = GoodCountStr; }
    public String getGoodCountStr() { return GoodCountStr; }
    public void setGoodCount(int GoodCount) { this.GoodCount = GoodCount; }
    public int getGoodCount() { return GoodCount; }
    public void setAfterCount(int AfterCount) { this.AfterCount = AfterCount; }
    public int getAfterCount() { return AfterCount; }
    public void setOneYear(int OneYear) { this.OneYear = OneYear; }
    public int getOneYear() { return OneYear; }
    public void setAfterCountStr(String AfterCountStr) { this.AfterCountStr = AfterCountStr; }
    public String getAfterCountStr() { return AfterCountStr; }
    public void setVideoCount(int VideoCount) { this.VideoCount = VideoCount; }
    public int getVideoCount() { return VideoCount; }
    public void setVideoCountStr(String VideoCountStr) { this.VideoCountStr = VideoCountStr; }
    public String getVideoCountStr() { return VideoCountStr; }
    public void setGoodRate(double GoodRate) { this.GoodRate = GoodRate; }
    public double getGoodRate() { return GoodRate; }
    public void setGoodRateShow(int GoodRateShow) { this.GoodRateShow = GoodRateShow; }
    public int getGoodRateShow() { return GoodRateShow; }
    public void setGoodRateStyle(int GoodRateStyle) { this.GoodRateStyle = GoodRateStyle; }
    public int getGoodRateStyle() { return GoodRateStyle; }
    public void setGeneralCountStr(String GeneralCountStr) { this.GeneralCountStr = GeneralCountStr; }
    public String getGeneralCountStr() { return GeneralCountStr; }
    public void setGeneralCount(int GeneralCount) { this.GeneralCount = GeneralCount; }
    public int getGeneralCount() { return GeneralCount; }
    public void setGeneralRate(double GeneralRate) { this.GeneralRate = GeneralRate; }
    public double getGeneralRate() { return GeneralRate; }
    public void setGeneralRateShow(int GeneralRateShow) { this.GeneralRateShow = GeneralRateShow; }
    public int getGeneralRateShow() { return GeneralRateShow; }
    public void setGeneralRateStyle(int GeneralRateStyle) { this.GeneralRateStyle = GeneralRateStyle; }
    public int getGeneralRateStyle() { return GeneralRateStyle; }
    public void setPoorCountStr(String PoorCountStr) { this.PoorCountStr = PoorCountStr; }
    public String getPoorCountStr() { return PoorCountStr; }
    public void setPoorCount(int PoorCount) { this.PoorCount = PoorCount; }
    public int getPoorCount() { return PoorCount; }
    public void setSensitiveBook(int SensitiveBook) { this.SensitiveBook = SensitiveBook; }
    public int getSensitiveBook() { return SensitiveBook; }
    public void setPoorRate(double PoorRate) { this.PoorRate = PoorRate; }
    public double getPoorRate() { return PoorRate; }
    public void setPoorRateShow(int PoorRateShow) { this.PoorRateShow = PoorRateShow; }
    public int getPoorRateShow() { return PoorRateShow; }
    public void setPoorRateStyle(int PoorRateStyle) { this.PoorRateStyle = PoorRateStyle; }
    public int getPoorRateStyle() { return PoorRateStyle; }
}
import java.util.List;

public class JsonRootBean {
    private List<CommentsCount> CommentsCount;

    public void setCommentsCount(List<CommentsCount> CommentsCount) { this.CommentsCount = CommentsCount; }
    public List<CommentsCount> getCommentsCount() { return CommentsCount; }
}
Fetching and parsing the review-summary JSON:
String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
List<CommentsCount> count = jrt.getCommentsCount();
page.putField("goodRateShow", count.get(0).getGoodRateShow());
page.putField("comment", count.get(0).getCommentCountStr());
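Both count.get(0) here and prices.get(0) earlier throw an exception if the endpoint returns an empty or missing list, for instance when a request is rejected. One possible guard, sketched with a hypothetical helper named FirstElement, is to wrap the lookup in an Optional:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for safely taking the first element of a parsed list.
final class FirstElement {
    // Returns the first element, or an empty Optional when the list is
    // null, empty, or starts with a null entry.
    static <T> Optional<T> of(List<T> items) {
        if (items == null || items.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(0));
    }

    public static void main(String[] args) {
        System.out.println(of(Arrays.asList("first", "second")).isPresent());
        System.out.println(of(Collections.<String>emptyList()).isPresent());
    }
}
```

In the crawler this could be used as FirstElement.of(count).ifPresent(c -> page.putField("comment", c.getCommentCountStr())), so a phone with no summary is stored without that field instead of aborting the page.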
At this point a simple crawler is essentially complete, and all of the scraped data is stored in the folder you created.
III. Full source code
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class JDSpider implements PageProcessor {
    // Browser disguise
    private Site site = Site.me()
            .setRetryTimes(3)
            .setSleepTime(10000)
            .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36");

    // Start page for Apple phones
    public static final String starts = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&cid3=655";

    public static void main(String[] args) {
        JDSpider jDS = new JDSpider();
        Spider spider = Spider.create(jDS);
        spider.addUrl(starts);
        // Directory where the JSON data is written; 346 records in total
        spider.addPipeline(new JsonFilePipeline("C:\\Users\\zxf\\Desktop\\testnew"));
        spider.thread(6);
        spider.run();
    }

    // Fetch the JSON text behind a URL
    public String loadJson(String url) {
        StringBuilder json = new StringBuilder();
        try {
            URL urlObject = new URL(url);
            URLConnection uc = urlObject.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream(), "GBK"));
            String inputLine = null;
            while ((inputLine = in.readLine()) != null) {
                json.append(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return json.toString();
    }

    @Override
    public void process(Page page) {
        // List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
        // Queue every list page
        for (int j = 0; j < 11; j++) {
            String item = "https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E&pvid=17eff9be0146499b85c1e66a0f0e7ea9&page=" + (2 * j + 1) + "&s=" + (60 * j + 1) + "&click=0";
            page.addTargetRequest(item);
        }
        String urlone = page.getRequest().getUrl();
        if (urlone.startsWith("https://list.jd.com/list.html?cat=9987%2C653%2C655&ev=exbrand_Apple%5E")) {
            // Collect all phone detail pages
            List<String> items = page.getHtml().xpath("//div[@class='p-name p-name-type-3']/a/@href").all();
            for (int j = 0; j < items.size(); j++) {
                String itemsNew = "https:" + items.get(j);
                page.addTargetRequest(itemsNew);
            }
            // List pages produce no output of their own, so skip them in the pipeline
            page.setSkip(true);
        }
        String urltwo = page.getRequest().getUrl();
        // Detail pages: extract the phone id and attributes
        if (urltwo.startsWith("https://item.jd.com")) {
            // Product name
            String name = page.getHtml().xpath("//div[@class='product-intro clearfix']" + "//div[@class='sku-name']/text()").get();
            page.putField("name", name);
            // Extract the phone id from its URL (stored under the "url" field)
            String id = urltwo.replace("https://item.jd.com/", "").replace(".html", "");
            page.putField("url", id);
            // Price URL: https://p.3.cn/prices/mgets?skuIds=J_10026711061553
            // Review URL: https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100014348492&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1
            // https://club.jd.com/comment/productCommentSummaries.action?referenceIds=100014348492
            // Review summary URL
            String pJID = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds=" + id;
            JsonRootBean jrt = JSONObject.parseObject(loadJson(pJID), JsonRootBean.class);
            List<CommentsCount> count = jrt.getCommentsCount();
            page.putField("goodRateShow", count.get(0).getGoodRateShow());
            page.putField("comment", count.get(0).getCommentCountStr());
            String pr = "https://p.3.cn/prices/mgets?skuIds=J_" + id;
            List<Price> prices = JSONArray.parseArray(loadJson(pr), Price.class);
            page.putField("price", prices.get(0).getP());
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}
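The three JavaBean classes shown above (Price, CommentsCount, and JsonRootBean) are also required; their code is given in full in section II.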