Python 用 request+lxml 爬取某东页面商品信息

先说一下结论，采用最尾端的代码，替换掉关键词即可，输出成excel文件，获取信息为标题，价格与链接

详细记录一下程序过程吧。
首先是url的构建，最近比较关心行车记录仪，因此就以此进行关键词搜索吧，输入后url栏显示为：

https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&wq=行车记录仪&pvid=1ca…

…为个人信息，每人不一样的，
然后看看哪些是不需要的信息，经砍，发现
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8
即可，构造翻页并删除不必要信息，记录url：
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=3&s=56
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=5&s=112
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=7&s=167
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=9&s= 223
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=11&s= 278
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=13&s= 333
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=15&s= 388
https://search.jd.com/Search?keyword=行车记录仪&enc=utf-8&page=17&s= 443
主要就是page参数与s参数，其中page参数采用了2n+1，而s是一个无规律的数字。
因此，还需要特殊构建一下。
先放上代码：cookie 换成自己的哦~

import requests
import csv
import time
import random
import traceback
from lxml import etreedef url_list_l1(start_url, depth, s_list):u_list = [start_url +'&page={}&s={}'.format(2 * i + 1, j)for i, j in enumerate(s_list)]return u_list[:depth]def get_text(url):hds = {'cookie': '...','user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}try:r = requests.get(url, headers=hds, timeout=30)r.raise_for_status()r.encoding = 'utf-8'return r.textexcept BaseException:traceback.print_exc()return ''def get_page_information(url_list, info_list):with open('information.csv', 'a', newline='', encoding='utf-8') as file:content = csv.writer(file)content.writerow(['Description', 'Price', 'URL'])for url in url_list:time.sleep(random.uniform(1.0, 2.0))html = get_text(url)if html == '':continueelse:response = etree.HTML(html)brands = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/em/text()[1]')titles = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/em/text()[2]')prices = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[3]/strong/i')urls = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/@href')for i in range(len(urls)):if urls[i][:5] == 'https':continueelse:urls[i] = 'https:' + urls[i]# print(urls)for brand, title, price, url in zip(titles, brands, prices, urls):# ls = []price = etree.tostring(price, encoding='utf-8', method='text').decode()# print(title, brand, price, url)info_list = [title+brand , price, url]with open('information.csv', 'a', newline='', encoding='utf-8') as file:content2 = csv.writer(file)content2.writerow(info_list)url_1 = 'https://search.jd.com/Search?keyword=%E8%A1%8C%E8%BD%A6%E8%AE%B0%E5%BD%95%E4%BB%AA&enc=utf-8'
depth = 8
special_list = [1, 56, 112, 167, 223, 278, 333, 388, 433]
url_list_1 = url_list_l1(url_1, depth, special_list)
# print(url_list_1)
info_list = []
get_page_information(url_list_1, info_list)

这里有几个地方需要注意：
（1）xpath的语法与构建，当需要获得文本时，请参考price，需要用etree.tostring().decode()方法进行解析，或者表达式上加入’/text()’

 prices = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[3]/strong/i/text()')

此即可取代上述一系列的解码操作，如果text()分成两段，可以以索引得到

（2）当需要xpath属性是，参考url 方法，在xpath表达式后加上/@href

（3）写入csv文件的标准写法，先创建csv对象：

content2 = csv.writer(file)

然后再对该对象进行操作，以列表形式写入：

content2.writerow(info_list)

最后，文件中保存的UTF-8编码的csv文件如果想用excel打开会形成乱码，将其另存为ANSI编码格式，就可以很顺利的打开了，效果如下，还真是不便宜，嘿嘿：

更新一下代码：将price的xpath改掉，并优化一部分代码使其更简洁，以及一不做二不休，把文件直接导入到xls中：

import requests
import time
import random
import traceback
from lxml import etree
import xlwtdef url_list_l1(start_url, depth):u_list = [start_url + '&page={}&s={}'.format(2 * i + 1, 55 * i + 1) for i in range(depth)]return u_listdef get_text(url):hds = {'cookie': '。。。'}try:r = requests.get(url, headers=hds, timeout=30)r.raise_for_status()r.encoding = 'utf-8'return r.textexcept BaseException:traceback.print_exc()return ''def get_page_information(url_list):workbook = xlwt.Workbook()sheet1 = workbook.add_sheet('sales_information', cell_overwrite_ok=True)row_0 = ['Description', 'Price', 'URL']for i in range(len(row_0)):sheet1.write(0,i,row_0[i])count = 1for url in url_list:time.sleep(random.uniform(1.0, 2.0))html = get_text(url)if html == '':continueelse:response = etree.HTML(html)brands = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/em/text()[1]')titles = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/em/text()[2]')prices = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[3]/strong/i/text()')urls = response.xpath('//*[@id="J_goodsList"]/ul/li/div/div[4]/a/@href')for i in range(len(urls)):if urls[i][:5] == 'https':continueelse:urls[i] = 'https:' + urls[i]for brand, title, price, url in zip(titles, brands, prices, urls):in_list = [title+brand , price, url]for j in range(len(in_list)):sheet1.write(count,j,in_list[j])count += 1workbook.save('Sales_information.xls')if __name__ == '__main__':url_1 = 'https://search.jd.com/Search?keyword=%E8%A1%8C%E8%BD%A6%E8%AE%B0%E5%BD%95%E4%BB%AA&enc=utf-8'depth = 8url_list_1 = url_list_l1(url_1, depth)info_list = []get_page_information(url_list_1)

Python 用 request+lxml 爬取某东页面商品信息相关推荐

python爬虫 request+lxml爬取黄页88网企业信息
黄页88网: 简称黄页网或者黄页88,是由互联网资深人士创办于2009年11月.是一家整合企业黄页.分类信息以及时下流行的SNS社区三方面优势于一体定位于服务B2B平台的网站.主要帮助企业宣传推广公司 ...
python使用requests库爬取淘宝指定商品信息
python使用requests库爬取淘宝指定商品信息在搜索栏中输入商品通过F12开发者工具抓包我们知道了商品信息的API,同时发现了商品数据都以json字符串的形式存储在返回的html内解析u ...
网络爬虫爬取淘宝页面商品信息
网络爬虫爬取淘宝页面商品信息最近在MOOC上看嵩老师的网络爬虫课程,按照老师的写法并不能进行爬取,遇到了一个问题,就是关于如何"绕开"淘宝登录界面,正确的爬取相关信息.通过百度找 ...
Python爬虫入门 | 5 爬取小猪短租租房信息
小猪短租是一个租房网站,上面有很多优质的民宿出租信息,下面我们以成都地区的租房信息为例,来尝试爬取这些数据. 小猪短租(成都)页面:http://cd.xiaozhu.com/ 1.爬取租房标题 ...
python抓取文献关键信息,python爬虫——使用selenium爬取知网文献相关信息
python爬虫--使用selenium爬取知网文献相关信息写在前面: 本文章限于交流讨论,请不要使用文章的代码去攻击别人的服务器如侵权联系作者删除文中的错误已经修改过来了,谢谢各位爬友指出错误 ...
python爬虫爬取当当网的商品信息
python爬虫爬取当当网的商品信息一.环境搭建二.简介三.当当网网页分析 1.分析网页的url规律 2.解析网页html页面书籍商品html页面解析其他商品html页面解析四.代码实现 ...
用Python爬取淘宝网商品信息
用Python爬取淘宝网商品信息转载请注明出处网购时经常会用到淘宝网点我去淘宝但淘宝网上的商品琳琅满目,于是我参照中国大学 MOOC的代码写了一个爬取淘宝网商品信息的程序代码如下: impor ...
python-如何爬取天猫店铺的商品信息
** python-如何爬取天猫店铺的商品信息 ** 1.本文使用的是python-scrapy 爬取天猫博库图书专营店的数据,登录天猫获取登录之后的cookie 通过下面两幅图片elements与 ...
Scrapy爬取当当网的商品信息存到MySQL数据库
Scrapy爬取当当网的商品信息存到MySQL数据库 Scrapy 是一款十分强大的爬虫框架,能够快速简单地爬取网页,存到你想要的位置.经过两天的摸索,终于搞定了一个小任务,将当当网的商品信息爬下来存 ...

Python 用 request+lxml 爬取某东页面商品信息

Python 用 request+lxml 爬取某东页面商品信息相关推荐

最新文章

热门文章