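Scraping 1688 Suppliers with Selenium

To save our purchasing colleague from paging through listings one by one, I scraped the products we were looking for from 1688. This post uses a single search keyword as an example. The code and the full data structure are uploaded at https://github.com/jevy146/selenium_1688/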

# -*- coding: utf-8 -*-
# @Time    : 2020/6/18 9:31
# @Author  : 结尾！！
# @FileName: D01-抓取首页信息.py
# @Software: PyCharm

import csv
import time

from fake_useragent import UserAgent
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


# Step 1: start Chrome and log in to 1688
class Chrome_drive():
    def __init__(self):
        ua = UserAgent()  # only used by the commented-out user-agent option below
        option = ChromeOptions()
        option.add_experimental_option('excludeSwitches', ['enable-automation'])
        option.add_experimental_option('useAutomationExtension', False)
        NoImage = {"profile.managed_default_content_settings.images": 2}  # do not load images
        option.add_experimental_option("prefs", NoImage)
        # option.add_argument(f'user-agent={ua.chrome}')  # set a browser user-agent header
        # chrome_options.add_argument(f"--proxy-server=http://{self.ip}")  # route through a proxy IP
        # option.add_argument('--headless')  # headless mode: no browser window
        self.browser = webdriver.Chrome(options=option)
        # Hide the navigator.webdriver flag that Selenium sets
        self.browser.execute_cdp_cmd(
            'Page.addScriptToEvaluateOnNewDocument',
            {'source': 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'})
        self.browser.set_window_size(1200, 768)
        self.wait = WebDriverWait(self.browser, 12)

    def get_login(self):
        url = 'https://www.1688.com/'
        self.browser.get(url)
        # self.browser.maximize_window()  # log in here with a mainland China postal code
        # Log in manually here; lengthen the sleep if you need more time.
        time.sleep(2)
        self.browser.refresh()  # refresh the page
        return

    # Fetch the text content of one results page
    def index_page(self, page, wd):
        """
        Scrape one search-results (index) page.
        :param page: page number
        :param wd: search keyword, written to the first CSV column
        """
        print('Scraping page', page)
        # Note: the keyword in this URL is hard-coded in percent-encoded form; wd is not used to build it.
        url = f'https://s.1688.com/selloffer/offer_search.htm?keywords=%D0%A1%D0%CD%C3%AB%BD%ED%BC%D3%C8%C8%B9%F1&n=y&netType=16&beginPage={page}#sm-filtbar'
        js1 = f" window.open('{url}')"  # JavaScript that opens a new tab
        print(url)
        self.browser.execute_script(js1)  # open the results page in a new tab
        self.browser.switch_to.window(self.browser.window_handles[-1])  # switch to the newly opened tab
        self.buffer()  # scroll the page after switching
        # Wait for the pagination input to load
        time.sleep(3)
        self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#render-engine-page-container > div > div.common-pagination > div > div > div > span:nth-child(2) > input')))
        # Get the page source
        html = self.browser.page_source
        get_products(wd, html)
        self.close_window()

    def buffer(self):
        """Scroll down step by step so lazily loaded items render."""
        for i in range(20):
            time.sleep(0.5)
            self.browser.execute_script('window.scrollBy(0,380)', '')  # scroll down 380 pixels

    def close_window(self):
        length = self.browser.window_handles
        print('length', length)  # the currently open window handles
        if len(length) > 3:
            self.browser.switch_to.window(self.browser.window_handles[1])
            self.browser.close()
            time.sleep(1)
            self.browser.switch_to.window(self.browser.window_handles[-1])


def save_csv(row):
    """Append one row to the output CSV."""
    with open("./1688_com.csv", 'a', newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(row)


# Parse the page
def get_products(wd, html_text):
    """
    Extract the product data.
    """
    select = Selector(text=html_text)
    # roughly 47 items
    items = select.xpath('//*[@id="sm-offer-list"]/div/*').extract()
    print('Number of products', len(items))
    for i in range(1, len(items) + 1):
        base = f'//*[@id="sm-offer-list"]/div[{i}]'
        desc = base + '//div[@class="desc-container"]'
        # detail page link
        desc_href = select.xpath(base + '//div[@class="img-container"]//a/@href').extract_first()
        # image link (taken from the style attribute)
        img_url = select.xpath(base + '//div[@class="img"]/@style').extract_first()
        # repurchase rate
        shop_repurchase_rate = select.xpath(desc + '//span[@class="shop-repurchase-rate"]/text()').extract_first()
        # title
        title = select.xpath(desc + '//div[@class="title"]//text()').extract()
        title_name = ''.join(title)
        # price
        price = select.xpath(desc + '//div[@class="price-container"]/div[@class="price"]/text()').extract_first()
        # sales volume
        sales_num = select.xpath(desc + '//div[@class="price-container"]/div[@class="sale"]/text()').extract_first()
        # company name
        company_name = select.xpath(base + '//div[@class="company-name"]/a/text()').extract_first()
        # company link
        company_href = select.xpath(base + '//div[@class="company-name"]/a/@href').extract_first()
        # company tag
        company_tag = select.xpath(base + '//div[@class="common-company-tag"]//text()').extract_first()
        all_desc = [wd, title_name, img_url, desc_href, price, sales_num, company_name, company_href, company_tag, shop_repurchase_rate]
        print(all_desc)
        save_csv(all_desc)


def main():
    """
    Loop over every results page.
    """
    run = Chrome_drive()
    run.get_login()  # scan the QR code to log in first
    wd = ['小型毛巾加热柜']  # search keyword: small towel warmer cabinet
    for w in wd:
        for i in range(1, 6):  # 1688 showed 6 pages in total; the first 5 are scraped
            run.index_page(i, w)


if __name__ == '__main__':
    csv_head = 'word,title_name,img_url,desc_href,price,sales_num,company_name,company_href,company_tag,shop_repurchase_rate'.split(',')
    save_csv(csv_head)
    main()
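The search URL in index_page carries the keyword as a fixed percent-encoded string, so the wd argument never reaches the URL. A minimal sketch of building the URL from the keyword instead, assuming (judging from the encoded bytes in that URL) that the keyword is percent-encoded as GBK. build_search_url is a hypothetical helper, not part of the original script:

```python
from urllib.parse import quote


def build_search_url(keyword: str, page: int) -> str:
    """Return a 1688 search URL for `keyword` and `page` (hypothetical helper)."""
    # Assumption: the site expects the keyword percent-encoded as GBK,
    # which is what the hard-coded URL in index_page appears to use.
    encoded = quote(keyword, encoding='gbk')
    return ('https://s.1688.com/selloffer/offer_search.htm'
            f'?keywords={encoded}&n=y&netType=16&beginPage={page}#sm-filtbar')


print(build_search_url('小型毛巾加热柜', 1))
```

With a helper like this, index_page could call build_search_url(wd, page), which would let the wd list in main hold more than one keyword.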

Run process

3. The run finished after scraping 5 pages, 60 items per page, with nothing missed.
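One way to check the row count is to tally the rows in the output CSV per keyword. This is a rough sketch, assuming the file keeps the header written by the script, with the keyword in the first column named word. count_rows_per_keyword is a hypothetical helper. Because the script appends a header on every run, repeated header rows are skipped:

```python
import csv
from collections import Counter


def count_rows_per_keyword(csv_path: str) -> Counter:
    """Count data rows per search keyword (the 'word' column)."""
    counts = Counter()
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            # Skip header lines appended again by later runs of the script.
            if row['word'] == 'word':
                continue
            counts[row['word']] += 1
    return counts
```

If all 5 pages of 60 items were saved by a single run, count_rows_per_keyword('./1688_com.csv') would be expected to report 300 rows for the keyword.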
