Python 3 crawler example: Python 3 web crawler, example 1

scrapy

pip install scrapy

pip install pyOpenSSL

pip install cryptography

pip install CFFI

pip install lxml

pip install cssselect

pip install Twisted

Create the crawler project

scrapy startproject ZhipinSpider

Generate the spider (run this inside the project directory)

scrapy genspider job_position "zhipin.com"


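Directory structure:

items.py: defines the Item class, that is, the fields to be scraped

pipelines.py: processes the scraped content

settings.py: configuration file

Debug the data first

Make Scrapy pretend to be a browser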
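While debugging, the user agent can be overridden on the command line with the -s option, for example scrapy shell -s USER_AGENT="Mozilla/5.0" followed by the page URL. For the project itself the same idea can go into settings.py. The fragment below is a minimal sketch: USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings, but the header values are assumptions chosen for illustration and do not come from the original project.

```python
# settings.py fragment (illustrative values, not from the original project)

# Send a browser-like User-Agent instead of Scrapy's default one.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Extra headers attached to every request; a browser normally sends these too.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
```

A site may still reject the crawler for other reasons (cookies, rate limits, JavaScript checks), so a browser-like header is a starting point rather than a guarantee.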

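XPath syntax

/ matches the root node

// matches nodes anywhere in the document

. the current node

.. the parent node

@ selects an attribute

//div[@title="xxx"]/div

extract() extracts the content of the matched nodes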
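The expressions above can be tried without a running crawl. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in, which implements only a subset of XPath (paths must be relative, so they start with a dot, and attributes are read with .get() instead of @). The markup is invented for this illustration. In Scrapy itself the equivalent calls are response.xpath(...).extract(), which returns a list of strings, and extract_first(), which returns the first match or None.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this illustration.
DOC = """
<html>
  <body>
    <div title="xxx">
      <div>first child</div>
      <div>second child</div>
    </div>
    <div title="other">
      <div>ignored</div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# // : search at any depth; [@title="xxx"] : filter on an attribute;
# /div : then take the direct <div> children of the matched node.
children = root.findall('.//div[@title="xxx"]/div')
texts = [node.text for node in children]
print(texts)

# .. : step back up from the matched children to their parent node.
parent = root.find('.//div[@title="xxx"]/div/..')
print(parent.get("title"))
```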


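CSS matching

Nodes can also be matched with CSS expressions through response.css(), for example response.css('div.job-primary').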

items.py

import scrapy


class ZhipinspiderItem(scrapy.Item):
    # Job title
    title = scrapy.Field()
    # Salary
    salary = scrapy.Field()
    # Hiring company
    company = scrapy.Field()
    # Link to the job details page
    url = scrapy.Field()
    # Work location
    work_addr = scrapy.Field()
    # Industry
    industry = scrapy.Field()
    # Company size
    company_size = scrapy.Field()
    # Recruiter
    recruiter = scrapy.Field()
    # Publication date
    publish_date = scrapy.Field()

job_spider.py

import scrapy

from ZhipinSpider.items import ZhipinspiderItem


class JobPositionSpider(scrapy.Spider):
    # Name of this spider
    name = 'job_position'
    # Domains this spider is allowed to crawl
    allowed_domains = ['zhipin.com']
    # First pages to crawl
    start_urls = ['https://www.zhipin.com/c101280100/h_101280100/']

    # This method extracts the information contained in the response.
    # response is what the downloader fetched for each URL in start_urls.
    def parse(self, response):
        # Iterate over every //div[@class="job-primary"] node on the page
        for job_primary in response.xpath('//div[@class="job-primary"]'):
            item = ZhipinspiderItem()
            # Match the ./div[@class="info-primary"] node under job-primary,
            # i.e. the <div> element that holds the job information
            info_primary = job_primary.xpath('./div[@class="info-primary"]')
            item['title'] = info_primary.xpath('./h3/a/div[@class="job-title"]/text()').extract_first()
            item['salary'] = info_primary.xpath('./h3/a/span[@class="red"]/text()').extract_first()
            item['work_addr'] = info_primary.xpath('./p/text()').extract_first()
            item['url'] = info_primary.xpath('./h3/a/@href').extract_first()
            # Match the ./div[@class="info-company"]/div[@class="company-text"]
            # node under job-primary, i.e. the <div> element that holds the
            # company information
            company_text = job_primary.xpath(
                './div[@class="info-company"]/div[@class="company-text"]')
            item['company'] = company_text.xpath('./h3/a/text()').extract_first()
            company_info = company_text.xpath('./p/text()').extract()
            # The fields below stay unset when the page has too few text nodes
            if len(company_info) > 0:
                item['industry'] = company_info[0]
            if len(company_info) > 2:
                item['company_size'] = company_info[2]
            # Match the ./div[@class="info-publis"] node under job-primary,
            # i.e. the <div> element that holds the recruiter information
            info_publis = job_primary.xpath('./div[@class="info-publis"]')
            item['recruiter'] = info_publis.xpath('./h3/text()').extract_first()
            item['publish_date'] = info_publis.xpath('./p/text()').extract_first()
            yield item

        # Parse the link to the next page
        new_links = response.xpath('//div[@class="page"]/a[@class="next"]/@href').extract()
        if new_links:
            # Take the link to the next page
            new_link = new_links[0]
            # Send another request to fetch the next page of data
            yield scrapy.Request("https://www.zhipin.com" + new_link, callback=self.parse)
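The next-page request above builds its URL by string concatenation, which only works when the href is a root-relative path. A slightly more defensive variant, sketched here with the standard library only, resolves the href against the current page URL and skips links that cannot be followed. The helper name next_page_url and the sample href values are hypothetical. Inside a spider, response.urljoin(href) performs the same resolution against the response's own URL.

```python
from urllib.parse import urljoin

# The start URL from the spider, used here as the page the href was found on.
BASE_URL = "https://www.zhipin.com/c101280100/h_101280100/"


def next_page_url(base_url, href):
    """Return an absolute URL for href, or None when there is nothing to follow."""
    # A missing href or a javascript: pseudo-link means there is no next page.
    if not href or href.startswith("javascript:"):
        return None
    return urljoin(base_url, href)


# Hypothetical href values for illustration.
print(next_page_url(BASE_URL, "/c101280100/?page=2"))
print(next_page_url(BASE_URL, "javascript:;"))
```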

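pipelines.py

class ZhipinspiderPipeline:
    def process_item(self, item, spider):
        print("Job:", item['title'])
        print("Salary:", item['salary'])
        print("Work location:", item['work_addr'])
        print("Details link:", item['url'])
        print("Company:", item['company'])
        # industry and company_size are only set when the page provides them,
        # so read them with get() to avoid a KeyError
        print("Industry:", item.get('industry'))
        print("Company size:", item.get('company_size'))
        print("Recruiter:", item['recruiter'])
        print("Publication date:", item['publish_date'])
        # Return the item so that any later pipeline also receives it
        return item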
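Printing is enough for debugging, but a pipeline usually stores what it receives. As a minimal sketch, the class below writes each item as one JSON object per line. The class name JsonLinesPipeline and the default file name jobs.jl are assumptions made for this example and are not part of the original project. open_spider, close_spider and process_item are the hook names Scrapy calls on a pipeline, and the class needs no Scrapy import, so it can be tried with plain dicts.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item to a JSON Lines file."""

    def __init__(self, path="jobs.jl"):
        # Output file path; "jobs.jl" is an arbitrary default for this sketch.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        if self.file is not None:
            self.file.close()
            self.file = None

    def process_item(self, item, spider):
        # dict(item) works for Scrapy items and plain dicts alike;
        # ensure_ascii=False keeps Chinese text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Like any pipeline, it only runs once it is listed in ITEM_PIPELINES in settings.py.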

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for ZhipinSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:

BOT_NAME = 'ZhipinSpider'

SPIDER_MODULES = ['ZhipinSpider.spiders']
NEWSPIDER_MODULE = 'ZhipinSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'ZhipinSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
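The listing stops at the delay comment. As a hedged sketch of what typically follows, the fragment below sets a download delay and enables the pipeline from pipelines.py. DOWNLOAD_DELAY and ITEM_PIPELINES are standard Scrapy settings, but the values 3 and 300 are assumptions for illustration and do not come from the original file.

```python
# settings.py fragment (assumed values, tune them for your own crawl)

# Seconds to wait between requests to the same website.
DOWNLOAD_DELAY = 3

# Enable the pipeline; the number (0-1000) sets the order in which
# pipelines run, lower values first.
ITEM_PIPELINES = {
    "ZhipinSpider.pipelines.ZhipinspiderPipeline": 300,
}
```

With the pipeline enabled, running scrapy crawl job_position from the project directory starts the spider and passes each yielded item to process_item.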
