An upgraded Scrapy crawler for the campus-belle site (xiaohuar.com)

**File in the spider directory: defining the DemoSpider class**
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
from img.items import ImgItem
#from bs4 import BeautifulSoup
#import urllib
#import requests
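

class DemoSpider(CrawlSpider):
    name = 'demo'
    start_urls = ['http://www.xiaohuar.com/list-1-2.html']
    """
    The first Rule selects all of the list pages.
    The second Rule finds the detail-page URL of every campus belle on the current page.
    allow: a regular expression; URLs that match it are extracted.
    restrict_xpaths: limits the part of the page that is searched for links.
    callback: the callback function used to process the page.
    process_links: a hook for processing the extracted links; the function it names must accept a links argument.
    follow: whether links found on the matched pages are followed.
    """
    # A tuple keeps the rules in a fixed order (the original used a set literal).
    rules = (
        Rule(LinkExtractor(allow=('http://www.xiaohuar.com/list'),
                           restrict_xpaths=("//div[@class='page_num']")),
             # callback="paser_url",
             follow=True),
        Rule(LinkExtractor(allow='/p', restrict_xpaths="//div[@class='title']"),
             callback="paser_item", follow=False),
    )

    def paser_item(self, response):
        item = ImgItem()
        url = response.url
        print("url=%s" % url)
        # Guard against pages that lack the expected elements
        try:
            img_url = response.xpath("//div[@class='infoleft_imgdiv']/a/img/@src").extract()[0]
            name = response.xpath("//div[@class='infodiv']/table/tbody/tr[1]/td[2]/text()").extract()
            school = response.xpath("//div[@class='infodiv']/table/tbody/tr[5]/td[2]/text()").extract()
            # Prefix site-relative image paths with the site root
            if 'http://www.xiaohuar.com' not in img_url:
                item['url'] = 'http://www.xiaohuar.com' + img_url
            else:
                item['url'] = img_url
            item['name'] = name
            item['school'] = school
            yield item
        except Exception:
            print('error')

**Defining the items file**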
import scrapy


class ImgItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    name = scrapy.Field()
    school = scrapy.Field()

**Defining the pipelines file**
import codecs
import json
import urllib
import os


class ImgPipeline(object):
    def __init__(self):
        # Open the JSON output file as UTF-8; otherwise the Chinese text comes out garbled
        self.file = codecs.open('items.json', 'w', encoding='utf-8')
        # self.file_path=os.path.normpath("h:\\scrapy\\img\\img_picture")
        # self.count=1

    def process_item(self, item, spider):
        # Required method: Scrapy calls it for every item
        # Serialize the item to JSON, one item per line
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        # if not os.path.exists(self.file_path):
        #     os.mkdir(self.file_path)
        #     img_name=os.path.normpath("h:\\scrapy\\img\\img_picture\\%s.jpg"%self.count)
        #     urllib.urlretrieve(item['url'],img_name)
        #     self.count+=1
        self.file.write(line)
        # Return the item so that later pipelines can keep processing it
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; the original
        # named this method close_file, which Scrapy never calls
        self.file.close()

**settings**
# Add the following setting to the settings file
ITEM_PIPELINES = {
    "img.pipelines.ImgPipeline": 300,
}
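
Two pieces of plain-Python logic from the spider and the pipeline can be tried without running a crawl. The sketch below is only an illustration, assuming Python 3: the helper names `absolute_img_url` and `to_json_line` are hypothetical, and it swaps the spider's substring check for `urljoin` from the standard library, which also leaves absolute URLs on other hosts untouched.

```python
import json
from urllib.parse import urljoin  # Python 3 standard library

BASE_URL = 'http://www.xiaohuar.com'


def absolute_img_url(img_url, base=BASE_URL):
    """Hypothetical helper: join a site-relative image path onto the site root.

    An img_url that is already absolute is returned unchanged.
    """
    return urljoin(base, img_url)


def to_json_line(record):
    """Hypothetical helper: serialize one record as a single JSON line.

    ensure_ascii=False keeps Chinese characters readable instead of
    escaping them as \\uXXXX sequences, as in ImgPipeline.process_item.
    """
    return json.dumps(record, ensure_ascii=False) + '\n'


if __name__ == '__main__':
    # '/d/file/example.jpg' is a made-up path used only for illustration
    print(absolute_img_url('/d/file/example.jpg'))
    print(to_json_line({'url': absolute_img_url('/d/file/example.jpg'),
                        'name': ['校花']}))
```

Because the output file is opened with `encoding='utf-8'`, writing the unescaped line produced this way keeps `items.json` human-readable.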
