Scraping cnblog news with Scrapy

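tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file; a pipeline must be registered in settings.py, after which its process_item method is called automatically
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory that holds the spiders; a spider file placed here takes effect automatically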

Goal: scrape the titles and content of cnblog news articles

1. Create a new project
Run

scrapy startproject cnblog

2. Edit items.py and add the title, url, and content fields

import scrapy


class CnblogItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()

3. Create a spider to scrape the content

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from cnblog.items import CnblogItem


class CnSpider(Spider):

    # Name of the spider
    name = "cnblog"
    # Domains the spider may crawl, so it does not wander off-site
    allowed_domains = ["cnblogs.com", "news.cnblogs.com"]

    # Starting links: build the URL of every listing page
    start_urls = []
    for pn in range(1, 2):
        url = 'https://news.cnblogs.com/n/page/%s/' % pn
        start_urls.append(url)

    # Collect the links to all the article pages
    def parse(self, response):
        sel = Selector(response)
        news_list = sel.xpath('//h2[@class="news_entry"]')

        for new_i in news_list:
            new_link = new_i.xpath('a/@href').extract()
            link_0 = "https://news.cnblogs.com" + new_link[0]
            yield Request(link_0, callback=self.parse_item)

    # Scrape the content of a news detail page
    def parse_item(self, response):
        item = CnblogItem()
        item['url'] = response.request.url
        item['title'] = response.xpath('//div[@id="news_title"]/a/text()').extract()[0]
        item['content'] = response.xpath('//div[@id="news_body"]').extract()[0]
        yield item
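The spider builds its listing URLs from a page number and turns each href into a full URL by prepending the site root. A minimal standard-library sketch of those two steps follows. The helper names `listing_urls` and `absolute_link` are hypothetical and not part of the spider, the sample href is made up, and `urljoin` is swapped in for plain string concatenation so that an href that is already absolute is left unchanged.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.cnblogs.com"


def listing_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return ["%s/n/page/%s/" % (BASE_URL, pn) for pn in range(first_page, last_page + 1)]


def absolute_link(href):
    """Resolve an href taken from a listing page against the site root."""
    return urljoin(BASE_URL, href)


print(listing_urls(1, 3))
print(absolute_link("/n/123456/"))  # made-up article path
```

Note that `range(1, 2)` in the spider yields only page 1; widening the range is what makes it follow more listing pages.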

4. Define the pipeline, which saves the content to items.jl

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

from scrapy.exceptions import DropItem

from Util import FileUtil


class CnblogPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'w')
        self.url_seen = set()

    def process_item(self, item, spider):
        # Filter out duplicate items
        if item['url'] in self.url_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.url_seen.add(item['url'])
            line = json.dumps(dict(item)) + "\n"
            self.file.write(line)
            FileUtil.saveNews(item['url'], item['title'], item['content'])
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes
        self.file.close()
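The core of the pipeline is two small ideas: remember every url already seen, and write each new item as one JSON object per line. The sketch below reproduces that logic without Scrapy so it can be run on its own. The function name `write_unique` and the sample items are hypothetical, and `ensure_ascii=False` is an optional addition that the pipeline above does not use; it keeps Chinese text readable in the file instead of `\uXXXX` escapes.

```python
import io
import json


def write_unique(items, out, seen=None):
    """Write each item with an unseen 'url' as one JSON line; return how many were written."""
    seen = set() if seen is None else seen
    written = 0
    for item in items:
        if item["url"] in seen:
            continue  # duplicate: skipped here, where the pipeline raises DropItem
        seen.add(item["url"])
        # ensure_ascii=False writes non-ASCII characters as-is
        out.write(json.dumps(item, ensure_ascii=False) + "\n")
        written += 1
    return written


buf = io.StringIO()  # stands in for the items.jl file
sample = [
    {"url": "https://news.cnblogs.com/n/1/", "title": "新闻"},
    {"url": "https://news.cnblogs.com/n/1/", "title": "新闻"},
    {"url": "https://news.cnblogs.com/n/2/", "title": "news"},
]
count = write_unique(sample, buf)
print(count)
print(buf.getvalue())
```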

5. Activate the pipeline
Edit settings.py

ITEM_PIPELINES = {
    'cnblog.pipelines.CnblogPipeline': 300,
}
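The integer assigned to each pipeline decides its position: Scrapy runs pipelines from the lowest value to the highest, and the values are customarily kept in the 0-1000 range. The toy model below only illustrates that ordering. It is not Scrapy's implementation, and `cnblog.pipelines.HypotheticalCleanupPipeline` is an invented name added so that there are two entries to order.

```python
ITEM_PIPELINES = {
    'cnblog.pipelines.CnblogPipeline': 300,
    'cnblog.pipelines.HypotheticalCleanupPipeline': 100,  # invented for illustration
}

# Sort the class paths by their priority value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```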

6. The code of FileUtil.py

import sqlite3
import urllib.request
from urllib.parse import quote


class FileUtil:

    # Download an image from its URL. If no save path is given,
    # the image is saved to D:\download\<image name>.
    @staticmethod
    def downImg(imgUrl, savePath=None):
        imgName = imgUrl.split('/')[-1]
        preUrl = imgUrl.replace(imgName, "")
        if savePath is None:
            savePath = "D:\\download\\" + imgName

        conn = urllib.request.urlopen(preUrl + quote(imgName))
        with open(savePath, 'wb') as f:
            f.write(conn.read())
        print('Saved:' + savePath)

    @staticmethod
    def saveNews(url, title=None, content=None):
        if title is None:
            title = ""
        if content is None:
            content = ""
        conn = sqlite3.connect('news.db')
        cursor = conn.cursor()
        # Run a SQL statement that creates the news table if it does not exist:
        cursor.execute('create table IF NOT EXISTS news (id INTEGER PRIMARY KEY, url varchar(100), title varchar(100), content text)')
        # Use ? placeholders so that quotes in the scraped text cannot break the SQL
        cursor.execute('select * from news where url=?', (url,))
        values = cursor.fetchall()
        if len(values) > 0:  # the link is already stored
            print('Link already exists:' + url)
        else:
            cursor.execute('insert into news (url, title, content) values (?, ?, ?)', (url, title, content))
            print("save success." + url)
        # Close the cursor:
        cursor.close()
        # Commit the transaction:
        conn.commit()
        # Close the connection:
        conn.close()
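Scraped titles and HTML bodies routinely contain single quotes, so values are safest when passed to SQLite as parameters rather than pasted into the SQL string. The following self-contained sketch shows one way to do that. It is a variation, not the code above: the function name `save_news` is hypothetical, it uses an in-memory database, and it swaps the select-then-insert check for a `UNIQUE` column combined with `INSERT OR IGNORE`.

```python
import sqlite3


def save_news(conn, url, title="", content=""):
    """Insert one row unless the url is already stored; return True if a row was inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, content TEXT)"
    )
    # The ? placeholders let sqlite3 handle quoting, so quotes in the text are safe.
    cur = conn.execute(
        "INSERT OR IGNORE INTO news (url, title, content) VALUES (?, ?, ?)",
        (url, title, content),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")  # in-memory database, used only for this demo
first = save_news(conn, "https://news.cnblogs.com/n/1/", "It's a 'quoted' title", "<p>body</p>")
second = save_news(conn, "https://news.cnblogs.com/n/1/", "duplicate", "")
rows = conn.execute("SELECT url, title FROM news").fetchall()
print(first, second, rows)
```

Letting the database enforce uniqueness removes the separate lookup query, at the cost of requiring the `UNIQUE` constraint to be present when the table is first created.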
