Web Crawling, Part 6: Crawling an Entire Site (jianshu.com) with the Scrapy Framework

Creating the project

# Create the project
$ scrapy startproject jianshu
# Enter the project directory
$ cd jianshu
# Generate the spider file from the "crawl" template
$ scrapy genspider -t crawl jianshu_spider jianshu.com

The items.py file

import scrapy


class ArticleItem(scrapy.Item):
    # One field per value scraped from an article page
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()

The jianshu_spider.py file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jianshu.items import ArticleItem


class JianshuSpiderSpider(CrawlSpider):
    name = 'jianshu_spider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    # Follow every link that looks like an article page (/p/ plus a
    # 12-character id) and pass each response to parse_detail.
    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'),
             callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        # Keep the article body as raw HTML
        content = response.xpath("//div[@class='show-content-free']").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//div[@class='info']/span/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get()
        # Drop the query string, then take the last path segment as the id
        article_id = response.url.split("?")[0].split("/")[-1]
        origin_url = response.url
        item = ArticleItem(
            title=title,
            content=content,
            avatar=avatar,
            pub_time=pub_time,
            article_id=article_id,
            origin_url=origin_url,
            author=author,
        )
        yield item
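The link rule and the article_id extraction above can be tried without running a crawl. The following is a minimal sketch using only the standard library. The two URLs are made up for illustration, and the helper name extract_article_id is hypothetical; it only mirrors the expression used in parse_detail.

```python
import re

# Same pattern the spider passes to LinkExtractor(allow=...)
ARTICLE_URL = re.compile(r'.*/p/[0-9a-z]{12}.*')


def extract_article_id(url):
    """Drop the query string, then take the last path segment."""
    return url.split("?")[0].split("/")[-1]


# Hypothetical URLs, made up for illustration
article = "https://www.jianshu.com/p/0123456789ab?utm_source=desktop"
profile = "https://www.jianshu.com/u/0123456789ab"

print(bool(ARTICLE_URL.search(article)))  # True: has /p/ plus a 12-char id
print(bool(ARTICLE_URL.search(profile)))  # False: no /p/ segment
print(extract_article_id(article))        # 0123456789ab
```

Note that a URL ending in a trailing slash would give an empty string as the id, since the last path segment is then empty.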

Inserting into MySQL synchronously

import pymysql


class JianshuPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'port': 3306,
            'charset': 'utf8',
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        # Blocking insert: the crawl waits until the row is committed
        self.cursor.execute(self.sql, (
            item['title'], item['content'], item['author'], item['avatar'],
            item['pub_time'], item['origin_url'], item['article_id'],
        ))
        self.conn.commit()
        return item

    @property
    def sql(self):
        # Build the statement once and reuse it for every item
        if not self._sql:
            self._sql = """
                insert into article(title, content, author, avatar,
                                    pub_time, origin_url, article_id)
                values (%s, %s, %s, %s, %s, %s, %s)
            """
        return self._sql

Inserting into MySQL asynchronously

from twisted.enterprise import adbapi
from pymysql import cursors


class JianshuTwistedPipeline(object):
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'user': 'root',
            'password': '123456',
            'database': 'jianshu',
            'port': 3306,
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor,
        }
        # Connection pool whose queries run in Twisted's thread pool
        self.dbpool = adbapi.ConnectionPool('pymysql', **dbparams)
        self._sql = None

    @property
    def sql(self):
        # Build the statement once and reuse it for every item
        if not self._sql:
            self._sql = """
                insert into article(title, content, author, avatar,
                                    pub_time, origin_url, article_id)
                values (%s, %s, %s, %s, %s, %s, %s)
            """
        return self._sql

    def process_item(self, item, spider):
        # Schedule the insert without blocking the crawl; runInteraction
        # commits on success and rolls back on failure
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        # Return the item so any later pipeline still receives it
        return item

    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (
            item['title'], item['content'], item['author'], item['avatar'],
            item['pub_time'], item['origin_url'], item['article_id'],
        ))

    def handle_error(self, error, item, spider):
        print('=' * 10 + 'error' + '=' * 10)
        print(error)
        print('=' * 10 + 'error' + '=' * 10)
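Scrapy only runs a pipeline that is listed in the ITEM_PIPELINES setting. The sketch below shows what that entry might look like in settings.py. The module path jianshu.pipelines is an assumption: the post does not say which file holds the pipeline classes, and pipelines.py is simply the file that scrapy startproject generates. The priority 300 is likewise an arbitrary example value.

```python
# settings.py (fragment)
# Assumption: both pipeline classes live in jianshu/pipelines.py.
# Enable only one of them, otherwise every article is inserted twice.
ITEM_PIPELINES = {
    # 'jianshu.pipelines.JianshuPipeline': 300,       # synchronous version
    'jianshu.pipelines.JianshuTwistedPipeline': 300,  # asynchronous version
}
```

The number is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest value to the highest.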

Reposted from: https://www.cnblogs.com/leijing0607/p/8075324.html

