Python + Scrapy: A Small Crawler with Big Dreams

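Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.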

1. Installing Scrapy

pip install scrapy
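
To confirm that the install succeeded, the package version can be queried from Python. This is a minimal sketch that assumes Python 3.8 or later; the helper name installed_version is hypothetical and not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string when Scrapy is installed, otherwise None.
print(installed_version("scrapy"))
```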

2. Scrapy documentation (Chinese)

http://scrapy-chs.readthedocs.org/zh_CN/0.24/intro/overview.html

3. Steps for running Scrapy

The rough steps are as follows:
1. Create a new project:
scrapy startproject Spider
2. Decide what to scrape:
Edit the Item to define the data you want to scrape.
import scrapy


class Information(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()
    author = scrapy.Field()
    source = scrapy.Field()
    time = scrapy.Field()
3. Write the spider:
import scrapy


class DmozSpider(scrapy.Spider):
    name = "demo"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Save each page body to a file named after the last URL path segment
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
4. Start the spider:
scrapy crawl demo
5. Store the scraped content:
scrapy crawl demo -o items.json -t json
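
The parse method in step 3 names its output file with response.url.split("/")[-2]. The sketch below shows what that expression yields for the two start URLs; the helper name filename_for is hypothetical and only wraps the same expression so it can be tried without running a crawl.

```python
def filename_for(url):
    """Mirror the spider's expression: take the second-to-last '/'-separated piece."""
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    # The trailing slash leaves an empty last piece, so [-2] is the final path segment.
    print(filename_for(url))
# prints "Books" then "Resources"
```

The expression depends on the trailing slash: for a URL that does not end in "/", index -2 picks the segment before the last one instead.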

4. A small crawler with big dreams

1. Project layout
C:.
│  items.json
│  scrapy.cfg
│
└─Spider
    │  items.py
    │  items.pyc
    │  pipelines.py
    │  pipelines.pyc
    │  settings.py
    │  settings.pyc
    │  __init__.py
    │  __init__.pyc
    │
    └─spiders
            Information_spider.py
            Information_spider.pyc
            __init__.py
            __init__.pyc

2. The spider file Information_spider.py, which does the actual scraping
# -*- coding: utf-8 -*-
import datetime

import scrapy
from bs4 import BeautifulSoup

from Spider.items import Information


class Information_Spider(scrapy.Spider):
    name = "csdn"
    allowed_domains = ["csdn.net"]
    # Search keywords
    categories = ["python", u"测试"]
    start_urls = [
        "http://so.csdn.net/so/search/s.do?q=" + categories[0] + "&t=blog",
        "http://so.csdn.net/so/search/s.do?q=" + categories[1] + "&t=blog",
    ]
    rules = [
        # Rule(SgmlLinkExtractor(allow=('')), callback='parse_article', follow=True)
    ]

    # Get the next-page links of the popular-blog search results
    def parse(self, response):
        base_url = "http://so.csdn.net/so/search/s.do"
        soup = BeautifulSoup(response.body, 'html.parser')
        links = soup.find("span", "page-nav").find_all("a")
        print(u"**Getting next-page links of popular blogs**\n")
        for link in links:
            href = base_url + link.get("href")
            # Pass each link to parse_link for the next round of scraping
            yield scrapy.Request(href, callback=self.parse_link)

    # Get the links to the popular blog posts
    def parse_link(self, response):
        soup = BeautifulSoup(response.body, 'html.parser')
        links = soup.find_all("dl", "search-list")
        print(u"**Getting popular blog links**\n")
        print(links)
        for link in links:
            href = link.find("dt").find("a").get("href")
            # Pass each link to parse_article for the next round of scraping
            yield scrapy.Request(href, callback=self.parse_article)

    # Get the article
    def parse_article(self, response):
        items = []
        soup = BeautifulSoup(response.body, 'html.parser')
        base_url = "http://blog.csdn.net"
        # Date on which the article was scraped
        time = datetime.datetime.today().strftime('%Y-%m-%d')
        # Article title
        title_block = soup.find("span", "link_title").find("a")
        title = title_block.get_text()
        # Article link
        title_link_detail = title_block.get("href")
        title_link = base_url + title_link_detail
        # Article author
        author_block = soup.find("div", {"id": "blog_userface"}).find("span").find("a")
        author = author_block.get_text()
        # Article body: try the Markdown container first, then the plain one
        body_div = soup.find("div", "markdown_views")
        if body_div is None:
            body_div = soup.find("div", "article_content")
        body_block = body_div.find_all("p")
        article = ""
        for body in body_block:
            article += body.get_text() + "\n"
        # Store the scraped content
        if len(article) != 0:
            item = Information()
            item["title"] = title
            item["body"] = article
            item["author"] = author
            item["source"] = title_link
            item["time"] = time
            items.append(item)
        return items

3. The items.py file, which defines the scraped data
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class Information(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()
    author = scrapy.Field()
    source = scrapy.Field()
    time = scrapy.Field()

4. The pipelines.py file, which stores the scraped data in MySQL
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb


class SpiderPipeline(object):
    # Connect to the database at 120.24.239.214
    def __init__(self):
        self.conn = MySQLdb.connect(
            host="120.24.239.214",
            user="root",
            passwd="***********",  # the password stays secret
            db="Teman",
            port=3306,
            charset="utf8",  # assumption: added so Chinese text is stored correctly
        )
        self.cur = self.conn.cursor()

    # Store the scraped item in MySQL
    def process_item(self, item, spider):
        try:
            information_title = item["title"].strip()
            information_body = item["body"].replace("\n", "<br/>")
            information_author = item["author"]
            information_source = item["source"]
            information_time = item["time"]
            # Filter out articles that were already added
            self.cur.execute("select * from information where source = %s",
                             (information_source,))
            judge_source = self.cur.fetchall()
            self.cur.execute("select * from information where title = %s",
                             (information_title,))
            judge_title = self.cur.fetchall()
            if len(judge_source) == 0 or len(judge_title) == 0:
                # Values are passed as parameters so quotes in them are escaped
                sql = ("insert into information(title, body, author, source, time) "
                       "values(%s, %s, %s, %s, %s)")
                self.cur.execute(sql, (information_title, information_body,
                                       information_author, information_source,
                                       information_time))
                self.conn.commit()
        except MySQLdb.Error as e:
            print(e)
        return item

    # Close the MySQL connection
    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
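
The duplicate check and insert in process_item can be tried out without a MySQL server. The following is a minimal sketch, not the pipeline itself: it assumes SQLite's in-memory database as a stand-in (sqlite3 uses ? placeholders where MySQLdb uses %s), and the function name save_if_new and the sample item are hypothetical. It also skips an article when either the source or the title is already stored, which is stricter than process_item, where a row is inserted whenever either lookup comes back empty.

```python
import sqlite3


def save_if_new(conn, item):
    """Insert `item` unless a row with the same source or title exists.

    Returns True when a row was inserted, False when it was skipped.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT 1 FROM information WHERE source = ? OR title = ? LIMIT 1",
        (item["source"], item["title"]),
    )
    if cur.fetchone() is not None:
        return False
    # Values are bound as parameters, so quotes inside them need no escaping.
    cur.execute(
        "INSERT INTO information (title, body, author, source, time) "
        "VALUES (?, ?, ?, ?, ?)",
        (item["title"], item["body"], item["author"], item["source"], item["time"]),
    )
    conn.commit()
    return True


# In-memory table with the same column names the pipeline writes to.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE information "
    "(title TEXT, body TEXT, author TEXT, source TEXT, time TEXT)"
)

# Hypothetical sample item; the title deliberately contains double quotes.
sample = {
    "title": 'Notes on "Scrapy" pipelines',
    "body": "first line<br/>second line",
    "author": "someone",
    "source": "http://blog.csdn.net/example/article/details/1",
    "time": "2015-01-01",
}

first = save_if_new(conn, sample)
second = save_if_new(conn, sample)
print(first, second)
```

Running the same item through twice stores it once: the first call inserts the row and the second call finds it and returns False.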

5. Summary

A small crawler can have big dreams. I hope everyone keeps building on this crawler; you all know what it is good for.

阳台测试: 239547991 (group number)

My blog: http://xuyangting.sinaapp.com/
