Step 1: Create the Scrapy project:

  scrapy startproject Demo

Step 2: Generate a spider:

  

scrapy genspider demo http://auction.artron.net/result/pmh-0-0-2-0-1/
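
The genspider command only writes a stub spider into Demo/spiders/demo.py; the finished spider shown in step 4 replaces it. The stub looks roughly like the following (the exact output depends on the Scrapy version, and because a full URL was passed instead of a bare domain, allowed_domains and start_urls usually need hand-editing anyway):

# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['auction.artron.net']
    start_urls = ['http://auction.artron.net/']

    def parse(self, response):
        pass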

Step 3: Project structure:
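
The layout produced by the two commands above typically looks like this:

Demo/
    scrapy.cfg
    Demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            demo.py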

  

Step 4: The code for each file, in order:

  1. demo.py (the spider code)

      

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from Demo.items import *
from bs4 import BeautifulSoup
import time
# import sys
# reload(sys)
# sys.setdefaultencoding('utf-8')
import re
import hashlib
# MD5 hash used for de-duplication
def md5(str):
    m = hashlib.md5()
    m.update(str)
    return m.hexdigest()


# Strip HTML comments, newlines and extra whitespace from a tag's markup
def replace(newline):
    newline = str(newline)
    newline = newline.replace('\r', '').replace('\n', '').replace('\t', '').replace('   ', '').replace('amp;', '')
    re_comment = re.compile('<!--[^>]*-->')
    newlines = re_comment.sub('', newline)
    newlines = newlines.replace('<!--', '').replace('-->', '')
    return newlines


class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['http://auction.artron.net/result/']
    start_urls = ['http://auction.artron.net/result/pmh-0-0-2-0-1/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-2/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-4/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-5/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-6/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-7/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-8/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-9/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-10/',
                  'http://auction.artron.net/result/pmh-0-0-2-0-3/']

    def parse(self, response):
        # Flatten the result-list <ul> into one string, then extract the fields
        # with regular expressions (Python 2 code: note print statements and .decode()).
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        result_lists = soup.find_all('ul', attrs={"class": "dataList"})[0]
        result_lists_replace = replace(result_lists)
        result_lists_replace = result_lists_replace.decode('utf-8')
        result_list = re.findall('<ul><li class="name">(.*?)</span></li></ul></li>', result_lists_replace)
        for ii in result_list:
            item = DemoItem()
            auction_name_url = re.findall('<a alt="(.*?)" href="(.*?)" target="_blank" title', ii)[0]
            auction_name = auction_name_url[0]
            auction_url = auction_name_url[1]
            auction_url = "http://auction.artron.net" + auction_url
            aucr_name_spider = re.findall('<li class="company"><a href=".*?" target="_blank">(.*?)</a>', ii)[0]
            session_address_time = re.findall('<li class="city">(.*?)</li><li class="time">(.*?)</li></ul>', ii)[0]
            session_address = session_address_time[0]
            item_auct_time = session_address_time[1]
            hashcode = md5(str(auction_url))
            create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))

            item['auction_name'] = auction_name
            item['auction_url'] = auction_url
            item['aucr_name_spider'] = aucr_name_spider
            item['session_address'] = session_address
            item['item_auct_time'] = item_auct_time
            item['hashcode'] = hashcode
            item['create_time'] = create_time
            print item
            yield item

  2. items.py

   

# -*- coding: utf-8 -*-
import scrapy


class DemoItem(scrapy.Item):
    auction_name = scrapy.Field()
    auction_url = scrapy.Field()
    aucr_name_spider = scrapy.Field()
    session_address = scrapy.Field()
    item_auct_time = scrapy.Field()
    hashcode = scrapy.Field()
    create_time = scrapy.Field()

  3. pipelines.py

    

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import MySQLdb


def insert_data(dbName, data_dict):
    try:
        data_values = "(" + "%s," * (len(data_dict)) + ")"
        data_values = data_values.replace(',)', ')')
        dbField = data_dict.keys()
        dataTuple = tuple(data_dict.values())
        dbField = str(tuple(dbField)).replace("'", '')
        conn = MySQLdb.connect(host="10.10.10.77", user="xuchunlin", passwd="ed35sdef456",
                               db="epai_spider_2018", charset="utf8")
        cursor = conn.cursor()
        sql = """ insert into %s %s values %s """ % (dbName, dbField, data_values)
        params = dataTuple
        cursor.execute(sql, params)
        conn.commit()
        cursor.close()
        conn.close()
        print "=====  插入成功  ====="
        return 1
    except Exception as e:
        print "********                 插入失败                 ********"
        print e
        return 0


class DemoPipeline(object):
    def process_item(self, item, spider):
        dbName = "yachang_auction"
        data_dict = item
        insert_data(dbName, data_dict)
        return item
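
Before running a full crawl, insert_data() can be tested on its own. A minimal sketch, assuming it is appended to the bottom of pipelines.py and that the hard-coded MySQL credentials above point at a reachable server; the field values below are made-up placeholders:

if __name__ == "__main__":
    # Hypothetical test row: every key matches a column of the yachang_auction table (step 5).
    test_row = {
        "auction_name": "test session",
        "auction_url": "http://auction.artron.net/result/pmh-0-0-2-0-1/",
        "aucr_name_spider": "test auction house",
        "session_address": "Beijing",
        "item_auct_time": "2018-08-15",
        "hashcode": "0" * 32,
        "create_time": "2018-08-15 12:00:00",
    }
    # Returns 1 on success, 0 on failure; running it twice should fail the second
    # time because the table's UNIQUE KEY on hashcode rejects duplicates.
    print insert_data("yachang_auction", test_row)

That same UNIQUE KEY is what keeps repeated crawls from piling up duplicates: md5(auction_url) is stable, so a re-scraped session simply fails the insert and is skipped.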

  4. settings.py

  

# -*- coding: utf-8 -*-

# Scrapy settings for Demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Demo'

SPIDER_MODULES = ['Demo.spiders']
NEWSPIDER_MODULE = 'Demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Host": "auction.artron.net",
    # "Connection": "keep-alive",
    # "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "http://auction.artron.net/result/pmh-0-0-2-0-2/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cookie": "td_cookie=2322469817; gr_user_id=84f865e6-466f-4386-acfb-e524e8452c87; gr_session_id_276fdc71b3c353173f111df9361be1bb=ee1eb94e-b7a9-4521-8409-439ec1958b6c; gr_session_id_276fdc71b3c353173f111df9361be1bb_ee1eb94e-b7a9-4521-8409-439ec1958b6c=true; _at_pt_0_=2351147; _at_pt_1_=A%E8%AE%B8%E6%98%A5%E6%9E%97; _at_pt_2_=e642b85a3cf8319a81f48ef8cc403d3b; Hm_lvt_851619594aa1d1fb8c108cde832cc127=1533086287,1533100514,1533280555,1534225608; Hm_lpvt_851619594aa1d1fb8c108cde832cc127=1534298942",
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Demo.middlewares.DemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Demo.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Demo.pipelines.DemoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  5. Database table for the crawler:

  

CREATE TABLE `yachang_auction` (
  `key_id` int(255) NOT NULL AUTO_INCREMENT,
  `auction_name` varchar(255) DEFAULT NULL,
  `auction_url` varchar(255) DEFAULT NULL,
  `aucr_name_spider` varchar(255) DEFAULT NULL,
  `session_address` varchar(255) DEFAULT NULL,
  `item_auct_time` varchar(255) DEFAULT NULL,
  `hashcode` varchar(255) DEFAULT NULL,
  `create_time` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key_id`),
  UNIQUE KEY `hashcode` (`hashcode`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=230 DEFAULT CHARSET=utf8;
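
With the table created and the four files above in place, the crawl is launched from the project root (the directory containing scrapy.cfg), using the name defined in demo.py:

  scrapy crawl demo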

  6. Data preview
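
A quick way to check what was scraped is to query the table directly. A minimal sketch (Python 2, reusing the hard-coded connection parameters from pipelines.py; point them at your own MySQL instance):

# -*- coding: utf-8 -*-
import MySQLdb

conn = MySQLdb.connect(host="10.10.10.77", user="xuchunlin", passwd="ed35sdef456",
                       db="epai_spider_2018", charset="utf8")
cursor = conn.cursor()
cursor.execute("SELECT auction_name, aucr_name_spider, session_address, item_auct_time "
               "FROM yachang_auction LIMIT 10")
for row in cursor.fetchall():
    print row
cursor.close()
conn.close()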

  

Reposted from: https://www.cnblogs.com/xuchunlin/p/7253951.html
