Scrapy in Practice (Part 1): Scraping Auction Data from Artron (auction.artron.net)
Step 1: create the Scrapy project:
scrapy startproject Demo
Step 2: generate a spider:
scrapy genspider demo http://auction.artron.net/result/pmh-0-0-2-0-1/
Step 3: the project structure:
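The structure screenshot from the original post is not reproduced here; a project freshly generated by the two commands above typically looks like this:

```
Demo/
├── scrapy.cfg
└── Demo/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── demo.py
```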
Step 4: the code for each file, in order:
1. demo.py
# -*- coding: utf-8 -*-
import scrapy
from Demo.items import *
from bs4 import BeautifulSoup
import time
import re
import hashlib


# Hash a string for deduplication
def md5(s):
    m = hashlib.md5()
    m.update(s)
    return m.hexdigest()


# Strip HTML comments and collapse whitespace in a fragment
def replace(newline):
    newline = str(newline)
    newline = newline.replace('\r', '').replace('\n', '').replace('\t', '').replace(' ', '').replace('amp;', '')
    re_comment = re.compile('<!--[^>]*-->')
    newlines = re_comment.sub('', newline)
    newlines = newlines.replace('<!--', '').replace('-->', '')
    return newlines


class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['auction.artron.net']
    start_urls = [
        'http://auction.artron.net/result/pmh-0-0-2-0-1/',
        'http://auction.artron.net/result/pmh-0-0-2-0-2/',
        'http://auction.artron.net/result/pmh-0-0-2-0-3/',
        'http://auction.artron.net/result/pmh-0-0-2-0-4/',
        'http://auction.artron.net/result/pmh-0-0-2-0-5/',
        'http://auction.artron.net/result/pmh-0-0-2-0-6/',
        'http://auction.artron.net/result/pmh-0-0-2-0-7/',
        'http://auction.artron.net/result/pmh-0-0-2-0-8/',
        'http://auction.artron.net/result/pmh-0-0-2-0-9/',
        'http://auction.artron.net/result/pmh-0-0-2-0-10/',
    ]

    def parse(self, response):
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        result_lists = soup.find_all('ul', attrs={"class": "dataList"})[0]
        result_lists_replace = replace(result_lists)
        result_lists_replace = result_lists_replace.decode('utf-8')
        result_list = re.findall('<ul><li class="name">(.*?)</span></li></ul></li>', result_lists_replace)

        for ii in result_list:
            item = DemoItem()
            auction_name_url = re.findall('<a alt="(.*?)" href="(.*?)" target="_blank" title', ii)[0]
            auction_name = auction_name_url[0]
            auction_url = "http://auction.artron.net" + auction_name_url[1]
            aucr_name_spider = re.findall('<li class="company"><a href=".*?" target="_blank">(.*?)</a>', ii)[0]
            session_address_time = re.findall('<li class="city">(.*?)</li><li class="time">(.*?)</li></ul>', ii)[0]
            session_address = session_address_time[0]
            item_auct_time = session_address_time[1]
            hashcode = md5(str(auction_url))
            create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))

            item['auction_name'] = auction_name
            item['auction_url'] = auction_url
            item['aucr_name_spider'] = aucr_name_spider
            item['session_address'] = session_address
            item['item_auct_time'] = item_auct_time
            item['hashcode'] = hashcode
            item['create_time'] = create_time
            print item
            yield item
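The `md5` helper above is what produces the `hashcode` used for deduplication. It can be reproduced standalone; a small sketch in Python 3 syntax (note that under Python 3, `hashlib` requires bytes, so the URL must be encoded first):

```python
import hashlib

def md5(s):
    # Python 3: hashlib works on bytes, so encode the string first
    m = hashlib.md5()
    m.update(s.encode('utf-8'))
    return m.hexdigest()

url = "http://auction.artron.net/result/pmh-0-0-2-0-1/"
hashcode = md5(url)
print(hashcode)       # 32-character hex digest, stable across runs
print(len(hashcode))  # 32
```

Because the digest is deterministic, re-crawling the same auction URL always produces the same `hashcode`, which the database's unique index then rejects as a duplicate.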
2. items.py
# -*- coding: utf-8 -*-
import scrapy


class DemoItem(scrapy.Item):
    auction_name = scrapy.Field()
    auction_url = scrapy.Field()
    aucr_name_spider = scrapy.Field()
    session_address = scrapy.Field()
    item_auct_time = scrapy.Field()
    hashcode = scrapy.Field()
    create_time = scrapy.Field()
3. pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import MySQLdb


def insert_data(dbName, data_dict):
    try:
        # One %s placeholder per column, e.g. "(%s,%s,...)"
        data_values = "(" + "%s," * (len(data_dict)) + ")"
        data_values = data_values.replace(',)', ')')
        dbField = data_dict.keys()
        dataTuple = tuple(data_dict.values())
        dbField = str(tuple(dbField)).replace("'", '')
        conn = MySQLdb.connect(host="10.10.10.77", user="xuchunlin", passwd="ed35sdef456",
                               db="epai_spider_2018", charset="utf8")
        cursor = conn.cursor()
        sql = """ insert into %s %s values %s """ % (dbName, dbField, data_values)
        cursor.execute(sql, dataTuple)
        conn.commit()
        cursor.close()
        conn.close()
        print "===== insert succeeded ====="
        return 1
    except Exception as e:
        print "******** insert failed ********"
        print e
        return 0


class DemoPipeline(object):
    def process_item(self, item, spider):
        dbName = "yachang_auction"
        data_dict = item
        insert_data(dbName, data_dict)
        return item
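The pipeline builds its INSERT statement from the item's keys and values, which can be checked without a database connection. A minimal Python 3 sketch of the same string-assembly logic, using a hypothetical two-field item:

```python
def build_insert_sql(table, data_dict):
    # One %s placeholder per value, e.g. "(%s,%s)"
    placeholders = "(" + ",".join(["%s"] * len(data_dict)) + ")"
    # Column list without quotes, e.g. "(auction_name, hashcode)"
    fields = "(" + ", ".join(data_dict.keys()) + ")"
    sql = "insert into %s %s values %s" % (table, fields, placeholders)
    return sql, tuple(data_dict.values())

sql, params = build_insert_sql("yachang_auction",
                               {"auction_name": "Spring Sale", "hashcode": "abc123"})
print(sql)     # insert into yachang_auction (auction_name, hashcode) values (%s,%s)
print(params)  # ('Spring Sale', 'abc123')
```

Keeping the values out of the SQL string and passing them as `params` lets the MySQL driver escape them, which avoids SQL-injection problems with scraped text.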
4. settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for Demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Demo'

SPIDER_MODULES = ['Demo.spiders']
NEWSPIDER_MODULE = 'Demo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    "Host": "auction.artron.net",
    # "Connection": "keep-alive",
    # "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Referer": "http://auction.artron.net/result/pmh-0-0-2-0-2/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "Cookie": "td_cookie=2322469817; gr_user_id=84f865e6-466f-4386-acfb-e524e8452c87; gr_session_id_276fdc71b3c353173f111df9361be1bb=ee1eb94e-b7a9-4521-8409-439ec1958b6c; gr_session_id_276fdc71b3c353173f111df9361be1bb_ee1eb94e-b7a9-4521-8409-439ec1958b6c=true; _at_pt_0_=2351147; _at_pt_1_=A%E8%AE%B8%E6%98%A5%E6%9E%97; _at_pt_2_=e642b85a3cf8319a81f48ef8cc403d3b; Hm_lvt_851619594aa1d1fb8c108cde832cc127=1533086287,1533100514,1533280555,1534225608; Hm_lpvt_851619594aa1d1fb8c108cde832cc127=1534298942",
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Demo.middlewares.DemoSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Demo.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Demo.pipelines.DemoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
5. The database table for the scraped data:
CREATE TABLE `yachang_auction` (
  `key_id` int(255) NOT NULL AUTO_INCREMENT,
  `auction_name` varchar(255) DEFAULT NULL,
  `auction_url` varchar(255) DEFAULT NULL,
  `aucr_name_spider` varchar(255) DEFAULT NULL,
  `session_address` varchar(255) DEFAULT NULL,
  `item_auct_time` varchar(255) DEFAULT NULL,
  `hashcode` varchar(255) DEFAULT NULL,
  `create_time` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`key_id`),
  UNIQUE KEY `hashcode` (`hashcode`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=230 DEFAULT CHARSET=utf8;
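The UNIQUE key on `hashcode` is what enforces deduplication: inserting a row with an already-seen auction URL violates the index, and the pipeline's try/except absorbs the error. If you would rather skip duplicates silently at the SQL level, one common option (not used in the original post; the values here are hypothetical) is INSERT IGNORE:

```sql
-- Hypothetical example: a row whose hashcode already exists is silently skipped
INSERT IGNORE INTO yachang_auction
    (auction_name, auction_url, aucr_name_spider, session_address,
     item_auct_time, hashcode, create_time)
VALUES
    ('Spring Sale', 'http://auction.artron.net/result/pmh-0-0-2-0-1/', 'Some House',
     'Beijing', '2018-08-15', 'd41d8cd98f00b204e9800998ecf8427e', '2018-08-15 10:00:00');
```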
6. Data preview (the screenshot from the original post is omitted)
Reposted from: https://www.cnblogs.com/xuchunlin/p/7253951.html