Introduction

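I used Scrapy to crawl a certain American TV series site, which I had not originally planned to do. But the site has far too many ads, and it recently split what used to be a single page into six. Every visit meant opening six pages and sitting through a pile of ads, my old computer kept freezing, and it was driving me crazy. So I wrote my own crawler. Once it finishes, it generates clean, ad-free pages, and my mood improved at once ^_^.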

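Look: ads everywhere, and the resources are split by day into six pages.

So I rolled up my sleeves and customised the site myself. The result is shown below.

As you can see, the customised page is much cleaner.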

Demo

Demo download:
http://download.csdn.net/detail/juwikuang/9855793

Dependencies: Python, Scrapy
To run it, just double-click run.bat.

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Spider against TTMEIJU.COM

Previously, ttmeiju.com presented all the latest TV shows and movies
on one single page, which was very convenient for users. However,
since maybe last year, ttmeiju has split that single page into six
pages, which is very annoying to me. I miss the good old days when
there was only one page... Do you? If you do, this script is for you.

Created on Sun May 28 12:09:05 2017

@author: Eric Chow
"""
import io

import scrapy
from scrapy import signals


class LatestSpider(scrapy.Spider):
    name = "latest"
    start_urls = [
        "http://www.ttmeiju.com/latest-0.html",
        "http://www.ttmeiju.com/latest-1.html",
        "http://www.ttmeiju.com/latest-2.html",
        "http://www.ttmeiju.com/latest-3.html",
        "http://www.ttmeiju.com/latest-4.html",
        "http://www.ttmeiju.com/latest-5.html",
        "http://www.ttmeiju.com/latest-6.html",
    ]
    # blacklist of TV shows
    blacklist = []
    # HTML table rows; each item is
    # (page number, row number, HTML of the row)
    rows = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(LatestSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self, spider):
        # self.initBlacklist()
        pass

    def spider_closed(self, spider, reason):
        # Responses arrive in no fixed order, so sort by (page, row) first.
        self.rows.sort(key=self.rowKey)
        # Rows are kept as text and encoded to GBK only when written out.
        with io.open("latest.html", "w", encoding="gbk") as html:
            html.write(u"<html lang=\"en\">")
            html.write(u"<head>")
            html.write(u"<title>Eric</title>")
            # html.write(u"<link rel=\"stylesheet\" href=\"common.css\">")
            html.write(u"</head>")
            html.write(u"<table>")
            for page_no, row_no, tr in self.rows:
                html.write(tr)
            html.write(u"</table>")
            html.write(u"</html>")

    def parse(self, response):
        url = response.url
        page_no = url.replace("http://www.ttmeiju.com/latest-", "").replace(".html", "")
        page_no = int(page_no)
        # date
        dateString = response.css(".active::text")[1].extract()
        header_tr = u"<tr><th colspan=6>" + dateString + u"</th></tr>"
        # The header uses row number -1 so that it sorts before the page's rows.
        self.rows.append((page_no, -1, header_tr))
        rows = response.css(".latesttable tr")
        for row_no in range(1, len(rows)):
            title = rows[row_no].css("td")[1].css("a::attr(title)").extract_first(default=u"")
            if self.inBlacklist(title):
                continue
            tr = rows[row_no].extract()
            tr = tr.replace("/Application/Home/View/Public/static/images/", "")
            tr = tr.replace("href=\"/", "href=\"http://www.ttmeiju.com/")
            # added 2017-7-31
            tr = tr.replace("<span class=\"loadspan\"><img width=\"20px;\" src=\"loading.gif\"></span>", "")
            tr = tr.replace("style=\"display:none;\"", "")
            # end added 2017-7-31
            # Filters out TV shows without subtitles;
            # comment this out if you want to keep them.
            # u'\u65e0\u5b57\u5e55' = "wu zi mu" = no subtitles
            if u'\u65e0\u5b57\u5e55' in tr:
                continue
            # If you want to filter out TV shows with embedded bilingual
            # subtitles, uncomment this.
            # u'\u5185\u5d4c\u53cc\u8bed\u5b57\u5e55' = "nei qian shuang yu zi mu"
            # if u'\u5185\u5d4c\u53cc\u8bed\u5b57\u5e55' in tr:
            #     continue
            # Filters out TV shows with resolution lower than 720p;
            # comment this out if you want to keep them.
            # u'\u666e\u6e05' = "pu qing" = standard definition
            if u'\u666e\u6e05' in tr:
                continue
            self.rows.append((page_no, row_no, tr))

    def initBlacklist(self):
        # One title fragment per line; read as GBK, the same encoding
        # that is used for latest.html.
        with io.open("blacklist.txt", encoding="gbk") as fh:
            self.blacklist = [line.strip() for line in fh if line.strip()]

    def inBlacklist(self, title):
        for b in self.blacklist:
            if b in title:
                return True
        return False

    def rowKey(self, row):
        # Sort by page number, then by row number within the page.
        page_no, row_no, tr = row
        return (page_no, row_no)

Please modify the code to suit your own needs.
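One thing you may want to adapt is the order of the output. The spider stores each row as a (page number, row number, HTML) tuple and sorts the list before writing latest.html, because responses do not arrive in any fixed order. Below is a minimal, standalone sketch of that ordering using a key function. The sample rows and the name row_key are made up for illustration and are not part of the demo.

```python
# Rows are modelled as (page_no, row_no, html) tuples, as in the spider.
# The sample data below is invented purely for illustration.
rows = [
    (1, 2, "<tr><td>page 1, row 2</td></tr>"),
    (0, 1, "<tr><td>page 0, row 1</td></tr>"),
    (1, -1, "<tr><th>page 1 header</th></tr>"),
    (0, -1, "<tr><th>page 0 header</th></tr>"),
    (1, 1, "<tr><td>page 1, row 1</td></tr>"),
]


def row_key(row):
    """Order by page first, then by row; headers use row -1 so they lead."""
    page_no, row_no, _html = row
    return (page_no, row_no)


ordered = sorted(rows, key=row_key)
order = [(page_no, row_no) for page_no, row_no, _html in ordered]
print(order)
```

Sorting on a tuple keeps the two fields separate, so there is no need to fold them into a single number. To list the newest day last instead of first, you could return (-page_no, row_no) from the key function.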

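Update, July 31, 2017

The site changed its code, so I changed mine to match.
The silly part of their code is that the download links are loaded along with the page, and only afterwards does it check whether you are logged in. That stops human visitors but does nothing against crawlers. Are they giving crawlers a green light on purpose?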

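At first I thought I had to log in, and I spent a long time looking into it. It turned out that simply running
scrapy shell http://www.ttmeiju.com/latest.html
shows the download links. No login is needed at all.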

So silly, so naive.
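To make the point concrete, here is a small sketch of the clean-up the spider applies to each row: it turns relative links into absolute ones, drops the loading placeholder, and removes the inline style that hides the links. The sample markup, the example link values, and the helper name clean_row are assumptions made for illustration; only the replaced strings come from the spider, and the real page may look different.

```python
import re

# Hypothetical row markup, loosely modelled on the strings the spider replaces.
# The actual structure of the site's rows is an assumption.
sample_tr = (
    '<tr><td><a href="/meiju/Example.Show.html" title="Example Show">'
    'Example Show</a></td>'
    '<td><span class="loadspan"><img width="20px;" src="loading.gif"></span>'
    '<a href="magnet:?xt=urn:btih:EXAMPLE" style="display:none;">download</a>'
    '</td></tr>'
)


def clean_row(tr, base_url="http://www.ttmeiju.com/"):
    """Apply the same string replacements the spider uses on a row."""
    # Make site-relative links absolute so they work from a local file.
    tr = tr.replace('href="/', 'href="' + base_url)
    # Drop the loading placeholder shown before the login check.
    tr = tr.replace(
        '<span class="loadspan"><img width="20px;" src="loading.gif"></span>', ""
    )
    # Remove the inline style that hides the already-present links.
    tr = tr.replace('style="display:none;"', "")
    return tr


cleaned = clean_row(sample_tr)
links = re.findall(r'href="([^"]+)"', cleaned)
print(links)
```

The links are already in the HTML that the server sends, so nothing has to be fetched a second time. Removing the hiding style is enough to make them visible in the generated page.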
