(Original) Python Crawler Primer (2) --- Sorting the Scraped Hot News of University of Science and Technology Liaoning

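I noticed that the USTL page source also contains each article's click count, so why not rank the articles by clicks in descending order? It is simple: Part (1) of this primer already does most of what we need. In this post we only have to add two things: a regular expression that captures the click count, and an insertion sort keyed on that number. OK, done!
A brief note on the first loop of the insertion sort in this post: it finds the largest number in the list and moves it to index 0, where it serves as a sentinel.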
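To make the sentinel idea concrete, here is a minimal standalone sketch on a plain list of integers. The function name `insertion_sort_desc` and the sample numbers are made up for illustration, and the sketch is written to run on Python 3 rather than the Python 2.7 used by the spider.

```python
def insertion_sort_desc(values):
    """Sort a list of ints in place, largest first, using a sentinel at index 0."""
    # Pass 1: carry the largest value to index 0 by comparing neighbours
    # from the end of the list towards the front.
    for i in range(len(values) - 1, 0, -1):
        if values[i] > values[i - 1]:
            values[i], values[i - 1] = values[i - 1], values[i]
    # Pass 2: ordinary insertion. No "j > 0" bounds check is needed, because
    # no value is strictly greater than the sentinel sitting at index 0.
    for i in range(2, len(values)):
        v = values[i]
        j = i
        while v > values[j - 1]:
            values[j] = values[j - 1]
            j -= 1
        values[j] = v
    return values


# Hypothetical click counts, used only to exercise the function.
clicks = [3, 1, 4, 1, 5, 9, 2, 6]
insertion_sort_desc(clicks)
print(clicks)
```

Because the comparison is strict, the inner loop always stops at index 1 at the latest, which is why the second pass can start at index 2 and omit the bounds check.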

Here is the code:

# -*- coding: utf-8 -*-
# Program: scrape the ten most-clicked USTL hot news articles
# Version: 0.1
# Date: 2014.06.30
# Language: Python 2.7
#---------------------------------
import urllib2, re, sys

# Works around this error:
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 32-34: ordinal not in range(128)
reload(sys)
sys.setdefaultencoding('utf-8')


class USTL_Spider:
    def __init__(self, url, num=10):
        self.myUrl = url
        # Holds the scraped (title and URL, click count) tuples
        self.datas = []
        self.num = num
        print 'The Spider is Starting!'

    def ustl_start(self):
        myPage = urllib2.urlopen(self.myUrl + '.html').read().decode('gb2312')
        if myPage is None:
            print 'No such is needed!'
            return
        # First get the total number of pages
        endPage = self.find_endPage(myPage)
        if endPage == 0:
            return
        # Process the data on the first page
        self.deal_data(myPage)
        # Process all pages after the first
        self.save_data(self.myUrl, endPage)

    # Get the total number of pages
    def find_endPage(self, myPage):
        # Find the line in the page source that contains "尾页" (last page), e.g.: >8</font> xxxxx title="尾页"
        # To match Chinese, the source must be UTF-8 and the pattern must be written as ur''.
        # .*? : non-greedy match of anything
        # re.S: lets . in the regex match newlines as well
        myMatch = re.search(ur'>8</font>(.*?)title="尾页"', myPage, re.S)
        endPage = 0
        if myMatch:
            # Find the number in the last-page line, e.g.: xxxx_ NUM .html
            endPage = int(re.match(r'(.*?)_(\d+)\.html', myMatch.group(1), re.S).group(2))
        else:
            print 'Cant get endPage!'
        return endPage

    # Write the tuples in the list, in order, to sort_ustl.txt in the tests folder on my D: drive
    def save_data(self, url, endPage):
        self.get_data(url, endPage)
        f = open("d:\\tests\\sort_ustl.txt", 'w')
        for item in self.datas:
            f.write(item[1] + ', ' + item[0])
        f.close()
        print 'Over!'

    # Fetch each remaining page
    def get_data(self, url, endPage):
        for i in range(2, endPage + 1):
            print 'Now the spider is crawling the %d page...' % i
            # When decoding the string, pass 'ignore' to skip illegal characters
            myPage = urllib2.urlopen(self.myUrl + '_' + str(i) + '.html').read().decode('gb2312', 'ignore')
            if myPage is None:
                print 'No such is needed!'
                return
            self.deal_data(myPage)

    # Extract the strings we want and append them to datas
    def deal_data(self, myPage):
        # We want each article's title, URL and click count. Append
        # (title and URL, click count) tuples to datas, then insertion-sort datas.
        myItems = re.findall(r'<TD width=565>.*?href="(.*?)">(.*?)</a>.*?class=textthick2> (\d+)</font>', myPage, re.S)
        for site, title, click in myItems:
            self.datas.append(('%s :%5swww.ustl.edu.cn%s\n' % (title, ' ', site), click))
        self.insert_sort()

    # Insertion sort; only the top self.num (default 10) articles by clicks are kept.
    def insert_sort(self):
        # First loop: move the entry with the most clicks to index 0 as a sentinel
        for i in range(len(self.datas) - 1, 0, -1):
            if int(self.datas[i][1]) > int(self.datas[i - 1][1]):
                tmp = self.datas[i]
                self.datas[i] = self.datas[i - 1]
                self.datas[i - 1] = tmp
        # Second loop: insert each remaining entry into the sorted prefix
        for i in range(2, len(self.datas)):
            v = self.datas[i]
            j = i
            while int(v[1]) > int(self.datas[j - 1][1]):
                self.datas[j] = self.datas[j - 1]
                j -= 1
            self.datas[j] = v
        del self.datas[self.num:len(self.datas)]


# The page we want to crawl
ustl = USTL_Spider('http://www.ustl.edu.cn/news/news/RDXW')
#ustl = USTL_Spider('http://www.ustl.edu.cn/news/news/ZHXX')
ustl.ustl_start()
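The click-count extraction in deal_data can be tried offline. The sketch below reuses the same pattern on a made-up HTML fragment: `SAMPLE` is hypothetical and shaped only to fit the regular expression, so the real page markup may well differ. The helper name `extract_items` is also an assumption, and the sketch targets Python 3.

```python
import re

# Hypothetical fragment shaped to fit the pattern; not copied from the real site.
SAMPLE = (
    '<TD width=565><a href="/news/news/RDXW/1.html">First title</a></TD>'
    '<TD><font class=textthick2> 120</font></TD>\n'
    '<TD width=565><a href="/news/news/RDXW/2.html">Second title</a></TD>'
    '<TD><font class=textthick2> 45</font></TD>\n'
)

# Same pattern as in deal_data; re.S lets "." match newlines as well.
PATTERN = re.compile(
    r'<TD width=565>.*?href="(.*?)">(.*?)</a>.*?class=textthick2> (\d+)</font>',
    re.S,
)


def extract_items(page):
    """Return (site, title, clicks) tuples, with clicks converted to int."""
    return [(site, title, int(click))
            for site, title, click in PATTERN.findall(page)]


items = extract_items(SAMPLE)
for site, title, clicks in items:
    print(clicks, title, site)
```

Converting the captured digits to int once, at extraction time, is a design choice of this sketch. The spider above stores the count as a string and calls int() at every comparison in the sort.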

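Shortcoming: once the first page has been through the insertion sort, I think that for the later pages any item whose click count is lower than that of the last element of the sorted list could simply be skipped before insertion, instead of being added to datas.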
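One way this skip could look is sketched below. It is an illustration under stated assumptions, not the author's implementation: the helper name `add_to_top`, the `limit` parameter (assumed positive) and the sample data are all hypothetical, and click counts are assumed to be ints already. The skip is applied only when the list already holds `limit` items, since before that every item still belongs in it. The sketch targets Python 3.

```python
def add_to_top(top, item, limit=10):
    """Insert (text, clicks) into `top`, which is kept sorted by clicks,
    largest first, and never longer than `limit`. Returns True if inserted."""
    clicks = item[1]
    # Early skip: if the list is full and its last (smallest) entry is
    # at least as large as this item, the item cannot make the ranking.
    if len(top) >= limit and clicks <= top[-1][1]:
        return False
    # Walk back from the end to find where the item belongs.
    j = len(top)
    while j > 0 and clicks > top[j - 1][1]:
        j -= 1
    top.insert(j, item)
    # Trim anything that fell out of the top `limit`.
    del top[limit:]
    return True


# Hypothetical data: keep only the three most-clicked entries.
top = []
for n in [5, 9, 1, 7, 3, 8]:
    add_to_top(top, ('article with %d clicks' % n, n), limit=3)
print([clicks for _, clicks in top])
```

With this shape, items below the current cutoff cost a single comparison and the list never grows past `limit`, whereas the spider above appends a whole page, re-sorts and then truncates.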

Result screenshot: (image not preserved)
References:

    1. Python list methods, worth a look:

http://www.cnblogs.com/zhengyuxin/articles/1938300.html

    2. Python default argument values (mainly because this book is quite good, and very, very thin):

http://woodpecker.org.cn/abyteofpython_cn/chinese/ch07s04.html

Reposted from: https://www.cnblogs.com/jhooon/p/3818079.html
