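Python Web Scraping Basics Project: Scraping the NetEase News Rankings

1. Basic fetching

Most scraping consists of GET requests, which retrieve data directly from the remote server.

To begin with, Python 2 ships with the urllib and urllib2 modules, which cover most ordinary page fetching (in Python 3 their functionality lives in urllib.request and urllib.parse). requests is also a very useful package, and httplib2 and similar libraries do the same job.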

Requests:
import requests
response = requests.get(url)
content = response.content  # reuse the response instead of requesting the page twice
print "response headers:", response.headers
print "content:", content

Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()  # reuse the response instead of requesting the page twice
print "response headers:", response.headers
print "content:", content

Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
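
On Python 3 the same GET can be made with urllib.request, the successor to urllib2. The following is a minimal sketch under that assumption, using only the standard library; `fetch` is a hypothetical helper name and the timeout value is an arbitrary choice, neither comes from the original script.

```python
import urllib.request


def fetch(url, timeout=10):
    """Send a GET request and return (headers, body).

    headers is a plain dict of the response headers,
    body is the raw, undecoded bytes of the response.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        headers = dict(response.headers)
        body = response.read()
    return headers, body


if __name__ == "__main__":
    # Assumption: the rankings page is still served at this address.
    headers, body = fetch("http://news.163.com/rank/")
    print("response headers:", headers)
    print("content length:", len(body))
```

Returning bytes leaves the decoding step (GBK in the script further down) to the caller, which keeps the helper usable for pages in any encoding.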

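In addition, for URLs that carry query fields, a GET request usually appends the request data to the URL itself: a ? separates the URL from the data, and multiple parameters are joined with &.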

data = {'data1':'XXXXX', 'data2':'XXXXX'}

Requests: data is a dict (JSON-style mapping)
import requests
response = requests.get(url=url, params=data)

Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
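
To make the ? and & rules concrete, here is a small Python 3 sketch that builds such a URL with the standard library and parses it back. The helper name `build_get_url`, the example.com address, and the parameter values are all made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_get_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    # Use "&" if base_url already carries a query string, "?" otherwise.
    separator = "&" if urlsplit(base_url).query else "?"
    return base_url + separator + urlencode(params)


full_url = build_get_url(
    "http://example.com/search",
    {"data1": "a b", "data2": "新闻"},
)
print(full_url)

# Parsing the query string recovers the original values.
print(parse_qs(urlsplit(full_url).query))
```

urlencode takes care of escaping spaces and non-ASCII characters, which plain string concatenation of raw values would not.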

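Notes on the NetEase news rankings project:

Use the urllib2 or requests package to fetch the pages.
Use regular expressions to parse the first-level page and XPath to parse the second-level pages.
Save the extracted titles and links to local files.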
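
The first-level step can be tried without touching the network. The sketch below (Python 3) applies the same pattern that the script's Page_Info function uses to a made-up HTML fragment; the fragment and its example.com links are assumptions for illustration, not real NetEase markup.

```python
import re

# Same pattern as in Page_Info: capture the section title inside <h2>
# and the link behind the section's "more" anchor.
TITLE_BAR = re.compile(
    r'<div class="titleBar" id=".*?"><h2>(.*?)</h2>'
    r'<div class="more"><a href="(.*?)">.*?</a></div></div>',
    re.S,
)

# Hypothetical fragment shaped like the markup the pattern expects.
sample = (
    '<div class="titleBar" id="news"><h2>News</h2>'
    '<div class="more"><a href="http://example.com/rank_news.html">more</a></div></div>'
    '<div class="titleBar" id="sports"><h2>Sports</h2>'
    '<div class="more"><a href="http://example.com/rank_sports.html">more</a></div></div>'
)

sections = TITLE_BAR.findall(sample)
for title, link in sections:
    print(title, link)
```

With two capture groups, findall returns a list of (title, link) tuples, which is the shape the rest of the script iterates over.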

Code:

# -*- coding: utf-8 -*-
# Python 2 script: uses print statements and urllib2.
import os
import sys
import urllib2
import requests
import re
from lxml import etree


def StringListSave(save_path, filename, slist):
    # Write each (title, url) pair as one line of a UTF-8 text file.
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path+"/"+filename+".txt"
    with open(path, "w+") as fp:
        for s in slist:
            fp.write("%s\t\t%s\n" % (s[0].encode("utf8"), s[1].encode("utf8")))


def Page_Info(myPage):
    '''Regex'''
    # First-level page: extract (section title, section url) pairs.
    mypage_Info = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', myPage, re.S)
    return mypage_Info


def New_Page_Info(new_page):
    '''Regex(slowly) or Xpath(fast)'''
    # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', new_page, re.S)
    # # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)">(.*?)</a></td>', new_page, re.S) # bugs
    # results = []
    # for url, item in new_page_Info:
    #     results.append((item, url+".html"))
    # return results
    # Second-level page: extract (headline, url) pairs with XPath.
    dom = etree.HTML(new_page)
    new_items = dom.xpath('//tr/td/a/text()')
    new_urls = dom.xpath('//tr/td/a/@href')
    assert(len(new_items) == len(new_urls))
    return zip(new_items, new_urls)


def Spider(url):
    i = 0
    print "downloading ", url
    myPage = requests.get(url).content.decode("gbk")
    # myPage = urllib2.urlopen(url).read().decode("gbk")
    myPageResults = Page_Info(myPage)
    save_path = u"网易新闻抓取"  # output directory ("NetEase news scrape")
    filename = str(i)+"_"+u"新闻排行榜"  # "news rankings"
    StringListSave(save_path, filename, myPageResults)
    i += 1
    for item, url in myPageResults:
        print "downloading ", url
        new_page = requests.get(url).content.decode("gbk")
        # new_page = urllib2.urlopen(url).read().decode("gbk")
        newPageResults = New_Page_Info(new_page)
        filename = str(i)+"_"+item
        StringListSave(save_path, filename, newPageResults)
        i += 1


if __name__ == '__main__':
    print "start"
    start_url = "http://news.163.com/rank/"
    Spider(start_url)
    print "end"
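
New_Page_Info collects link texts and hrefs in two separate XPath queries and asserts that the two lists have the same length. That assertion fails as soon as one `<a>` has no text, because it then contributes an href but no text node. One possible alternative, sketched below in Python 3, reads the text and href from the same element. To stay free of third-party dependencies the sketch swaps in the standard library's xml.etree.ElementTree for lxml, so it only handles well-formed markup; with lxml the same loop could run over `dom.xpath('//tr/td/a')`. The helper name `extract_links` and the sample table are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_links(table_markup):
    """Return (title, url) pairs, taking text and href from the same <a>."""
    root = ET.fromstring(table_markup.strip())
    pairs = []
    for anchor in root.findall(".//tr/td/a"):
        title = (anchor.text or "").strip()
        url = anchor.get("href")
        # Skip anchors that lack either a visible title or a target.
        if title and url:
            pairs.append((title, url))
    return pairs


# Made-up, well-formed fragment; the second anchor has no text.
sample = """
<table>
  <tr><td><a href="http://example.com/a.html">First story</a></td></tr>
  <tr><td><a href="http://example.com/empty.html"></a></td></tr>
  <tr><td><a href="http://example.com/b.html">Second story</a></td></tr>
</table>
"""

links = extract_links(sample)
for title, url in links:
    print(title, url)
```

Pairing per element keeps each title attached to its own link even when some anchors are empty, at the cost of silently dropping those anchors.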

