What to Do When a requests Crawler Hits a 404_Python Web Crawler 2 – A Few Problems Encountered When Sending Requests

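This time, let's try searching for a movie and parsing out the magnet link information.

Let's get started!

Open the URL above in Firefox and type in the movie you want to search for. Before clicking the search button, remember to open Firebug and activate the "Net" tab.

The request details left me not knowing whether to laugh or cry: after clicking the search button, the page simply jumps to an address like https://www.torrentkitty.tv/search/蝙蝠侠/, which is clearly a REST-style request. So to search for something, we only need to append the search term to the request URL. The search code looks like this: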

#!python
# encoding: utf-8

from urllib import request

def get(url):
    response = request.urlopen(url)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def main():
    url = "https://www.torrentkitty.tv/search/蝙蝠侠/"
    content = get(url)
    print(content)


if __name__ == "__main__":
    main()

Running it raised an error. The error message is as follows:

Traceback (most recent call last):
  File "D:/PythonDevelop/spider/grab.py", line 22, in <module>
    main()
  File "D:/PythonDevelop/spider/grab.py", line 17, in main
    content = get(url)
  File "D:/PythonDevelop/spider/grab.py", line 7, in get
    response = request.urlopen(url)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "D:\Program Files\python\python35\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "D:\Program Files\python\python35\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "D:\Program Files\python\python35\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "D:\Program Files\python\python35\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-12: ordinal not in range(128)

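The stack trace shows that the error is raised while the HTTP request is being sent, and that it is an encoding error: the request line is encoded as ASCII, which cannot represent the Chinese characters in the URL. Problems like this come up often when working with Chinese text in Python. Since the Chinese characters sit in the URL of an HTTP request, the fix to try is URL encoding (percent-encoding) them.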
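The failure can be reproduced without touching the network. The sketch below assumes, going by the last frame of the traceback, that the request line is simply encoded with the ASCII codec; the `request_line` string is a hand-written approximation for illustration, not the exact text urllib would send.

```python
# Hand-written approximation of the request line for the search URL (assumption).
request_line = "GET /search/蝙蝠侠/ HTTP/1.1"

failed = False
bad_part = ""
try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    failed = True
    # exc.start and exc.end mark the slice that could not be encoded.
    bad_part = exc.object[exc.start:exc.end]
    print("cannot encode:", bad_part)
```

The exception carries the offending slice, which is a quick way to confirm that the search keyword is what breaks the request.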

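In Python, a single string is URL-encoded with the quote function of the urllib.parse module, not with urlencode (urlencode expects a mapping or a sequence of key/value pairs and builds a whole query string):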

from urllib import parse  # provides quote()

def main():
    url = "https://www.torrentkitty.tv/search/" + parse.quote("蝙蝠侠")
    content = get(url)
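To make the difference between the two functions concrete, here is a small standalone comparison. It is only a sketch: the parameter name `q` is an assumption made for illustration and is not confirmed by the site.

```python
from urllib import parse

keyword = "蝙蝠侠"

# quote() percent-encodes one string, for example a single path segment.
encoded_segment = parse.quote(keyword)

# urlencode() takes a mapping and builds a complete "key=value" query string.
# The key "q" is a hypothetical parameter name.
encoded_query = parse.urlencode({"q": keyword})

print(encoded_segment)
print(encoded_query)

# The encoded text is plain ASCII and decodes back to the original keyword.
round_trip = parse.unquote(encoded_segment)
```

Because the result contains only ASCII characters, it can be placed in the request line without triggering the UnicodeEncodeError shown earlier.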

Running the request again still raised an error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

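This is an HTTP 403 error. I have run into it several times; it usually means that no User-Agent was set, which is one of the ways a site blocks crawlers. Firebug shows the User-Agent the browser sends in the request headers.

With the header information in hand, adjust the code again. This time we need a new class, Request: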

def get(url, _headers):
    # Wrap the URL in a Request so that custom headers can be attached
    req = request.Request(url, headers=_headers)
    response = request.urlopen(req)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
    url = "https://www.torrentkitty.tv/search/" + parse.quote("蝙蝠侠")
    content = get(url, headers)
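To see what the server received before the fix, the default User-Agent of urllib can be inspected offline. This is a sketch for illustration; that the site rejects this particular default value is an assumption, since the 403 response itself does not say why the request was refused.

```python
from urllib import request

# A freshly built opener carries the headers urllib sends by default.
default_headers = dict(request.build_opener().addheaders)
default_ua = default_headers["User-agent"]
print(default_ua)  # a value of the form "Python-urllib/<version>"

# Headers passed to Request take the place of that default for this request.
browser_ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
req = request.Request(
    "https://www.torrentkitty.tv/search/",
    headers={"User-Agent": browser_ua},
)

# Request stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Nothing is sent over the network here; the Request object is only constructed and inspected.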

After this change it still raised an error:

socket.timeout: The read operation timed out

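The request timed out, probably because the site is hosted overseas. So we also need to set a request timeout, which only takes one extra argument:

response = request.urlopen(req, timeout=120)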

With this change the request finally succeeded. Note that the timeout is specified in seconds.
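A long timeout helps with a slow site, but a request can still time out now and then. One possible complement is a small retry wrapper. The sketch below is not part of the original code: `fetch_with_retry`, its parameters, and the default values are all hypothetical names chosen for illustration.

```python
import socket
import time
from urllib import error


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on timeouts and connection errors.

    fetch is any callable that takes a URL and returns the page content.
    retries must be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except error.HTTPError:
            # The server did answer (for example with 403); retrying will not help.
            raise
        except (socket.timeout, error.URLError) as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(delay)  # wait a little before the next attempt
    raise last_error
```

With the two-argument `get` shown earlier, it could be called as `fetch_with_retry(lambda u: get(u, headers), url)`.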

To sum up, we ran into three problems this time:

encoding Chinese characters in the URL;

the HTTP 403 error;

setting the request timeout.
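Each of the three problems shows up as a different exception type, so they are easy to tell apart in code. The helper below is a hypothetical sketch (the name `explain` and the hint texts are made up for illustration), mapping each exception seen above to the fix that worked here.

```python
import socket
from urllib import error


def explain(exc):
    """Return a short hint for the three exceptions met in this article."""
    if isinstance(exc, UnicodeEncodeError):
        return "URL contains non-ASCII characters; percent-encode it with parse.quote"
    if isinstance(exc, error.HTTPError) and exc.code == 403:
        return "server refused the request; try sending a browser User-Agent"
    if isinstance(exc, socket.timeout):
        return "request timed out; pass a timeout to urlopen"
    return "unhandled error: %r" % (exc,)
```

A caller could wrap the request in `try ... except Exception as exc: print(explain(exc))` while experimenting with a new site.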

The complete code is below, with a few small adjustments and an added function for POST requests. The POST code calls urlencode on the dictionary of parameters:

#!python
# encoding: utf-8

from urllib import request
from urllib import parse

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}
DEFAULT_TIMEOUT = 120  # seconds


def get(url):
    req = request.Request(url, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def post(url, **paras):
    # urlencode() percent-encodes the values; the request body must be bytes
    param = parse.urlencode(paras).encode("utf8")
    req = request.Request(url, param, headers=DEFAULT_HEADERS)
    response = request.urlopen(req, timeout=DEFAULT_TIMEOUT)
    content = ""
    if response:
        content = response.read().decode("utf8")
        response.close()
    return content


def main():
    url = "https://www.torrentkitty.tv/search/"
    # Pass the raw keyword: post() already applies urlencode(), so calling
    # parse.quote() here as well would encode the value twice
    post_content = post(url, q="蝙蝠侠")
    print(post_content)
    get_content = get(url)
    print(get_content)


if __name__ == "__main__":
    main()
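One possible refinement of `get`, shown here only as a sketch: let a `with` block close the response, and decode with the charset the server declares instead of always assuming UTF-8. The name `fetch_text` and the `fallback_charset` parameter are hypothetical and not part of the original code.

```python
from urllib import request

DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"}


def fetch_text(url, timeout=120, fallback_charset="utf8"):
    """Return the response body, decoded with the declared charset if there is one."""
    req = request.Request(url, headers=DEFAULT_HEADERS)
    # The with block closes the response even if read() or decode() raises.
    with request.urlopen(req, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or fallback_charset
        return response.read().decode(charset)
```

For a page that declares no charset in its Content-Type header, the function falls back to UTF-8, which matches the behaviour of the original `get`.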

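That's it. I originally meant to cover page parsing this time, but it turned out that quite a lot needed explaining first, so page parsing has been moved to the next section.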
