N Ways to Fetch Web Resources with Python 3

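Over the past couple of days I have been learning how to fetch web resources with Python 3 and came across quite a few approaches, so here are some short notes.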

1. The simplest approach

import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
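
The read() call above returns raw bytes. Here is a minimal sketch of turning them into text, assuming the server declares its charset in the Content-Type header. The helper name fetch_text and the UTF-8 fallback are illustrative choices, not part of the original notes, and a data: URL stands in for a real page so the snippet runs offline.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and decode the body using the declared charset."""
    # The context manager closes the response even if read() fails.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL is handled by urllib itself, so no network is needed.
    print(fetch_text("data:text/plain;charset=utf-8,hello%20world"))
```

Falling back to UTF-8 with errors="replace" is only one possible policy; pages that declare no charset may need a different default.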

2. Using Request

import urllib.request

req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
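
A Request object only describes what will be sent, so it can be inspected before any network traffic happens. The sketch below, which uses only attributes of urllib.request.Request, shows this and also how attaching a body changes the default method.

```python
import urllib.request

req = urllib.request.Request("http://python.org/")

# Nothing has been sent yet; these only describe the pending request.
print(req.full_url)      # the URL as given
print(req.host)          # host parsed from the URL
print(req.get_method())  # GET, because no data is attached

# Attaching a body (bytes) switches the default method to POST.
post_req = urllib.request.Request("http://python.org/", data=b"q=1")
print(post_req.get_method())
```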

3. Sending data

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
# urlopen() requires the POST body as bytes, so encode the form string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))
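
To see what actually goes into the request body, the encoding step can be run on its own. This is a small sketch with hypothetical form fields (the e-mail address is a placeholder); it shows the percent-encoded string, the bytes that Request expects in Python 3, and a round trip through parse_qs.

```python
import urllib.parse

# Hypothetical form fields, mirroring the shape used above.
values = {"act": "login", "login[email]": "user@example.com"}

encoded = urllib.parse.urlencode(values)  # str, percent-encoded
body = encoded.encode("utf-8")            # bytes, suitable as Request data

print(encoded)
# parse_qs reverses the encoding; each value comes back as a list.
print(urllib.parse.parse_qs(encoded))
```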

4. Sending data and headers

#!/usr/bin/env python3
import urllib.parse
import urllib.request

url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
headers = {'User-Agent': user_agent}
# urlopen() requires the POST body as bytes, so encode the form string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))
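
Headers can be passed to the constructor or added afterwards with add_header(); both end up in the same place. The sketch below builds a request without sending it and lists the stored headers. The body value is a placeholder.

```python
import urllib.request

user_agent = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
req = urllib.request.Request(
    "http://localhost/login.php",
    data=b"act=login",
    headers={"User-Agent": user_agent},
)
# add_header() after construction has the same effect as the headers dict.
req.add_header("Referer", "http://www.python.org/")

# Request stores header names via str.capitalize(), e.g. "User-agent".
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Because of that capitalization, lookups with get_header() need the stored spelling, for example req.get_header("User-agent").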

5. HTTP errors

#!/usr/bin/env python3
import urllib.error
import urllib.request

req = urllib.request.Request('http://www.python.org/fish.html')
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    # e.code is the numeric status; the error page body is still readable.
    print(e.code)
    print(e.read().decode("utf8"))
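
The numeric e.code printed above can be turned into a readable phrase with the standard library's http.HTTPStatus enum. The helper name explain_status is an illustrative assumption, as is the wording used for unknown codes.

```python
from http import HTTPStatus


def explain_status(code):
    """Return e.g. '404 Not Found' for a numeric HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus only knows registered status codes.
        return f"{code} (unrecognized status)"


print(explain_status(404))
print(explain_status(799))
```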

6. Exception handling, approach 1

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    # The server answered, but with an error status.
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    # No usable response at all (DNS failure, refused connection, ...).
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
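
The order of the except clauses above matters because HTTPError is a subclass of URLError; catching URLError first would swallow HTTP errors too. The sketch below checks that relationship on hand-built exceptions, so no request is made. The function name describe and the example.invalid URL are hypothetical.

```python
import email.message
import io
from urllib.error import HTTPError, URLError


def describe(exc):
    """Classify a urllib exception in the same order as the except clauses."""
    # HTTPError must be tested first: it is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return f"server responded with status {exc.code}"
    if isinstance(exc, URLError):
        return f"could not reach server: {exc.reason}"
    return "unrelated error"


# Hand-built exceptions (hypothetical URL), so nothing touches the network.
http_err = HTTPError("http://example.invalid/", 404, "Not Found",
                     email.message.Message(), io.BytesIO(b""))
url_err = URLError("connection refused")

print(issubclass(HTTPError, URLError))
print(describe(http_err))
print(describe(url_err))
```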

7. Exception handling, approach 2

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError

req = Request("http://twitter.com/")
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError is a URLError subclass that has both `code` and `reason`,
    # so test for `code` first to tell the two cases apart.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))

8. HTTP authentication

#!/usr/bin/env python3
import urllib.request

# Create a password manager.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Create an "opener" (OpenerDirector instance).
opener = urllib.request.build_opener(handler)

# Use the opener to fetch a URL.
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
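
The password manager can be queried directly, which helps when checking which URLs a stored credential will apply to. This sketch uses a placeholder host and credentials (not the ones above) and makes no network request.

```python
import urllib.request

mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Placeholder host and credentials, for illustration only.
mgr.add_password(None, "https://example.com/", "alice", "s3cret")

# Any realm matches because the entry was stored under realm None,
# and any URL beneath the registered prefix matches as well.
print(mgr.find_user_password("Some Realm", "https://example.com/admin/"))

# A different host has no entry and yields (None, None).
print(mgr.find_user_password("Some Realm", "https://other.example.org/"))
```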

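9. Using a proxy

#!/usr/bin/env python3
import urllib.request

# ProxyHandler keys are URL schemes ('http', 'https'); a key such as 'sock5'
# would never match an http:// request. This assumes an HTTP proxy is
# listening on localhost:1080; urllib has no built-in SOCKS support.
proxy_support = urllib.request.ProxyHandler({'http': 'localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)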
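
The keys of the mapping given to ProxyHandler are URL schemes, and only requests with a matching scheme are proxied. The sketch below illustrates this without sending anything; it relies on CPython's ProxyHandler creating one "<scheme>_open" method per key, and the proxy address is an assumed local one.

```python
import urllib.request

# Assumed local HTTP proxy address, for illustration only.
handler = urllib.request.ProxyHandler({"http": "localhost:1080"})

# One "<scheme>_open" method is created per key in the mapping.
print(hasattr(handler, "http_open"))   # proxying applies to http:// URLs
print(hasattr(handler, "https_open"))  # no entry, so https:// is not proxied

# A private opener leaves the global urlopen() behaviour untouched.
opener = urllib.request.build_opener(handler)
```

Calling opener.open(url) instead of install_opener() keeps the proxy setting local to the code that needs it.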

10. Timeouts

#!/usr/bin/env python3
import socket
import urllib.request

# Timeout in seconds.
timeout = 2
socket.setdefaulttimeout(timeout)

# This call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module.
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
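
socket.setdefaulttimeout() changes the default for every socket in the process. As an alternative, urlopen() accepts a per-call timeout argument. The sketch below is one way to use it; the helper name fetch_with_timeout and the choice to return None on a timeout are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, seconds):
    """Return the body as bytes, or None if the request timed out."""
    try:
        # timeout= applies to this call only, unlike setdefaulttimeout().
        with urllib.request.urlopen(url, timeout=seconds) as response:
            return response.read()
    except socket.timeout:
        # Raised directly when reading the response takes too long.
        return None
    except urllib.error.URLError as exc:
        # A timeout while connecting arrives wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise


if __name__ == "__main__":
    # A data: URL keeps the demo offline; real URLs work the same way.
    print(fetch_with_timeout("data:,hi", 2.0))
```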


Reposted from: https://juejin.im/post/5b7686415188253345137109
