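Introduction

These are my notes from the course Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction) on 中国大学MOOC (China University MOOC). They cover five basic crawling operations. I find these techniques come up often, so I wrote them down.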

1. Crawling a JD.com product page

Code:

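import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding  # decode with the encoding guessed from the body
    print(r.text[:1000])  # first 1000 characters only
except requests.RequestException:
    print("Crawl failed")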

Output:

<!DOCTYPE HTML>
<html lang="zh-CN">
<head><!-- shouji --><meta http-equiv="Content-Type" content="text/html; charset=gbk" /><title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title><meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/><meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货，并包括HUAWEI荣耀8网购指南，以及华为荣耀8图片
、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息，网购华为荣耀8上京东,放心又轻松" /><meta name="format-detection" content="telephone=no"><meta http-equiv="mobile-agent" content="format=xhtml;    url=//item.m.jd.com/product/2967929.html"><meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/2967929.html"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="canonical" href="//item.jd.com/2967929.html"/><link rel="dns-prefetch" href="//misc.360buyimg.com"/><link rel="dns-prefetch" href="//static.360buyimg.com"/><link rel="dns-prefetch" href="//img10.360buyimg.com"/><link rel="dns
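
The line r.encoding = r.apparent_encoding in the code above matters because the page declares charset=gbk in its meta tag. The sketch below uses only the standard library and a made-up sample string, not the real response, to show what happens when GBK bytes are decoded with the wrong codec. As far as I know, Requests falls back to ISO-8859-1 for text responses whose HTTP headers carry no charset, which is the situation the wrong-codec branch imitates.

```python
# Hypothetical sample text standing in for a page body; not taken from the real response.
sample = "京东商品"

# A server sending a GBK page transmits these bytes.
raw = sample.encode("gbk")

# Decoding as ISO-8859-1 never fails, it just yields mojibake (one character per byte).
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the page actually uses restores the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

This is why the examples set the encoding from the body before reading r.text.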

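2. Crawling an Amazon product page

The link used in the course seems to be dead, so I switched to a different product link. The code is as follows: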

import requests

url = "https://www.amazon.cn/gp/product/B07GCKM8XN/ref=s9_acss_bw_cg__1b1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-3&pf_rd_r=Q96W7FJWVCD98PY5RRYN&pf_rd_t=101&pf_rd_p=6f1a9153-ec04-48ac-874b-4069a47b59c5&pf_rd_i=1976614071"
try:
    headers = {'User-Agent': 'Mozilla/5.0'}  # send a browser-like User-Agent
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])  # characters 1000 to 1999 of the page
except requests.RequestException:
    print("Crawl failed")

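Output (judging by the opfcaptcha.amazon.cn marker and the automated-access comment, this appears to be Amazon's robot-check page rather than the product page itself):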

       ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],ue_sn = "opfcaptcha.amazon.cn",ue_id = 'M99KXYNJ4HQ9HHPTW6RZ';
}
</script>
</head>
<body><!--To discuss automated access to Amazon data please contact api-services-support@amazon.com.For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
--><!--
Correios.DoNotSend
--><div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important"><div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto"><div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div><div class="a-box a-alert a-alert-info a-spacing-base"><div class="a-box-inner">

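3. Submitting search keywords to Baidu / 360 Search

Query parameters can be passed through the params argument, for example to fetch the results page for a search on Python.

Baidu search code: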

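import requests

keyword = "Python"
url = "http://www.baidu.com/s"
try:
    kv = {'wd': keyword}  # Baidu expects the keyword in the wd parameter
    r = requests.get(url, params=kv)
    print(r.request.url)  # the URL actually requested, query string included
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")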

Output:

http://www.baidu.com/s?wd=Python
428599

360 Search code:

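import requests

keyword = "Python"
url = "http://www.so.com/s"
try:
    kv = {'q': keyword}  # 360 Search expects the keyword in the q parameter
    r = requests.get(url, params=kv)
    print(r.request.url)  # the URL actually requested, query string included
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")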

Output:

https://www.so.com/s?q=Python
370007
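
Both searches rely on the same mechanism: the params dictionary is URL-encoded and appended after a question mark. Below is a minimal offline sketch of that step using the standard library's urllib.parse. It approximates what Requests does for simple string values and is not its actual implementation; build_search_url is a name I made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, params):
    """Append URL-encoded params to base_url (hypothetical helper, illustration only)."""
    return base_url + "?" + urlencode(params)


baidu = build_search_url("http://www.baidu.com/s", {"wd": "Python"})
so360 = build_search_url("http://www.so.com/s", {"q": "Python"})
print(baidu)
print(so360)

# Non-ASCII keywords are percent-encoded as UTF-8 and decode back unchanged.
encoded = build_search_url("http://www.baidu.com/s", {"wd": "爬虫"})
print(encoded)
print(parse_qs(urlsplit(encoded).query)["wd"][0])
```

Only the parameter name differs between the two engines (wd for Baidu, q for 360 Search), so one helper covers both.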

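4. Crawling and saving an image from the web

Here I used a different image URL (a link to a portrait photo). I also changed the save path, since the one used in the course was not a good fit: the image is saved directly in the current project directory under a fixed file name.

Code: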

import requests
import os

url = "https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1547834983276&di=99c15aaa5793311a798688c787bd69a2&imgtype=0&src=http%3A%2F%2Fimg05.tooopen.com%2Fimages%2F20150515%2Ftooopen_sy_124474458663.jpg"
root = "./"
path = root + "beauty.jpg"
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, "wb") as f:  # the with block closes the file
            f.write(r.content)  # r.content is the raw bytes of the response
        print("File saved successfully")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")

Output:

File saved successfully
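
The fixed file name beauty.jpg sidesteps a problem with deriving the name from the URL. The sketch below is my own illustration and downloads nothing; filename_from_url and its fallback argument are hypothetical. It takes the last path segment of a URL and falls back to a fixed name when that segment has no extension. For the image URL used in this section the last segment is just timg, so the fallback applies.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, fallback="beauty.jpg"):
    """Return the last path segment of url, or fallback if it has no extension.

    Hypothetical helper for illustration; the query string is ignored.
    """
    name = posixpath.basename(urlsplit(url).path)
    _, ext = posixpath.splitext(name)
    return name if ext else fallback


# A URL whose path ends in a real file name keeps that name.
print(filename_from_url("http://img05.tooopen.com/images/20150515/tooopen_sy_124474458663.jpg"))

# The URL from this section (shortened here) has the path /timg, so the fallback is used.
print(filename_from_url("https://timgsa.baidu.com/timg?image&quality=80"))
```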

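5. Automatic lookup of an IP address's location

We can look up an IP address by hand on the http://m.ip138.com/ page. Doing the same from code seems harder at first, but it can be done.

The approach is the same as in section 3 (searching Baidu and 360): append a parameter to the URL. Here the parameter is ip=xxx. The code is as follows: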

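import requests

url = "http://m.ip138.com/ip.asp?ip="
try:
    r = requests.get(url + '219.217.224.0')  # append the IP to query
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])  # the result sits near the end of the page
except requests.RequestException:
    print("Crawl failed")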

Output:

询" class="form-btn" /></form></div><div class="query-hd">ip138.com IP查询(搜索IP地址的地理位置)</div><h1 class="query">您查询的IP：219.217.224.0</h1><p class="result">本站主数据：黑龙江省哈尔滨市 哈
尔滨工业大学 教育网</p><p class="result">参考数据一：黑龙江省哈尔滨市 哈尔滨工业大学</p></div></div><div class="footer"><a href="http://www.miitbeian.gov.cn/" rel="nofollow" target="_blank">沪ICP备10013467号-1</a></div></div><script type="text/javascript" src="/script/common.js"></script></body>
</html>

As the output shows, this IP address belongs to Harbin Institute of Technology (哈尔滨工业大学).
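
Because the IP is concatenated straight into the URL, a malformed address is only noticed after the request has been sent. As a sketch under my own assumptions, the address can be checked with the standard ipaddress module before the request URL is built. build_ip_query_url is a hypothetical helper, and the ip.asp?ip= endpoint is simply the one used above and may have changed since.

```python
import ipaddress
from urllib.parse import urlencode

BASE_URL = "http://m.ip138.com/ip.asp"  # endpoint taken from the example above


def build_ip_query_url(ip):
    """Validate ip and return the lookup URL (hypothetical helper, no request is sent)."""
    # ip_address raises ValueError for anything that is not a valid IPv4/IPv6 address.
    checked = ipaddress.ip_address(ip)
    return BASE_URL + "?" + urlencode({"ip": str(checked)})


print(build_ip_query_url("219.217.224.0"))

try:
    build_ip_query_url("219.217.224.999")
except ValueError as err:
    print("invalid address:", err)
```

The returned string can then be passed to requests.get exactly as in the code above.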

