Python Web Crawler Study Notes

Environment

Setting up Python 3 and pip
Setting up MongoDB, MySQL, and Redis
Installing the libraries commonly used for crawling

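Fundamentals

Basic principles

What a crawler is: an automated program that requests web pages and extracts data from them.

Basic crawler workflow: send a request, receive the response, parse the content, save the data.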

>>> import requests
>>> response = requests.get('https://www.baidu.com')
>>> print(response.text)
<!DOCTYPE html>
...............

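What data to fetch: HTML documents, JSON text, images, videos, and other content

Parsing methods: direct processing, JSON parsing, regular expressions, BeautifulSoup, PyQuery, XPath

Handling JavaScript-rendered pages: analyzing the AJAX requests, Splash, PyV8, Ghost.py

How to save the data: text files (plain text, JSON, XML, etc.), relational databases (MySQL, Oracle, SQL Server), non-relational databases (MongoDB, Redis)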
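The parse and save steps can be tried without any network access. The following is a minimal sketch, not a complete crawler: a hard-coded HTML string stands in for the response body, and the markup, the field names `url` and `title`, and the file name `result.txt` are assumptions made for illustration.

```python
import json
import os
import re
import tempfile

# A canned response body stands in for the "send request / receive response"
# steps, so the sketch runs offline. The markup is made up for illustration.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# Parse: pull (href, text) pairs out of the HTML with a regular expression.
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
records = [{'url': href, 'title': text} for href, text in links]

# Save: write one JSON object per line to a plain-text file in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'result.txt')
with open(path, 'w', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read the file back to confirm what was stored.
with open(path, encoding='utf-8') as f:
    saved = [json.loads(line) for line in f]
print(saved)
```

In a real crawler the `html` string would come from the response of an HTTP request, and the storage step could target a database instead of a text file.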

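Basic usage of the urllib library

What urllib is: Python's built-in HTTP request library. It includes urllib.request (the request module), urllib.error (the exception-handling module), and urllib.parse (the URL-parsing module).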
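Of the three modules, urllib.parse is the only one that needs no network connection, so it is easy to experiment with. The sketch below shows a few of its functions; the query parameters and the relative path are made-up values for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Split a URL into its components.
parts = urlparse('http://httpbin.org/get?name=m0bu&page=2#top')
print(parts.scheme, parts.netloc, parts.path, parts.fragment)

# Turn the query string into a dict of lists.
print(parse_qs(parts.query))

# Build a query string from a dict and attach it to a base URL.
query = urlencode({'name': 'm0bu', 'page': 3})
next_url = 'http://httpbin.org/get?' + query
print(next_url)

# Resolve a relative link against the page it was found on.
resolved = urljoin('http://httpbin.org/a/b.html', '../c.html')
print(resolved)
```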

Changes compared with Python 2

# Python 2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')

# Python 3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

Usage in detail

""" Request """
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

""" POST request """
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'tttt',
    'Host': 'httpbin.org'
}
form = {
    'name': 'm0bu'
}
# The request body must be bytes: URL-encode the form, then encode it as UTF-8
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
# Headers can also be added afterwards with req.add_header(name, value)
response = request.urlopen(req)
print(response.read().decode('utf-8'))

""" Exception handling """
import socket
import urllib.request
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
except urllib.error.URLError as e:
    # The underlying socket error is available as e.reason
    if isinstance(e.reason, socket.timeout):
        print('time out')

""" Proxy """
from urllib import request
proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:1080',
    'https': 'https://127.0.0.1:1080'
})
opener = request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

""" Cookie """
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
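One detail behind the exception-handling snippet earlier in this section: urllib.error.HTTPError is a subclass of urllib.error.URLError, so the more specific class has to be checked first. The sketch below illustrates this offline by building the exception objects by hand; `describe_error` is a hypothetical helper, not part of urllib, and the error messages are made up.

```python
import socket
import urllib.error


def describe_error(exc):
    """Return a short label for an exception (hypothetical helper for illustration)."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'http error {}'.format(exc.code)
    if isinstance(exc, urllib.error.URLError):
        # URLError keeps the underlying cause in its reason attribute.
        if isinstance(exc.reason, socket.timeout):
            return 'time out'
        return 'url error: {}'.format(exc.reason)
    return 'other error'


# Exceptions are constructed manually so that no network access is needed.
timeout_error = urllib.error.URLError(socket.timeout('timed out'))
dns_error = urllib.error.URLError('name resolution failed')

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
print(describe_error(timeout_error))
print(describe_error(dns_error))
```

In a try/except around urlopen, the same ordering applies: an `except urllib.error.HTTPError` clause should come before `except urllib.error.URLError`.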

Basic usage of the Requests library

What Requests is: a simple, easy-to-use HTTP library implemented in Python.

""" GET request with parameters """
import requests
data = {
    'name': 'm0bu'
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)

""" Parsing JSON """
import requests
import json
response = requests.get("http://httpbin.org/get")
# response.json() gives the same result as json.loads(response.text)
print(response.json())
print(json.loads(response.text))

""" Binary data """
import requests
response = requests.get('https://github.com/favicon.ico')
print(response.content)
# The with block closes the file on exit, so no explicit close() is needed
with open('favicon.ico', 'wb') as f:
    f.write(response.content)

""" Adding headers, POST request """
import requests
data = {"name": "m0bu"}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
}
response = requests.post('http://httpbin.org/post', headers=headers, data=data)
print(response.json())

""" Checking the status code """
import requests
response = requests.get('http://httpbin.org/')
if response.status_code == 200:
    print('200')
else:
    exit()

""" File upload """
import requests
# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post("http://httpbin.org/post", files=files)
print(response.text)

""" Getting cookies """
import requests
r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

""" Keeping a session """
import requests
s = requests.Session()
# The Session stores the cookie set by the first request and sends it with the second
s.get("http://httpbin.org/cookies/set/number/12345678")
r = s.get('http://httpbin.org/cookies')
print(r.text)

""" Certificate verification """
import requests
from requests.packages import urllib3
# verify=False skips certificate verification; disable_warnings() silences the resulting warning
urllib3.disable_warnings()
r = requests.get('https://www.12306.cn', verify=False)
print(r.status_code)

""" Proxy settings """
import requests
proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "https://127.0.0.1:1080"
}
r = requests.get('http://httpbin.org/ip', proxies=proxies)
print(r.text)

""" Exception handling """
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException
# Catch the specific exceptions first; RequestException is the base class
try:
    r = requests.get("http://httpbin.org/get", timeout=0.1)
    print(r.status_code)
except ReadTimeout:
    print("timeout")
except HTTPError:
    print('http error')
except RequestException:
    print('error')

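Regular expression basics

What a regular expression is: a logical formula for operating on strings. Predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to strings.

Regular expressions are not unique to Python; in Python they are implemented by the re module.

Prefer broad, generic matching; use parentheses to capture the target; prefer non-greedy mode; pass re.S when the text contains newlines.

For convenience, prefer search over match, because match only matches at the start of the string; use group() to get the matched text.

re.findall searches the string and returns all matching substrings as a list.

re.compile compiles a regular expression string into a pattern object so that the pattern can be reused.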
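The tips above can be tried on a small string. This is a minimal sketch; the sample text in `content` is made up for illustration.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo'

# match() only succeeds at the start of the string; search() scans all of it.
at_start = re.match(r'\d+', content)
anywhere = re.search(r'\d+', content).group()

# Greedy: .* takes as much as it can and leaves a single digit for (\d+).
greedy = re.search(r'Hello.*(\d+)', content).group(1)
# Non-greedy: .*? takes as little as it can, so (\d+) captures the whole number.
lazy = re.search(r'Hello.*?(\d+)', content).group(1)

# Without re.S the dot does not match a newline, so this pattern fails.
no_flag = re.search(r'Hello.*?(\d+).*?Demo', content)
# With re.S the dot also matches the newline and the pattern succeeds.
with_flag = re.search(r'Hello.*?(\d+).*?Demo', content, re.S)

# compile() builds a reusable pattern; findall() returns every match as a list.
pattern = re.compile(r'[A-Z]\w+')
words = pattern.findall(content)

print(at_start, anywhere)
print(greedy, lazy)
print(no_flag, with_flag.group(1))
print(words)
```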

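BeautifulSoup in detail

A flexible and convenient web page parsing library that is efficient and supports several parsers. With it, information can be extracted from web pages without writing regular expressions.

Parsers: the Python standard library parser, the lxml HTML parser, the lxml XML parser, html5lib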

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

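The lxml parser is recommended; use html.parser when necessary.

Selecting by tag name offers weak filtering but is fast.

Use find() to query a single result and find_all() to query multiple results.

If you are familiar with CSS selectors, use select().

Remember the common ways of getting attributes and text values.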
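A short sketch of these selection methods follows. It assumes the bs4 package is installed and uses the standard-library html.parser so that lxml is not required; the HTML fragment is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration.
html = '''
<html><head><title>Demo</title></head>
<body>
<ul class="list" id="list-1">
  <li class="element"><a href="/foo">Foo</a></li>
  <li class="element"><a href="/bar">Bar</a></li>
</ul>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

# Tag selection: attribute-style access returns the first matching tag.
title = soup.title.string

# find() returns the first match, find_all() returns every match.
first_item = soup.find('li')
items = soup.find_all('li', class_='element')

# select() takes a CSS selector.
links = soup.select('ul.list a')

# Attributes are read like dict keys, text with get_text().
hrefs = [a['href'] for a in links]
texts = [a.get_text() for a in links]

print(title, len(items), first_item.a['href'])
print(hrefs, texts)
```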

PyQuery in detail

A powerful and flexible web page parsing library. If you are familiar with jQuery syntax, PyQuery is recommended.

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))

Selenium in detail

An automated testing tool that supports multiple browsers. In crawlers it is mainly used to deal with JavaScript rendering.

from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
print(browser.page_source)
browser.close()

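Official documentation

Hands-on practice

Crawling Maoyan movies with Requests and regular expressions

Target site analysis and workflow:

1. Fetch the content of a single page

2. Parse it with regular expressions

3. Save the results to a file

4. Loop over the pages using a pool of processes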

import json
import re
import time
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    """Fetch one page and return its HTML, or None on failure."""
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    """Yield one dict per movie found in the page HTML."""
    pattern = re.compile(
        r'<dd>.*?board-index.*?(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            'actor': item[3].strip()[3:],  # drop the leading 3-character label
            'time': item[4].strip()[5:],   # drop the leading 5-character label
            'score': item[5] + item[6]     # integer part + fraction part
        }


def write_to_file(content):
    """Append one record to result.txt as a JSON line."""
    with open('result.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def main(offset):
    url = 'https://maoyan.com/board/4?offset=' + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return
    for item in parse_one_page(html):
        print(item)
        write_to_file(item)


if __name__ == "__main__":
    start = time.time()
    pool = Pool()
    # Each page shows 10 entries, so the offsets are 0, 10, ..., 90
    pool.map(main, [i * 10 for i in range(10)])
    end = time.time()
    print(end - start)  # elapsed time in seconds
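The parsing step can be checked offline before any page is fetched. The sketch below runs a pattern of the same shape against a made-up `<dd>` fragment. The fragment, the film data, and the example.com URL are assumptions for illustration and are not real Maoyan markup, so the pattern may need adjusting against the live page.

```python
import re

# Hypothetical fragment shaped like one board entry (not real site data).
sample_html = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <a href="/films/1" class="image-link">
    <img data-src="https://example.com/poster.jpg" class="board-img" />
  </a>
  <p class="name"><a href="/films/1">Example Film</a></p>
  <p class="star">
    主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

# Seven capture groups: rank, image URL, title, cast, release time,
# integer part of the score, fraction part of the score.
pattern = re.compile(
    r'<dd>.*?board-index.*?(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)

movies = []
for item in pattern.findall(sample_html):
    movies.append({
        'index': item[0],
        'image': item[1],
        'title': item[2],
        'actor': item[3].strip()[3:],  # drop the 3-character label before the names
        'time': item[4].strip()[5:],   # drop the 5-character label before the date
        'score': item[5] + item[6],
    })
print(movies)
```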

Reposted from: https://www.cnblogs.com/skrr/p/11055821.html
