Getting Started with Targeted Web Crawlers in Python

I. Basic Regular Expressions

A crawler uses regular expressions to extract the pieces of a page that share a common pattern.

1. Regular expression symbols and methods

Common symbols: the dot, the star, the question mark, and parentheses
Common methods: findall, search, sub

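.: matches any single character except the newline \n
*: matches the preceding character zero or more times
?: matches the preceding character zero or one time
.*: greedy matching (takes as much as it can)
.*?: non-greedy matching (takes as little as it can, like a baby eating small, frequent meals)
(): only the data inside the parentheses is returned as the result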

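findall: matches everything that fits the pattern and returns the results as a list
search: matches and extracts the first piece of content that fits the pattern and returns a match object (or None when nothing matches)
sub: replaces the content that fits the pattern and returns the resulting string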
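
To make the three return types concrete, here is a small sketch. The sample string and the variable names are made up for illustration only.

```python
import re

text = 'id=12, id=345, id=6'  # made-up sample text

# findall returns a list with every match (here: the captured digits).
all_ids = re.findall(r'id=(\d+)', text)

# search returns a match object for the first match only, or None.
first = re.search(r'id=(\d+)', text)
first_id = first.group(1) if first else None

# sub returns a new string in which every match has been replaced.
masked = re.sub(r'\d+', '#', text)

print(all_ids)   # ['12', '345', '6']
print(first_id)  # 12
print(masked)    # id=#, id=#, id=#
```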

The regular expression library in Python:
import re

Using the dot:
a = 'xy123'
b = re.findall('x.', a)
print(b)  # ['xy']

c = re.findall('x..', a)
print(c)  # ['xy1']
The dot is a placeholder: each dot stands for exactly one character.

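Using the star:
a = 'xyxy123'
b = re.findall('x*', a)
print(b)  # ['x', '', 'x', '', '', '', '', '']
The star tries the preceding character at every position. Where there is no x it matches zero times and yields an empty string, including once at the very end of the string.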

Using the question mark:
a = 'xy123'
b = re.findall('x?', a)
print(b)  # ['x', '', '', '', '', '']

Using .*:
search_code = 'hahfajxxixxfalflsjfslfjslfjxxlovexxljsljfsxxyouxxsjflsdjflsj'

a = re.findall('xx.*xx', search_code)
print(a)
# ['xxixxfalflsjfslfjslfjxxlovexxljsljfsxxyouxx']
As long as the pattern is still satisfied, .* grabs as much as it can.

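Using .*?:
b = re.findall('xx.*?xx', search_code)
print(b)
# ['xxixx', 'xxlovexx', 'xxyouxx']
.*? eats in small portions: it stops as soon as the pattern is satisfied, so the text is split into as many separate matches as possible.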

Using (.*?) (the most important pattern in these notes):
c = re.findall('xx(.*?)xx', search_code)
print(c)
# ['i', 'love', 'you']
for each in c:
    print(each, end=' ')
# i love you

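Adding a newline:
s = '''sdkfjxxhello
xxfslfjslxxworldxxafjf'''
d = re.findall('xx(.*?)xx', s)
print(d)
# ['fslfjsl']
Reason: . matches any character except a newline, so a match cannot continue past the end of the first line.

d = re.findall('xx(.*?)xx', s, re.S)
print(d)
# ['hello\n', 'world']
With re.S, . matches any character, including a newline.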

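Comparing findall and search:
s2 = 'asdfxxIxx123xxlovexxdfd'
f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(1)
print(f)
# I

f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(2)
print(f)
# love
The number passed to group() selects which pair of parentheses to return.

f2 = re.findall('xx(.*?)xx123xx(.*?)xx', s2)
print(f2)
# [('I', 'love')]
print(f2[0][1])
# love
When the pattern contains more than one group, each item in the list returned by findall is a tuple.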

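Using sub:
s = '123adkfjslks123'
output = re.sub('123(.*?)123', '123789', s)
output2 = re.sub('123(.*?)123', '123%d' % 12434, s)
print(output)
# 123789
print(output2)
# 12312434

output3 = re.sub('123(.*?)1', 'aaaa', s)
print(output3)
# aaaa23
Only the part that matches the pattern is replaced; the rest of the string is left unchanged.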

2. Common regular expression techniques

Ways to import the library:
import re
from re import *
from re import findall, search, sub, S  # this form lets you drop the "re." prefix

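The compile() method:
re.compile() compiles a pattern string into a reusable regular expression object. (Python's built-in compile() is a different function: it compiles source code into a code object.)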
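
A minimal sketch of reusing a compiled pattern. The name number and the sample strings are invented for illustration.

```python
import re

# Compile the pattern once, then reuse the resulting object.
number = re.compile(r'\d+')

lines = ['page 1 of 20', 'no digits here', 'item 42']  # made-up input
found = [number.findall(line) for line in lines]
print(found)  # [['1', '20'], [], ['42']]

# A compiled pattern offers the same methods as the re module itself.
first = number.search('abc123def')
print(first.group())  # 123
```

For a pattern used only once or twice, calling re.findall directly is just as good; compiling mostly helps readability when one pattern is applied in many places.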

Matching digits (\d+):
a = 'dfsfjsl13233skdfjsldf'
b = re.findall(r'(\d+)', a)
print(b)  # ['13233']

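3. Examples of applying regular expressions

(1) Use findall and search to pick out the content of interest from a large block of text
Grab the large block first, then the small pieces inside it:
text_field = re.findall('<opening tag>(.*?)<closing tag>', html, re.S)[0]
the_text = re.findall(<pattern for the content to extract>, text_field, re.S)
Here <opening tag> and <closing tag> stand for whatever markup encloses the block. The patterns have to be designed by looking at the actual page text.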
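
The "large block first, small pieces second" idea can be sketched as follows. The HTML snippet, its class names, and the variable names are all invented for this illustration; a real page needs patterns written against its own source.

```python
import re

# Invented page source: only the links inside the course list are wanted.
html = '''
<div class="nav"><a href="/home">Home</a></div>
<ul class="courses">
  <li><a href="/c/1">Regex basics</a></li>
  <li><a href="/c/2">Requests basics</a></li>
</ul>
<div class="footer"><a href="/about">About</a></div>
'''

# Step 1: grab the large block that contains the items of interest.
block = re.findall(r'<ul class="courses">(.*?)</ul>', html, re.S)[0]

# Step 2: grab the small pieces, searching only inside that block.
titles = re.findall(r'<a href=".*?">(.*?)</a>', block, re.S)
print(titles)  # ['Regex basics', 'Requests basics']

# Searching the whole page directly would also pick up unrelated links.
everything = re.findall(r'<a href=".*?">(.*?)</a>', html, re.S)
print(everything)  # ['Home', 'Regex basics', 'Requests basics', 'About']
```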

(2) Use sub to implement paging
for i in range(2, total_page + 1):
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)
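
A runnable version of the paging idea, as a sketch. The URL and the page count are placeholders, not taken from any real site. One thing to watch: the fourth positional parameter of re.sub is count, so flags such as re.S have to be passed as flags=... if they are needed at all.

```python
import re

old_url = 'http://example.com/list?pageNum=1&sort=new'  # placeholder URL
total_page = 4                                          # placeholder page count

# Rewrite only the pageNum parameter; everything else in the URL is kept.
links = [re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
         for i in range(2, total_page + 1)]

for link in links:
    print(link)
# http://example.com/list?pageNum=2&sort=new
# http://example.com/list?pageNum=3&sort=new
# http://example.com/list?pageNum=4&sort=new
```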

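4. Python crawler in practice

Target site: http://www.jikexueyuan.com/
Target content: course images
How it works:
1. Save the page source
2. Read the file in Python to load the source
3. Extract the image URLs with a regular expression
4. Download the images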
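
The four steps above could be sketched roughly like this. The function names, the file naming scheme, and the sample markup are assumptions made for illustration; the real pattern depends on how the target page writes its img tags.

```python
import re
import urllib.request


def load_source(path):
    """Step 2: read the saved page source from a local file."""
    with open(path, encoding='utf-8') as f:
        return f.read()


def extract_image_urls(source):
    """Step 3: return the src value of every <img> tag in the text."""
    return re.findall(r'<img[^>]*?src="(.*?)"', source, re.S)


def download_images(urls, prefix='pic'):
    """Step 4: save each URL as pic0.jpg, pic1.jpg, ... in the current folder."""
    for i, url in enumerate(urls):
        urllib.request.urlretrieve(url, '%s%d.jpg' % (prefix, i))


# Stand-in for the text that load_source() would return (invented markup).
sample = ('<li><img class="lessonimg" src="http://example.com/a.jpg" alt="A"></li>\n'
          '<li><img src="http://example.com/b.jpg"></li>')
urls = extract_image_urls(sample)
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

download_images(urls) is deliberately not called here, since it needs network access and real image URLs.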

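II. Single-Threaded Crawlers in Python

1. Introducing and installing requests

requests:
HTTP for Humans
requests is a widely used module for making HTTP requests. It is written in Python, makes fetching web pages straightforward, and is a good HTTP module for learning to write crawlers.
Related link: https://blog.csdn.net/pittpakk/article/details/81218566
Advantages:
Covers what Python's urllib2 module was used for
More of the work is automated
A friendlier user experience
More complete functionality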

Installation:
Windows: pip install requests
Linux: sudo pip install requests
Anaconda: conda install requests

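Tips for installing third-party libraries:
Avoid easy_install, because it can install packages but cannot uninstall them.
Prefer pip.
If the download is blocked, get the package from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/

Search that page for the Python library you need and download it.
The downloaded file has a .whl extension; rename it to .zip and extract it.

The extracted requests folder is the one we need. Copy it into the Lib folder under the Python installation directory and the requests library is ready to use. (The requests-2.6.0.dist-info folder is not needed.)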

2. Building a web crawler

Getting page source with requests:
(1) Fetch the source directly
(2) Fetch the source with modified HTTP headers

import requests
html=requests.get('https://www.easyicon.net/')
print(html.text)

Fetching the source with modified HTTP headers (to avoid being detected as a crawler):

import requests
# How to obtain this value is described below.
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}
html = requests.get('https://www.easyicon.net/', headers=header)
html.encoding = 'utf-8'
print(html.text)

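Open the page you want to crawl, right-click and choose Inspect (Inspect Element), open the Network tab, and refresh the page. When new requests appear, select any one of them, scroll down to Request Headers, and copy the User-Agent value.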

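requests and regular expressions:
Use requests to fetch the page source, then use regular expressions to match the content of interest. This is the basic principle of a simple single-threaded crawler.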
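
A minimal sketch of that fetch-then-match principle. The helper names fetch and extract_title, the shortened User-Agent string, and the timeout value are my own choices, and the sample HTML is invented. requests is imported inside fetch so the parsing half can be tried without a network connection.

```python
import re


def fetch(url):
    """Download a page and return its source text (needs the requests package)."""
    import requests  # imported here so extract_title works even without requests
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)
    return response.text


def extract_title(html):
    """Return the text of the <title> element, or None if there is none."""
    match = re.search(r'<title>(.*?)</title>', html, re.S)
    return match.group(1).strip() if match else None


# Parsing demonstrated on an invented snippet instead of a live page.
print(extract_title('<html><head><title> Demo page </title></head></html>'))
# Demo page
```

On a real site the two halves are chained: extract_title(fetch(some_url)).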

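3. Submitting data to a web page

About GET and POST:
GET retrieves data from the server
POST sends data to the server
GET does its work through parameters built into the URL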
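
To illustrate how GET carries its parameters in the URL, here is a sketch using only the standard library. The base URL and the parameter names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'http://example.com/search'    # hypothetical endpoint
params = {'q': 'python', 'page': 2}   # hypothetical parameters

# Build the query string and attach it to the URL.
url = base + '?' + urlencode(params)
print(url)  # http://example.com/search?q=python&page=2

# The server side sees the same parameters when it parses the URL.
print(parse_qs(urlsplit(url).query))  # {'q': ['python'], 'page': ['2']}
```

With requests the same URL is built automatically by passing the dictionary as requests.get(base, params=params).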

Submitting a form with requests:
Core method: requests.post
Core steps: build the form, submit the form, read the response
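
The three steps can be sketched as below. The field names entities_only and page are the ones seen in the form data discussed in this section; the helper names build_form and post_page and the timeout value are assumptions for illustration.

```python
def build_form(page):
    """Step 1: build the form as a plain dictionary of field names and values."""
    return {'entities_only': 'true', 'page': str(page)}


def post_page(url, page):
    """Steps 2 and 3: submit the form with POST and return the response text."""
    import requests  # imported here so build_form can be tried without requests
    response = requests.post(url, data=build_form(page), timeout=10)
    return response.text


print(build_form(2))  # {'entities_only': 'true', 'page': '2'}
```

Changing the page argument is what replaces clicking "show more" in the browser.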

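Show More: asynchronous loading

Open the inspector, go to the Network tab, then click Show More on the page. In the Headers of the new request, the Request Method is POST.
Scroll down to Form Data, where page is 2.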

The conventional way to fetch page data:

import requests
import re
url = 'https://www.crowdfunder.com/browse/deals'
url2 = 'https://www.crowdfunder.com/browse/deals&template=false'  # also taken from the Network tab
html = requests.get(url).text
print(html)

The new method (for pages that load content asynchronously):
data = {
    'entities_only': 'true',
    'page': '1',  # can be changed
}
html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print(each)

4. Practice: a crawler

Topics covered:
Fetching pages with requests
Paging with re.sub
Matching content with regular expressions
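
Putting the three topics together, a crawler loop might look like the sketch below. Everything specific in it is hypothetical: the URL, the page parameter name, and the h2 markup. The fetch function is passed in as an argument so the loop can be tried with a fake fetcher; in real use it could be something like lambda u: requests.get(u).text.

```python
import re


def crawl(first_url, total_page, fetch):
    """Fetch pages 1..total_page and return every title found on them."""
    titles = []
    for i in range(1, total_page + 1):
        # Paging with re.sub: swap in the current page number.
        url = re.sub(r'page=\d+', 'page=%d' % i, first_url)
        html = fetch(url)
        # Matching with a regular expression (hypothetical markup).
        titles.extend(re.findall(r'<h2 class="title">(.*?)</h2>', html, re.S))
    return titles


def fake_fetch(url):
    """Stand-in for a real download: returns invented HTML for the given page."""
    page = re.search(r'page=(\d+)', url).group(1)
    return ('<h2 class="title">Course %s-A</h2>'
            '<h2 class="title">Course %s-B</h2>' % (page, page))


result = crawl('http://example.com/courses?page=1', 2, fake_fetch)
print(result)  # ['Course 1-A', 'Course 1-B', 'Course 2-A', 'Course 2-B']
```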