Pattern Matching and Regular Expressions (Part 1)

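Contents

1. Creating regex objects
2. Matching Regex objects
3. Review of regular expression matching
4. Matching more patterns with regular expressions
- (1) Grouping with parentheses
- (2) Matching multiple groups with the pipe
- (3) Optional matching with the question mark
- (4) Matching zero or more times with the star
- (5) Matching one or more times with the plus
- (6) Matching a specific number of repetitions with curly braces
- (7) Greedy and non-greedy matching
- (8) The findall() method
- (9) Character classes
- (10) Making your own character classes
- (11) The caret and dollar sign characters
- (12) The wildcard character .
- (13) Case-insensitive matching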


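def isPhoneNumber(text):
    # A valid number looks like 415-555-1011: 12 characters in total
    if len(text) != 12:
        return False
    # Characters 0-2: area code digits
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    # Characters 4-6: three more digits
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    # Characters 8-11: the last four digits
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
# Slide a 12-character window over the message and test each chunk
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('phone number found: ' + chunk)
print('Done')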

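This code matches only one phone number format, and it is already tedious to write. Regular expressions do the same job far more compactly.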

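1. Creating regex objects

All of Python's regular expression functions live in the re module.
Passing a string that holds a regular expression to re.compile() returns a Regex pattern object.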

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(type(phoneNumRegex))

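2. Matching Regex objects

A Regex object's search() method scans the string passed to it for the first match of the regular expression.
If no match is found, it returns None. If a match is found, it returns a Match object.
A Match object has a group() method that returns the text actually matched in the searched string.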

import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found:' + mo.group())

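3. Review of regular expression matching

(1) Import the regex module with import re.
(2) Create a Regex object with re.compile() (remember to use a raw string).
(3) Pass the string you want to search to the Regex object's search() method. It returns a Match object, or None if nothing matches.
(4) Call the Match object's group() method to get the matched text as a string.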

4. Matching more patterns with regular expressions

(1) Grouping with parentheses

import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('Phone number found:' + mo.group(1))
print('Phone number found:' + mo.group(2))
print('Phone number found:' + mo.group(0))
print('Phone number found:' + mo.group())

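The first pair of parentheses in the regex string is group 1, and the second pair is group 2.
Passing an integer to group() returns the text matched by that group. Passing 0, or nothing at all, returns the entire match.

print(mo.groups())  # all groups at once, as a tuple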
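Because groups() returns a tuple, its values can be unpacked into separate variables in a single assignment. A minimal sketch reusing the pattern above; the names area_code and main_number are illustrative only and do not come from the original text:

```python
import re

phone_num_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_num_regex.search('My number is 415-555-4242.')

# groups() returns every captured group as a tuple of strings,
# so it can be unpacked directly (illustrative variable names)
area_code, main_number = mo.groups()
print(area_code)    # 415
print(main_number)  # 555-4242
```

Unpacking only works when the number of variables equals the number of groups in the pattern.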

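Escape characters
The characters . ^ $ * + ? { } [ ] \ | ( ) all have special meanings in a regular expression. To match one of them literally, put a backslash (\) in front of it.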
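As an illustration of escaping, the sketch below matches a phone number whose area code is wrapped in literal parentheses. The sample text is made up for the example. It also calls re.escape(), a standard-library helper not mentioned above, which inserts the backslashes for you:

```python
import re

# \( and \) match literal parentheses; the unescaped pairs still form groups
paren_regex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = paren_regex.search('My phone number is (415) 555-4242.')
print(mo.group(1))  # (415)
print(mo.group(2))  # 555-4242

# re.escape() backslash-escapes the special characters in a literal string,
# which helps when the text to match is not known in advance
literal = '1+1=2'
escaped_regex = re.compile(re.escape(literal))
print(escaped_regex.search('We all know 1+1=2.').group())  # 1+1=2
```

Without the escaping, the + in '1+1=2' would be read as "one or more" and the pattern would not match the literal text.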

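(2) Matching multiple groups with the pipe

The | character is called a pipe. Use it when you want to match any one of several expressions.
The regular expression r'Batman|Tina Fey' matches either 'Batman' or 'Tina Fey'. Do not put spaces around the pipe, because a space in a pattern is matched literally.
If both appear in the searched string, the one that occurs first is returned.

import re
heroRegex = re.compile(r'Batman|Tina Fey')
mo = heroRegex.search('Batman and Tina Fey.')
print(mo.group())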

The pipe can also be used inside a group to match one of several alternatives that share a common prefix:

heroRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = heroRegex.search('Batman and Tina Fey and Batmobile.')
print(mo.group())

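(3) Optional matching with the question mark

The ? character marks the group before it as optional: the regex matches whether or not that piece of text is present.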

import re
batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo2.group())

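Put another way, wo may appear here either zero times or once.

(4) Matching zero or more times with the star

* means "match zero or more times": the group before the * can occur any number of times in the text, including not at all.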

import re
batRegex = re.compile(r'Bat(wo)*man')
mo = batRegex.search('The Adventures of Batman')
print(mo.group())
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())

(5) Matching one or more times with the plus

+ requires the group before it to appear at least once.

import re
batRegex = re.compile(r'Bat(wo)+man')
mo = batRegex.search('The Adventures of Batman')
print(mo)
mo1 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
mo2 = batRegex.search('The Adventures of Batwowowowoman')
print(mo2.group())

If search() finds no match, it returns None.
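Since search() can return None, calling group() on the result without checking it first raises an AttributeError. One way to guard against that is sketched below; the helper name describe_match is hypothetical and introduced only for this example:

```python
import re

bat_regex = re.compile(r'Bat(wo)+man')

def describe_match(text):
    """Return the matched text, or a placeholder when nothing matches."""
    mo = bat_regex.search(text)
    if mo is None:  # search() found nothing, so there is no Match object
        return 'no match'
    return mo.group()

print(describe_match('The Adventures of Batman'))    # no match
print(describe_match('The Adventures of Batwoman'))  # Batwoman
```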

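(6) Matching a specific number of repetitions with curly braces

Follow the group in the regex with a number in curly braces.
(Ha){3} matches exactly three repetitions.
Instead of a single number, you can give a range.
(Ha){3,5} matches 3, 4, or 5 repetitions.
You can also leave out the minimum or the maximum to leave that side unbounded.
(Ha){3,} matches 3 or more repetitions.
(Ha){,5} matches 0 to 5 repetitions.

(Ha){3} is equivalent to (Ha)(Ha)(Ha)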
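The curly-brace forms above can be tried out with a short sketch such as the following. The test strings are chosen purely for illustration:

```python
import re

# Exactly three repetitions
ha_regex = re.compile(r'(Ha){3}')
print(ha_regex.search('HaHaHa').group())  # HaHaHa
print(ha_regex.search('Ha'))              # None: only one repetition

# Between three and five repetitions
range_regex = re.compile(r'(Ha){3,5}')
print(range_regex.search('HaHaHaHa').group())  # HaHaHaHa
print(range_regex.search('HaHa'))              # None: fewer than three
```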

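(7) Greedy and non-greedy matching

Python's regular expressions are greedy by default.
In the string 'HaHaHaHaHa', (Ha){3,5} could match 3, 4, or 5 repetitions, but it returns 'HaHaHaHaHa' because a greedy match takes the longest string possible.

The non-greedy version of the curly braces matches the shortest string possible. To get it, put a question mark after the closing brace.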

import re
greedyHaRegex = re.compile(r'(Ha){3,5}')
mo = greedyHaRegex.search('HaHaHaHaHa')
print(mo.group())
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo1 = nongreedyHaRegex.search('HaHaHaHaHa')
print(mo1.group())

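(8) The findall() method

search() returns a Match object containing only the first match in the searched string.
findall() returns every match in the searched string. As long as the regex contains no groups, the result is a list of strings.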


import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('Cell: 415-555-9999 work: 212-555-0000')
print(mo.group())
mo1 = phoneNumRegex.findall('Cell: 415-555-9999 work: 212-555-0000')
print(type(mo1))
print(mo1)

# If the regex contains groups, findall() returns a list of tuples
phoneNumRegex1 = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
mo2 = phoneNumRegex1.findall('Cell: 415-555-9999 work: 212-555-0000')
print(type(mo2))
print(mo2)

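With no groups, findall() returns a list of the matched strings.
With groups, it returns a list of tuples of strings, one tuple per match and one string per group.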

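(9) Character classes

\d: any digit from 0 to 9
\D: any character that is not a digit from 0 to 9
\w: any letter, digit, or underscore (think "word" characters)
\W: any character that is not a letter, digit, or underscore
\s: a space, tab, or newline (think "space" characters)
\S: any character that is not whitespace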

import re
xmasRegex = re.compile(r'\d+\s\w+')
mo = xmasRegex.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8 maids, 7 swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
print(mo)
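The uppercase classes \D, \W, and \S are the complements of their lowercase counterparts. A small sketch on a made-up sample string shows what each one picks up:

```python
import re

text = 'Order 66, aisle_7!'  # made-up sample string

# \D+ : runs of non-digit characters
non_digits = re.findall(r'\D+', text)
print(non_digits)      # ['Order ', ', aisle_', '!']

# \W+ : runs of characters that are not letters, digits, or underscores
non_word = re.findall(r'\W+', text)
print(non_word)        # [' ', ', ', '!']

# \S+ : runs of non-whitespace characters
non_space = re.findall(r'\S+', text)
print(non_space)       # ['Order', '66,', 'aisle_7!']
```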

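(10) Making your own character classes

You can define your own character class with square brackets.
[aeiouAEIOU] matches any vowel, lowercase or uppercase.
A hyphen can be used to give a range of letters or digits.
[a-zA-Z0-9] matches any lowercase letter, uppercase letter, or digit.
Placing a caret (^) right after the opening bracket produces a negative character class.
r'[^aeiouAEIOU]' matches every character that is not a vowel.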
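The three kinds of custom class described above can be compared side by side. This is a minimal sketch with sample strings invented for the example:

```python
import re

# Any vowel, lowercase or uppercase
vowel_regex = re.compile(r'[aeiouAEIOU]')
vowels = vowel_regex.findall('Robocop eats')
print(vowels)        # ['o', 'o', 'o', 'e', 'a']

# Negative class: everything that is NOT a vowel,
# which includes punctuation and spaces
consonant_regex = re.compile(r'[^aeiouAEIOU]')
non_vowels = consonant_regex.findall('Hi, Bob')
print(non_vowels)    # ['H', ',', ' ', 'B', 'b']

# Ranges: runs of letters and digits
alnum_regex = re.compile(r'[a-zA-Z0-9]+')
tokens = alnum_regex.findall('Room 42, floor 3B')
print(tokens)        # ['Room', '42', 'floor', '3B']
```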

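(11) The caret and dollar sign characters

^: at the start of a regex, indicates that the match must occur at the beginning of the searched text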

import re
beginsWithHello = re.compile(r'^hello')
mo = beginsWithHello.search('hello world')
print(mo)
mo1 = beginsWithHello.search('He said hello.')
print(mo1)

$: at the end of a regex, indicates that the searched text must end with this pattern

import re
endsWithNumber = re.compile(r'\d$')
mo = endsWithNumber.search('Your number is 42')
print(mo)
mo1 = endsWithNumber.search('Your number is forty two')
print(mo1)

Using ^ and $ together means the entire string must match the pattern.

import re
wholeStringIsNum = re.compile(r'^\d+$')
mo = wholeStringIsNum.search('123456789')
print(mo)
mo1 = wholeStringIsNum.search('13dsaf12')
print(mo1)

(12) The wildcard character .

. (dot): matches any character except a newline (one dot matches exactly one character)

import re
atRegex = re.compile(r'.at')
mo = atRegex.findall('The cat in the hat sat on the flat mat.')
print(mo)

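.* (dot-star): matches any run of characters, including an empty one. It is greedy by default; write .*? for the non-greedy version.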

import re
nameRegex = re.compile(r'First Name:(.*) Last Name:(.*)')
mo = nameRegex.search('First Name: AL Last Name: Sweigart')
print(mo.group())

# Non-greedy: stops at the first closing angle bracket
nongreedyRegex = re.compile(r'<.*?>')
mo1 = nongreedyRegex.search('<To serve man> for dinner.>')
print(mo1.group())

# Greedy: runs on to the last closing angle bracket
greedyRegex = re.compile(r'<.*>')
mo2 = greedyRegex.search('<To serve man> for dinner.>')
print(mo2.group())

Dot-star matches everything except a newline.
Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including the newline.

import re
noNewLineRegex = re.compile(r'.*')
print(noNewLineRegex.search('Serve the public trust. \n Protect the innocent.').group())

newLineRegex = re.compile(r'.*', re.DOTALL)
print(newLineRegex.search('Serve the public trust. \n Protect the innocent.').group())

(13) Case-insensitive matching

import re
robocop = re.compile(r'robocop', re.IGNORECASE)  # re.I works too
print(robocop.search('Robocop').group())
print(robocop.search('robocop').group())
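As noted in the comment above, re.I is shorthand for re.IGNORECASE. A short sketch, with test strings chosen for illustration, checks that both spellings behave the same way:

```python
import re

long_form = re.compile(r'robocop', re.IGNORECASE)
short_form = re.compile(r'robocop', re.I)  # same flag, shorter name

results = []
for text in ('RoboCop', 'ROBOCOP', 'robocop'):
    # Both patterns match regardless of letter case;
    # group() returns the text exactly as it appears in the input
    assert long_form.search(text).group() == short_form.search(text).group()
    results.append(short_form.search(text).group())

print(results)  # ['RoboCop', 'ROBOCOP', 'robocop']
```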