Python Regular Expression Basics (Detailed)

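Regular Expressions

A regular expression is a pattern, built from predefined special characters and combinations of them, that describes a set of strings. It is used to filter or search text for one or more matching strings. It is effectively a small, highly specialized programming language. In Python it is available through the built-in re module, which can be called directly from code. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.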

import re

Common functions in Python's re module:

# View a short description of each function with the following commands
help(re.match)
help(re.search)
help(re.findall)
help(re.compile)

Help on function match in module re:

match(pattern, string, flags=0)
    Try to apply the pattern at the start of the string, returning
    a match object, or None if no match was found.

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a match object, or None if no match was found.

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.

Help on function compile in module re:

compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

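What re.match does:
It tries to match a pattern at the beginning of the string. On success it returns a match object; otherwise it returns None.

re.match parameters:
pattern: the regular expression to match
string: the string to match against
flags: flags that control how the pattern is matched, such as case-insensitive matching (re.IGNORECASE) or multi-line matching (re.MULTILINE).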

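What re.search does:
It scans the whole string and returns a match object for the first successful match. If nothing matches, it returns None.

re.search parameters:
pattern: the regular expression to match
string: the string to search
flags: flags that control how the pattern is matched, such as case-insensitive matching (re.IGNORECASE) or multi-line matching (re.MULTILINE).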
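
The difference between the two functions is easiest to see side by side. The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

text = 'say hello world'

# re.match only tries the pattern at position 0, so 'hello' is not found.
m_match = re.match(r'hello', text)

# re.search scans forward and returns the first occurrence.
m_search = re.search(r'hello', text)

# flags change how the pattern is applied; re.IGNORECASE ignores case.
m_flag = re.search(r'HELLO', text, flags=re.IGNORECASE)

print(m_match)          # None
print(m_search.span())  # (4, 9)
print(m_flag.group())   # hello
```

Because both functions return None on failure, check the result before calling methods such as group() on it.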

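What re.findall does:
It finds all non-overlapping matches in the string and returns them as a list. The list elements depend on the capturing groups in the pattern:
When the pattern has no parentheses, each element is the string matched by the whole pattern. When it has exactly one pair of parentheses, each element is the string captured by that group. When it has several pairs of parentheses, each element is a tuple of strings with one string per group, in the order in which the groups appear.
The parentheses () therefore serve to extract the parts of interest; the return value is always a list.
re.findall parameters:
pattern: the regular expression to match
string: the string to search
flags: flags that control how the pattern is matched, such as case-insensitive matching (re.IGNORECASE) or multi-line matching (re.MULTILINE).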
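
The three result shapes can be compared on one input. This is an illustrative sketch; the sample string and variable names are assumptions, not part of the original examples.

```python
import re

text = 'x=1, y=22, z=333'

# No group: each element is the whole match.
no_group = re.findall(r'[a-z]=\d+', text)

# One group: each element is only the captured part.
one_group = re.findall(r'[a-z]=(\d+)', text)

# Two groups: each element is a tuple, one string per group.
two_groups = re.findall(r'([a-z])=(\d+)', text)

print(no_group)    # ['x=1', 'y=22', 'z=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('x', '1'), ('y', '22'), ('z', '333')]
```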

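What re.compile does:
It returns a compiled pattern object (not a match object). The pattern object does nothing on its own; call its findall(), search(), or match() methods to perform the matching.

Usage: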

import re
text = 'abcde12345edcba'
regex = re.compile(r'([a-z]*)([0-9]*)([a-z]*)')
result = regex.search(text)
print(result.group(3)) # returns the content of the third group

Output:

edcba
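
A compiled pattern is mainly useful when the same expression is applied several times. The sketch below, with made-up sample strings and names, compiles once with a flag and then reuses the object through its own match, search, and findall methods.

```python
import re

# Compile once, with a flag, then reuse the pattern object.
word = re.compile(r'[a-z]+', flags=re.IGNORECASE)

at_start = word.match('Hello World 2024')    # anchored at position 0
first_found = word.search('2024 Hello')      # skips the leading digits
all_words = word.findall('Hello World 2024')

print(at_start.group())     # Hello
print(first_found.group())  # Hello
print(all_words)            # ['Hello', 'World']
```

The flags are given to re.compile, so the method calls on the pattern object do not take a flags argument.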

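match.group([group1], [group2], ...)

A parenthesized expression defines a group. A regular expression can contain several of them, so a match can contain several groups; the group method returns the groups at the given positions.
Parameters:
Returns one or more groups of the match. With a single argument, the result is a single string.
With several arguments, the result is a tuple with one item per argument. With no arguments, the argument defaults to 0, which returns the entire matched string.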

import re
text =  r'abcde12345edcba'
regex = '([a-z]*)([0-9]*)([a-z]*)'
print(re.search(regex,text).group(0))   # the entire match, here the whole string
print(re.search(regex,text).group(1))   # the first group
print(re.search(regex,text).group(2))   # the second group
print(re.search(regex,text).group(3))   # the third group
print(re.search(regex,text).group(1,3)) # a tuple of the first and third groups

Output:

abcde12345edcba
abcde
12345
edcba
('abcde', 'edcba')

Metacharacters

Special characters

.                matches any character except the newline '\n'
\               escape character

test1 = '1\\2\\3abc' # escape sequences are used here: the actual string is 1\2\3abc
test1_result = re.findall('(1...3)',test1)
print(test1_result)

Output:
['1\\2\\3']
Note: the printed result still shows '\\', but each one stands for a single '\' character.
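
Escaping also works in the other direction: a backslash in the pattern turns a metacharacter into a literal. The following sketch assumes a hypothetical Windows-style path as sample data and uses raw strings so that Python itself does not interpret the backslashes.

```python
import re

# Hypothetical sample: the actual string is C:\temp\file.txt
path = 'C:\\temp\\file.txt'

# In a raw string, r'\\' is the regex for one literal backslash.
names = re.findall(r'\\(\w+)', path)

# A backslash before '.' matches a literal dot instead of any character.
dots = re.findall(r'\.', 'a.b.c')

# re.escape builds a pattern that matches its argument literally.
literal = re.escape('1+1=2')
is_literal_match = re.fullmatch(literal, '1+1=2') is not None

print(names)             # ['temp', 'file']
print(dots)              # ['.', '.']
print(is_literal_match)  # True
```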

^                matches the start of the string
$               matches the end of the string
Note: the string may be empty, in which case the start and end positions are the same.

test2  = 'Hello World'
test2_result1 = re.findall('^Hello',test2)
test2_result2 = re.findall('World$',test2)
print(test2_result1)
print(test2_result2)

Output:
['Hello']
['World']
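
By default ^ and $ anchor to the whole string. With the re.MULTILINE flag they also match at the start and end of each line. A small sketch with a made-up two-line string:

```python
import re

text = 'Hello World\nHello Python'

# Default: ^ matches only at the very start, $ only at the very end.
default_start = re.findall(r'^Hello', text)
default_end = re.findall(r'\w+$', text)

# re.MULTILINE: ^ and $ also match around each newline.
multiline_start = re.findall(r'^Hello', text, flags=re.MULTILINE)
multiline_end = re.findall(r'\w+$', text, flags=re.MULTILINE)

print(default_start)    # ['Hello']
print(default_end)      # ['Python']
print(multiline_start)  # ['Hello', 'Hello']
print(multiline_end)    # ['World', 'Python']
```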

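[1a-e]               matches 1 or a-e, i.e. one of the characters 1, a, b, c, d, e
[^2f-j]             matches any character that is neither 2 nor in f-j, i.e. anything except 2, f, g, h, i, j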

test3 = '1234567890abcdefghij'
test3_result1 = re.findall('[1a-d]',test3)
test3_result2 = re.findall('[^2f-j]',test3)
print(test3_result1)
print(test3_result2)

Output:
['1', 'a', 'b', 'c', 'd']
['1', '3', '4', '5', '6', '7', '8', '9', '0', 'a', 'b', 'c', 'd', 'e']

R|S              matches either regular expression R or regular expression S
()                  creates a capturing group and indicates precedence

test4 = '1234567890 abcde fghij hello world'
test4_result1 = re.findall('[a-e][a-e][a-e][a-e]e|[a-z]e[a-z][a-z][a-z]',test4)
test4_result2 = re.findall('([a-e][a-e][a-e][a-e]e)|([a-z]e[a-z][a-z][a-z])',test4)
print(test4_result1)
print(test4_result2)

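Output:
['abcde', 'hello']
[('abcde', ''), ('', 'hello')]
With two capturing groups, findall returns a list of tuples; a group that did not take part in the match appears as an empty string.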

Quantifiers

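*                0 or more repetitions (greedy by default; append '?' for the non-greedy form)
+              1 or more repetitions (greedy by default; append '?' for the non-greedy form)
?               0 or 1 repetition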

test5 = '1111100000 aaaaaaabbbbb ccccc'
test5_result1 = re.findall('a*',test5)
test5_result2 = re.findall('a+',test5)
test5_result3 = re.findall('a*?',test5)
test5_result4 = re.findall('a+?',test5)
print(test5_result1)
print(test5_result2)
print(test5_result3)
print(test5_result4)

Output:
['', '', '', '', '', '', '', '', '', '', '', 'aaaaaaa', '', '', '', '', '', '', '', '', '', '', '', '']
['aaaaaaa']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
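['a', 'a', 'a', 'a', 'a', 'a', 'a']
Note: the third result above is what Python 3.6 and earlier print. Since Python 3.7 a non-empty match may start right after an empty one, so 'a*?' additionally yields each 'a' as its own match between the empty strings.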
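
The practical effect of greedy versus non-greedy matching shows up when a pattern has a closing delimiter. The sketch below uses a made-up HTML-like string; it is only an illustration of the quantifiers, not a recommended way to parse HTML.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall(r'<.+>', text)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```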

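{m}              matches exactly m repetitions of the preceding character
{m,n}          matches m to n repetitions of the preceding character; if omitted, m defaults to 0 and n to infinity
{m,n}?         matches m to n repetitions of the preceding character, non-greedily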

test6 = '1111100000 aaaaaaabbbbbb ccccc'
test6_result1 = re.findall('b{3}',test6)
test6_result2 = re.findall('b{3,5}',test6)
test6_result3 = re.findall('b{4,6}?',test6)
print(test6_result1)
print(test6_result2)
print(test6_result3)

Output:
['bbb', 'bbb']
['bbbbb']
['bbbb']

Special sequences


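\A          matches only at the start of the string
\b          matches the empty string at the beginning or end of a word
\B          matches the empty string when not at a word boundary
\d          matches any decimal digit
\D          matches any non-digit character
\s          matches whitespace: [ \t\n\r\f\v]
\S          matches any non-whitespace character
\w          matches alphanumeric characters and the underscore: [0-9a-zA-Z_] (for ASCII text)
\W          matches any character that \w does not match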

test7 = '10101abcabc %1%%%0%  101010#abc10#10 abc101010abc '
test7_result1 = re.findall(r'\d+',test7)
test7_result2 = re.findall(r'\D+',test7)
test7_result3 = re.findall(r'\w+',test7)
test7_result4 = re.findall(r'\W+',test7)
test7_result5 = re.findall(r'\s%1%%',test7)
print(test7_result1)
print(test7_result2)
print(test7_result3)
print(test7_result4)
print(test7_result5)

Output:
['10101', '1', '0', '101010', '10', '10', '101010']
['abcabc %', '%%%', '%  ', '#abc', '#', ' abc', 'abc ']
['10101abcabc', '1', '0', '101010', 'abc10', '10', 'abc101010abc']
[' %', '%%%', '%  ', '#', '#', ' ', ' ']
[' %1%%']
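
The zero-width sequences \A, \b and \B are not covered by the example above. The following sketch, using a made-up word list, shows how they restrict where a match may occur without consuming any characters.

```python
import re

text = 'cat concat cat2 scatter'

# \b on both sides: 'cat' only as a whole word.
whole_word = re.findall(r'\bcat\b', text)

# \B before 'cat': only where 'cat' follows another word character.
inside_starts = [m.start() for m in re.finditer(r'\Bcat', text)]

# \A: only at the very start of the string.
at_start = re.findall(r'\Acat', text)
not_at_start = re.findall(r'\Aconcat', text)

print(whole_word)     # ['cat']
print(inside_starts)  # [7, 17]
print(at_start)       # ['cat']
print(not_at_start)   # []
```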

Two recommended sites for learning regex:

regex101
pythonex