【362】Python Regular Expressions

Reference: Regular Expressions - Liao Xuefeng (廖雪峰)

Reference: Python3 Regular Expressions - Runoob (菜鸟教程)

Reference: Regular Expressions - Tutorial

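re.match tries to match a pattern at the start of the string; if the pattern does not match at the start, match() returns None.

re.search scans the whole string and returns the first successful match.

span(): returns the (start, end) index range of the match
group(): returns the matched text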
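
As a minimal sketch of the difference between the two functions (the sample string is made up for illustration), the same pattern can be tried with both:

```python
import re

text = "order 66 shipped"

# re.match only succeeds if the pattern matches at index 0.
at_start = re.match(r"\d+", text)

# re.search scans forward and returns the first match anywhere.
anywhere = re.search(r"\d+", text)

print(at_start)          # None
print(anywhere.group())  # 66
print(anywhere.span())   # (6, 8)
```

Because a failed match returns None, it is usually worth checking the result before calling group() or span() on it.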

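re.match only matches at the start of the string: if the start does not fit the pattern, the match fails and the function returns None. re.search, by contrast, scans through the string until it finds a match.

Python's re module provides re.sub to replace the matches of a pattern in a string.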
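
A short sketch of re.sub follows; the sample strings are assumptions chosen only to show plain replacement and a replacement that refers back to groups:

```python
import re

phone = "2004-959-559 # a phone number"

# Strip the trailing comment, then remove every non-digit character.
no_comment = re.sub(r"\s*#.*$", "", phone)
digits_only = re.sub(r"\D", "", no_comment)

# A replacement string can refer back to captured groups with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(no_comment)   # 2004-959-559
print(digits_only)  # 2004959559
print(swapped)      # world hello
```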

The compile function compiles a regular expression into a Pattern object, whose methods such as match() and search() can then be used.
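
A minimal sketch of reusing a compiled pattern (the variable names and sample text are illustrative only):

```python
import re

# Compile once, then reuse the Pattern object's methods.
number = re.compile(r"\d+")

first = number.search("one12two34")        # first match anywhere
at_start = number.match("one12two34")      # None: the string starts with a letter
all_numbers = number.findall("one12two34")

print(first.group())  # 12
print(at_start)       # None
print(all_numbers)    # ['12', '34']
```

Compiling is mostly a convenience when the same pattern is applied many times; the module-level functions accept the pattern string directly.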

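findall finds all non-overlapping substrings matched by the pattern and returns them as a list; if nothing matches, it returns an empty list.

Note: match and search find one match; findall finds all of them.

finditer is similar to findall: it finds all substrings matched by the pattern, but returns them as an iterator of match objects.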
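
The following sketch contrasts the two on a made-up string; note how capturing groups change what findall returns:

```python
import re

text = "a1 b22 c333"

# findall returns the matched substrings as a list.
numbers = re.findall(r"\d+", text)

# With capturing groups, findall returns the groups instead of the whole match.
pairs = re.findall(r"([a-z])(\d+)", text)

# finditer yields Match objects, so positions are available as well.
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(numbers)  # ['1', '22', '333']
print(pairs)    # [('a', '1'), ('b', '22'), ('c', '333')]
print(spans)    # [('1', (1, 2)), ('22', (4, 6)), ('333', (8, 11))]
```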

split splits the string at every substring the pattern matches and returns the pieces as a list.
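
A brief sketch of re.split with assumed sample inputs, including the optional maxsplit argument and the effect of a capturing group:

```python
import re

# Split on commas or semicolons, allowing optional surrounding whitespace.
fields = re.split(r"\s*[,;]\s*", "a, b;c ,d")

# maxsplit limits the number of splits performed.
head_tail = re.split(r"\s+", "one two three", maxsplit=1)

# A capturing group in the pattern keeps the separators in the result.
with_separators = re.split(r"([,;])", "a,b;c")

print(fields)           # ['a', 'b', 'c', 'd']
print(head_tail)        # ['one', 'two three']
print(with_separators)  # ['a', ',', 'b', ';', 'c']
```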

\d matches any digit, while \D matches any nondigit.

\w matches any character that can be part of a word (Python identifier), that is, a letter, the underscore or a digit, while \W matches any other character.

\s matches any whitespace character (a space, but also Tab, newline and the like), while \S matches any non-whitespace character.

. matches any character except a newline.
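
To make the character classes concrete, here is a small sketch on an invented sample string:

```python
import re

sample = "id_42\tok"

digits = re.findall(r"\d", sample)      # digit characters
non_word = re.findall(r"\W", sample)    # not a letter, digit or underscore
whitespace = re.findall(r"\s", sample)  # whitespace, including the tab

# "." does not match the newline by default, so the match starts after it.
any_two = re.search(r"..", "a\nbc").group()

print(digits)      # ['4', '2']
print(non_word)    # ['\t']
print(whitespace)  # ['\t']
print(any_two)     # bc
```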

* matches zero or more repetitions (>= 0) of the item before it: a single character, or several characters grouped with parentheses.

# Matches at the leftmost position, i.e. zero characters
>>> re.search(r'\d*', 'a123456b')
<re.Match object; span=(0, 0), match=''>
# Matches the longest possible run
>>> re.search(r'\d\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>
>>> re.search(r'\d\d*', 'a123456b')
<re.Match object; span=(1, 7), match='123456'>
# One digit followed by whole pairs of digits
>>> re.search(r'\d(\d\d)*', 'a123456b')
<re.Match object; span=(1, 6), match='12345'>

+ matches one or more repetitions (>= 1) of the item before it: a single character, or several characters grouped with parentheses.

>>> re.search(r'.\d+', 'a123456b')
<re.Match object; span=(0, 7), match='a123456'>
>>> re.search(r'(.\d)+', 'a123456b')
<re.Match object; span=(0, 6), match='a12345'>

? matches zero or one occurrence of the item before it: a single character, or several characters grouped with parentheses.

>>> re.search(r'\s(\d\d)?\s', 'a 12 b')
<re.Match object; span=(1, 5), match=' 12 '>
>>> re.search(r'\s(\d\d)?\s', 'a  b')
<re.Match object; span=(1, 3), match='  '>
>>> re.search(r'\s(\d\d)?\s', 'a 1 b')
# No output: the match failed and None was returned

[] matches any one of the characters listed inside it. Most characters that otherwise need escaping do not need it inside the brackets; for example, [.] is a literal dot.

>>> re.search(r'[.]', 'abcabc.123456.defdef')
<re.Match object; span=(6, 7), match='.'>
# Each position matches any one of the characters inside the brackets
>>> re.search(r'[cba]+', 'abcabc.123456.defdef')
<re.Match object; span=(0, 6), match='abcabc'>
>>> re.search(r'.[\d]*', 'abcabc.123456.defdef')
<re.Match object; span=(0, 1), match='a'>
>>> re.search(r'\.[\d]*', 'abcabc.123456.defdef')
<re.Match object; span=(6, 13), match='.123456'>
>>> re.search(r'[.\d]+', 'abcabc.123456.defdef')
<re.Match object; span=(6, 14), match='.123456.'>

{n} matches exactly n repetitions of the preceding item;

{n,m} matches between n and m repetitions of the preceding item;
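
A small sketch of the counted quantifiers on assumed inputs; the last line illustrates that Python's re treats braces containing a space as ordinary text:

```python
import re

# {n}: exactly n repetitions.
exact = re.search(r"\d{3}", "12 3456").group()

# {n,m}: between n and m repetitions, as many as possible (greedy).
ranged = re.search(r"\d{2,4}", "1234567").group()

# No spaces are allowed inside the braces: "{2, 4}" is read as literal text,
# so this pattern looks for a digit followed by the characters "{2, 4}".
literal = re.search(r"\d{2, 4}", "1234567")

print(exact)    # 345
print(ranged)   # 1234
print(literal)  # None
```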

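[0-9a-zA-Z\_] matches one digit, letter or underscore;

[0-9a-zA-Z\_]+ matches a string made up of at least one digit, letter or underscore, such as 'a100', '0_Z', 'Py3000' and so on;

[a-zA-Z\_][0-9a-zA-Z\_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters or underscores, i.e. a valid (ASCII) Python variable name;

[a-zA-Z\_][0-9a-zA-Z\_]{0,19} limits the variable name more precisely to 1-20 characters (1 leading character plus at most 19 more). The braces must not contain a space: Python reads {0, 19} as literal text, not as a quantifier.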
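
As a sketch of how the length-limited pattern might be used for validation (the helper name is_short_identifier is hypothetical, and only ASCII names are considered):

```python
import re

# ASCII identifier: a letter or underscore first, then up to 19 more
# letters, digits or underscores (1-20 characters in total).
IDENTIFIER = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")


def is_short_identifier(name):
    # fullmatch requires the whole string to match, not just a prefix.
    return IDENTIFIER.fullmatch(name) is not None


print(is_short_identifier("Py3000"))    # True
print(is_short_identifier("3abc"))      # False
print(is_short_identifier("a" * 21))    # False
```

Using fullmatch avoids having to add ^ and $ to the pattern; re.search alone would accept any string that merely contains a valid name.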

- inside [] denotes a range; a hyphen right next to either bracket is treated as a literal hyphen.
Ranges of letters or digits can be provided within square brackets, letting a hyphen separate the first and last characters in the range. A hyphen placed after the opening square bracket or before the closing square bracket is interpreted as a literal character:

>>> re.search(r'[e-h]+', 'ahgfea')
<re.Match object; span=(1, 5), match='hgfe'>
>>> re.search(r'[B-D]+', 'ABCBDA')
<re.Match object; span=(1, 5), match='BCBD'>
>>> re.search(r'[4-7]+', '154465571')
<re.Match object; span=(1, 8), match='5446557'>
>>> re.search(r'[-e-gb]+', 'a--bg--fbe--z')
<re.Match object; span=(1, 12), match='--bg--fbe--'>
>>> re.search(r'[73-5-]+', '14-34-576')
<re.Match object; span=(1, 8), match='4-34-57'>

^ inside [], right after the opening bracket, means any character other than the ones that follow.

Within square brackets, a caret placed after the opening square bracket excludes the characters that follow within the brackets:

>>> re.search(r'[^4-60]+', '0172853')
<re.Match object; span=(1, 5), match='1728'>
>>> re.search(r'[^-u-w]+', '-stv')
<re.Match object; span=(1, 3), match='st'>

A|B matches A or B, so (P|p)ython matches 'Python' or 'python'.

Whereas square brackets surround alternative characters, a vertical bar separates alternative patterns:

>>> re.search(r'two|three|four', 'one three two')
<re.Match object; span=(4, 9), match='three'>
>>> re.search(r'|two|three|four', 'one three two')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'[1-3]+|[4-6]+', '01234567')
<re.Match object; span=(1, 4), match='123'>
>>> re.search(r'([1-3]|[4-6])+', '01234567')
<re.Match object; span=(1, 7), match='123456'>
>>> re.search(r'_\d+|[a-z]+_', '_abc_def_234_')
<re.Match object; span=(1, 5), match='abc_'>
>>> re.search(r'_(\d+|[a-z]+)_', '_abc_def_234_')
<re.Match object; span=(0, 5), match='_abc_'>

^ matches the start of the string (and of each line under re.MULTILINE); ^\d means the text must start with a digit.

$ matches the end of the string (and of each line under re.MULTILINE); \d$ means the text must end with a digit.

A caret at the beginning of the pattern string matches the beginning of the data string; a dollar at the end of the pattern string matches the end of the data string:

>>> re.search(r'\d*', 'abc')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'^\d*', 'abc')
<re.Match object; span=(0, 0), match=''>
>>> re.search(r'\d*$', 'abc')
<re.Match object; span=(3, 3), match=''>
>>> re.search(r'^\d*$', 'abc')
>>> re.search(r'^\s*\d*\s*$', ' 345 ')
<re.Match object; span=(0, 5), match=' 345 '>
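
For text with several lines, the anchors behave differently depending on the re.MULTILINE flag. A minimal sketch, with a made-up three-line string:

```python
import re

text = "1 apple\nbanana 2\n3 cherries"

# Default: ^ anchors only at the very start of the string.
default = re.findall(r"^\d", text)

# re.MULTILINE: ^ also anchors right after each newline,
# and $ anchors right before each newline.
multiline = re.findall(r"^\d", text, re.MULTILINE)
line_end_digits = re.findall(r"\d$", text, re.MULTILINE)

print(default)          # ['1']
print(multiline)        # ['1', '3']
print(line_end_digits)  # ['2']
```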

Outside a character class, ^ and $ keep their special meaning wherever they appear in the pattern, so a backslash is needed to match them literally. Inside [], ^ is only special right after the opening bracket.

Escaping a dollar at the end of the pattern string, or escaping a caret at the beginning of the pattern string or after the opening square bracket of a character class, makes dollar and caret lose the special meaning they have in those contexts and lets them be treated as literal characters:

>>> re.search(r'\$', '$*')
<re.Match object; span=(0, 1), match='$'>
>>> re.search(r'\^', '*^')
<re.Match object; span=(1, 2), match='^'>
>>> re.search(r'[\^]', '^*')
<re.Match object; span=(0, 1), match='^'>
>>> re.search(r'[^^]', '^*')
<re.Match object; span=(1, 2), match='*'>

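^(\d{3})-(\d{3,8})$ defines two groups, so the area code and the local number can be extracted directly from the matched string:

group(0): always the whole matched text;
group(1): the first group;
group(2): the second group, and so on.

Group order: groups are numbered by the position of their opening parenthesis, from left to right.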
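
The phone-number pattern above can be tried as follows; the number '010-12345' is an assumed sample input:

```python
import re

m = re.match(r"^(\d{3})-(\d{3,8})$", "010-12345")

whole = m.group(0)         # the entire match
area_code = m.group(1)     # first parenthesized group
local_number = m.group(2)  # second parenthesized group
both = m.groups()          # all groups as a tuple

print(whole)         # 010-12345
print(area_code)     # 010
print(local_number)  # 12345
print(both)          # ('010', '12345')
```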

Parentheses allow matched parts to be saved. The object returned by re.search() has a group() method that, without argument, returns the whole match and, with arguments, returns partial matches; it also has a groups() method that returns all partial matches:

>>> R = re.search(r'((\d+) ((\d+) \d+)) (\d+ (\d+))', '  1 23 456 78 9 0 ')
>>> R
<re.Match object; span=(2, 15), match='1 23 456 78 9'>
>>> R.group()
'1 23 456 78 9'
>>> R.groups()
('1 23 456', '1', '23 456', '23', '78 9', '9')
>>> [R.group(i) for i in range(len(R.groups()) + 1)]
['1 23 456 78 9', '1 23 456', '1', '23 456', '23', '78 9', '9']

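(?:...) groups a subpattern, for instance a choice between two alternatives, without the parentheses counting as a capturing group.

>>> R = re.search(r'([+-]?(?:0|[1-9]\d*)).*([+-]?(?:0|[1-9]\d*))', ' a = -3014, b = 0 ')
>>> R
<re.Match object; span=(5, 17), match='-3014, b = 0'>
>>> R.groups()
('-3014', '0')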

.* matches any run of zero or more characters other than the newline \n (a carriage return \r is still matched).
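
A short sketch of how .* treats line breaks, with and without the re.DOTALL flag (the sample text is invented):

```python
import re

text = "first line\nsecond line"

# By default "." stops at the newline, so ".*" covers only the first line.
default = re.search(r".*", text).group()

# With re.DOTALL, "." also matches "\n" and ".*" runs to the end of the string.
dotall = re.search(r".*", text, re.DOTALL).group()

# "\r" is not excluded: "." matches a carriage return even without DOTALL.
carriage = re.search(r"a.b", "a\rb")

print(default)               # first line
print(dotall == text)        # True
print(carriage is not None)  # True
```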

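Pattern	Description
^	Matches the start of the string.
$	Matches the end of the string.
.	Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline.
[...]	A set of characters, listed individually: [amk] matches 'a', 'm' or 'k'.
[^...]	Any character not in the set: [^abc] matches any character other than a, b and c.
re*	Matches 0 or more occurrences of the preceding expression.
re+	Matches 1 or more occurrences of the preceding expression.
re?	Matches 0 or 1 occurrence of the preceding expression. A ? placed after another quantifier, as in *? or +?, makes that quantifier non-greedy.
re{n}	Matches exactly n occurrences of the preceding expression. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
re{n,}	Matches at least n occurrences of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
re{n,m}	Matches between n and m occurrences of the preceding expression, greedily.
a|b	Matches a or b.
(re)	Matches the expression inside the parentheses and captures it as a group.
(?imx)	Turns on the optional flags i, m or x. In Python these inline flags apply to the whole pattern and belong at its start.
(?-imx)	Turns off the optional flags i, m or x (Perl syntax). Python's re only accepts the scoped form (?-imx: re).
(?: re)	Like (...), but does not create a group.
(?imx: re)	Uses the optional flags i, m or x inside the parentheses only.
(?-imx: re)	Turns off the optional flags i, m or x inside the parentheses only.
(?#...)	Comment.
(?= re)	Positive lookahead. Succeeds if the contained expression matches at the current position and fails otherwise. It consumes no characters: the rest of the pattern is tried from the same position.
(?! re)	Negative lookahead. The opposite of the positive lookahead: succeeds when the contained expression does not match at the current position.
(?> re)	Atomic group: matches independently, without backtracking into it (available in Python 3.11 and later).
\w	Matches letters, digits and the underscore.
\W	Matches anything that is not a letter, digit or underscore.
\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].
\S	Matches any non-whitespace character.
\d	Matches any digit; for ASCII text, equivalent to [0-9].
\D	Matches any nondigit.
\A	Matches the start of the string.
\Z	Matches the end of the string. In Python it matches only at the very end; stopping before a trailing newline is the Perl behaviour.
\z	Matches the very end of the string (Perl syntax); it is not available in older Python versions, where \Z plays this role.
\G	Matches the position where the last match finished (Perl syntax; not supported by Python's re).
\b	Matches a word boundary, i.e. the position between a word character and a non-word character. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".
\n, \t, etc.	Match a newline, a tab, and so on.
\1...\9	Matches the content of the n-th group.
\10	Matches the content of the 10th group if such a group has matched; otherwise it refers to an octal character code.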
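
A few of the less obvious table entries (word boundaries, lookaheads and backreferences) can be illustrated with the sketch below; all sample strings are assumptions made for the example:

```python
import re

# \b: "er" at a word boundary is found in "never" but not in "verb".
boundary_hit = re.search(r"er\b", "never")
boundary_miss = re.search(r"er\b", "verb")

# (?=...): match "Python" only when "3" follows; the "3" is not consumed.
lookahead = re.search(r"Python(?=3)", "Python2 Python3")

# (?!...): match "Python" only when "3" does not follow.
negative = re.search(r"Python(?!3)", "Python3 Python2")

# \1: refers back to whatever group 1 matched (here, a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "it was the the best")

print(boundary_hit.span())  # (3, 5)
print(boundary_miss)        # None
print(lookahead.span())     # (8, 14)
print(negative.span())      # (8, 14)
print(doubled.group())      # the the
```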

Examples:

\d{3}: matches 3 digits

\s+: at least one whitespace character

\d{3,8}: 3 to 8 digits
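
Combining the three pieces gives a pattern for "3 digits, some whitespace, 3 to 8 digits". The sketch below uses re.fullmatch and made-up inputs to show what is accepted and rejected:

```python
import re

pattern = r"\d{3}\s+\d{3,8}"

ok = re.fullmatch(pattern, "010   12345")     # 3 digits, spaces, 5 digits
too_short = re.fullmatch(pattern, "010 12")   # only 2 digits after the space
no_space = re.fullmatch(pattern, "01012345")  # \s+ needs at least one whitespace

print(ok is not None)  # True
print(too_short)       # None
print(no_space)        # None
```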

>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
>>> mySent.split(' ')
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']
>>> import re
>>> listOfTokens = re.split(r'\W+', mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
>>> [tok for tok in listOfTokens if len(tok) > 0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok) > 2]
['this', 'book', 'the', 'best', 'book', 'python', 'have', 'ever', 'laid', 'eyes', 'upon']

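Reference: Python Crawler (5) -- Regular Expressions - 小学森也要学编程 - cnblogs (博客园)

The following removes the quoted text, quotation marks included; note that .* is used to match arbitrary characters: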

import re

a = 'Sir Nina said: \"I am a Knight,\" but I am not sure'
b = "Sir Nina said: \"I am a Knight,\" but I am not sure"
print(re.sub(r'"(.*)"', '', a),
      re.sub(r'"(.*)"', '', b), sep='\n')

Output:
Sir Nina said:  but I am not sure
Sir Nina said:  but I am not sure
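
One thing to keep in mind with .* is that it is greedy. As a hedged sketch on an invented sentence with two quoted parts, compare it with the non-greedy form .*?:

```python
import re

text = 'He said "yes" and then "no" before leaving'

# Greedy: .* runs from the first quote to the last one.
greedy = re.sub(r'".*"', "", text)

# Non-greedy: .*? stops at the nearest closing quote,
# so each quoted pair is removed separately.
lazy = re.sub(r'".*?"', "", text)

print(greedy)  # He said  before leaving
print(lazy)    # He said  and then  before leaving
```

With a single quoted part, as in the example above, both forms give the same result.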

Example from Eric Martin's learning materials of COMP9021

The following function checks that its argument is a string that, from the beginning (^):

1. consists of possibly some spaces: ␣*
2. followed by an opening parenthesis: \(
3. possibly followed by spaces: ␣*
4. possibly followed by either + or -: [+-]?
5. followed by either 0, or a nonzero digit followed by any sequence of digits: 0|[1-9]\d*
6. possibly followed by spaces: ␣*
7. followed by a comma: ,
8. followed by characters matching the pattern described by 3-6
9. followed by a closing parenthesis: \)
10. possibly followed by some spaces: ␣*
11. all the way to the end: $

Pairs of parentheses surround both numbers to match to capture them. For point 5, a surrounding pair of parentheses is needed; ?: makes it non-capturing:

>>> def validate_and_extract_payoffs(provided_input):
...     pattern = r'^ *\( *([+-]?(?:0|[1-9]\d*)) *,'\
...               r' *([+-]?(?:0|[1-9]\d*)) *\) *$'
...     match = re.search(pattern, provided_input)
...     if match:
...         return (match.groups())
...
>>> validate_and_extract_payoffs('(+0, -7 )')
('+0', '-7')
>>> validate_and_extract_payoffs('  (-3014,0)  ')
('-3014', '0')

Reposted from: https://www.cnblogs.com/alex-bn-lee/p/10325559.html
