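Introduction

This article introduces Python regular expressions and pattern-matching search. For more articles in the advanced Python series, see the series "Python 进阶学习玩转数据系列".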

Contents:

re module method summary
match() vs. search()
Common regular expression metacharacters
Using raw strings
MatchObject
findall()
Matching Flags
re.IGNORECASE
re.ASCII
re.DOTALL
re.MULTILINE
re.VERBOSE
String operations: re.sub and re.split
re.compile for reusable patterns
String methods vs. regex matching
Email matching example
Basic regular expression syntax in practice
Simple character matching
Characters with special meaning
[ ] Square brackets
Wildcards match repeated characters
Naming extracted components

re module method summary

Method	Description
match (pattern, string, flags)	Match from the beginning of the string; return a MatchObject if a match exists, None otherwise
search (pattern, string, flags)	Search the entire string for a match; return a MatchObject if one exists, None otherwise
findall (pattern, string, flags)	Return a list of all matches of the pattern within the string
finditer (pattern, string, flags)	Return an iterator over the matches of the pattern in the string
fullmatch (pattern, string, flags)	Apply the pattern to the full string; return a MatchObject or None
split (pattern, string, maxsplit, flags)	Break the string up at each match of the pattern
sub (pattern, repl, string, count, flags)	Find matches and replace them with repl; return the new string
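The two less familiar entries in the table, finditer() and fullmatch(), can be sketched as follows. The sample strings and variable names are invented for illustration and are not from the original article.

```python
import re

text = "one ring, nine riders, three elves"  # made-up sample string

# finditer() yields one match object per non-overlapping match,
# so positions are available as well as the matched text.
positions = [(m.group(), m.start()) for m in re.finditer(r"\b\w+s\b", text)]
print(positions)

# fullmatch() succeeds only if the pattern covers the whole string,
# while match() is satisfied with a matching prefix.
full_ok = re.fullmatch(r"\d{4}", "2024")
full_bad = re.fullmatch(r"\d{4}", "2024-01")
prefix_ok = re.match(r"\d{4}", "2024-01")
print(full_ok is not None, full_bad is None, prefix_ok.group())
```

fullmatch() is a convenient choice for validating a whole input value, since it avoids having to add ^ and $ to the pattern by hand.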

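match() vs. search()

match():
● Returns a MatchObject if zero or more characters at the beginning of the string match the regular expression pattern
● Returns None if the beginning of the string does not match the pattern

search() scans the entire string and returns:
● The corresponding MatchObject for the first location where the pattern matches
● None if no match is found

Example:

Code: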

import re

def match_search(regex, search_str):
    # re.match only tests the beginning of the string
    if re.match(regex, search_str):
        print('match: begins with {0}'.format(regex))
    else:
        print('match: {0} not found at beginning'.format(regex))
    # re.search scans the whole string
    if re.search(regex, search_str):
        print('search: contains {0}'.format(regex))
    else:
        print('search: {0} not found within'.format(regex))

lor = '''THE LORD OF THE RINGSV*art One THE FELLOWSHIP
OF THE RING J.R.R.ToIkien'''

match_search('THE', lor)
match_search('THE LORD', lor)
match_search('LORD', lor)
match_search('ToIkien', lor)

# A raw string keeps \s intact for the regex engine
regex = re.compile(r'\s+')
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

Common regular expression metacharacters

Meta Character	Description
^	start of the string
$	end of the string
\s	whitespace
\S	non-whitespace
\d	digit
\D	non-digit
\w	word character (letter, digit, or underscore)
\W	non-word character
\b	word boundary
\B	non-word boundary
.	any character except a line break
*	0 or more repetitions of the preceding item
+	1 or more repetitions of the preceding item
?	0 or 1 repetition of the preceding item
{n}	exactly n repetitions
{n,m}	from n to m repetitions
{,m}	up to m repetitions
(n|m)	n or m
[abcd]	a or b or c or d
[f-m]	one character from f through m
[^xyz]	any character other than x, y, or z
[a-zA-Z]	any one ASCII letter
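A few rows of the table can be tried out directly. This is a small sketch with invented sample strings, not code from the original article.

```python
import re

line = "cat concatenate cat5 Cat"  # made-up sample string

# \b restricts the match to 'cat' as a whole word.
whole_words = re.findall(r"\bcat\b", line)

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_cat = re.search(r"^cat", line) is not None
ends_with_cat = re.search(r"cat$", line) is not None

# \d+ collects runs of digits; [^...] negates a character set.
numbers = re.findall(r"\d+", "room 101, floor 7")
consonants = re.findall(r"[^aeiou\s]", "a bc")

print(whole_words, starts_with_cat, ends_with_cat, numbers, consonants)
```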

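Using raw strings

● The backslash \ is the escape character in Python string literals, and it is also a special character in regular expressions.
● To keep these two levels of escaping from getting mixed up, write patterns as raw strings.
● A raw string is written r'...'; inside it, \ is no longer treated as a string escape character.

Example:
In an ordinary string, \b is a special symbol (the backspace character), and \\ stands for a literal \ character rather than the start of an escape sequence.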
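The difference is easy to observe in a short experiment. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

plain = "\bring\b"   # \b is converted to the backspace character (ASCII 8)
raw = r"\bring\b"    # \b stays as backslash + 'b', a regex word boundary

text = "the ring bearer"

found_plain = re.search(plain, text)  # looks for backspace + 'ring' + backspace
found_raw = re.search(raw, text)      # looks for the whole word 'ring'

print(len(plain), len(raw))           # the raw string keeps both backslashes
print(found_plain, found_raw.group())

# Doubling the backslash in an ordinary string gives the same pattern.
same_pattern = "\\bring\\b" == raw
```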

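MatchObject

match() and search() return a MatchObject when the pattern matches.
MatchObject methods:
● start(n) - returns the start index of group n
● end(n) - returns the end index of group n
● span(n) - returns the tuple (start, end) for group n
● groups() - returns a tuple containing all subgroups
● group(n) - returns the string matched by group n; group 0 is the whole match

Example:

Code: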

import re

matchobj = re.search(r'(\w+) (\w+) (\w+) (\w+)',
                     "Hobbits are an unobtrusive but very ancient people")
print("groups():", matchobj.groups())
# Group 0 is the whole match; groups 1..4 are the captured words
for i in range(len(matchobj.groups()) + 1):
    print("group({0}): {1}".format(str(i), matchobj.group(i)))
    print("start({}): {}".format(str(i), matchobj.start(i)))
    print("end({}): {}".format(str(i), matchobj.end(i)))

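findall()

findall() returns a list. If the pattern has no groups, each item is the text matched by the whole pattern (matches_1). If the pattern has groups, each item holds what the groups matched rather than the whole match: a string for a single group, or a tuple with one element per group when there are several (matches_2, matches_3).

The first regex has no parentheses, so the output is what the whole expression matched, one word per item.
The second regex has four groups, so each item is a tuple of the four words captured by the groups, not the whole matched text.
The third regex wraps the second in one more pair of parentheses, so each tuple gains the whole matched phrase as its first element. For this sentence, the second and third calls each return a list of 2 tuples.

Code: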

import re

string = "Hobbits are an unobtrusive but very ancient people"
matches_1 = re.findall(r'\w+', string)
matches_2 = re.findall(r'(\w+) (\w+) (\w+) (\w+)', string)
matches_3 = re.findall(r'((\w+) (\w+) (\w+) (\w+))', string)
print("{0}\ncontains {1} words: {2}".format(string, len(matches_1), matches_1))
print("{0}\ncontains {1} words: {2}".format(string, len(matches_2), matches_2))
print("{0}\ncontains {1} words: {2}".format(string, len(matches_3), matches_3))

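Matching Flags

● re.IGNORECASE - case-insensitive matching
● re.ASCII - match ASCII only, not Unicode
● re.VERBOSE - verbose-style regular expressions with whitespace and comments
● re.DOTALL - dot (.) matches any character, including a newline
● re.MULTILINE - ^ and $ match at the start and end of every line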

re.IGNORECASE

re.IGNORECASE, abbreviated re.I, makes matching case-insensitive.
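A minimal sketch of the flag's effect, using an invented sample string:

```python
import re

text = "The Lord of the Rings: THE FELLOWSHIP"  # made-up sample string

# Without the flag only the exact lowercase spelling is found.
case_sensitive = re.findall(r"the", text)

# With re.IGNORECASE (or re.I) every capitalization is found.
case_insensitive = re.findall(r"the", text, re.IGNORECASE)

print(case_sensitive, case_insensitive)
```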

re.ASCII

re.ASCII, abbreviated re.A, makes \w, \W, \b, \B, \d, \D, \s and \S match ASCII characters only instead of the full Unicode range.
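The effect shows up as soon as the text contains non-ASCII letters or digits. In this sketch the sample string is an assumption chosen for illustration: it contains an accented letter and three full-width digits, written as escapes so the snippet does not depend on file encoding.

```python
import re

# "café", ASCII digits, then full-width digits U+FF11..U+FF13
text = "caf\u00e9 123 \uff11\uff12\uff13"

unicode_words = re.findall(r"\w+", text)          # default: Unicode-aware
ascii_words = re.findall(r"\w+", text, re.ASCII)  # ASCII letters/digits only

unicode_digits = re.findall(r"\d+", text)
ascii_digits = re.findall(r"\d+", text, re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```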

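re.DOTALL

re.DOTALL, abbreviated re.S: "dot" refers to . and "all" to every character, so with this flag . matches everything, including the newline \n. In the default mode, . does not match \n.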
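A short sketch with an invented two-line string shows the difference:

```python
import re

text = "first line\nsecond line"  # made-up sample with one newline

# By default . stops at the newline, so the pattern cannot span both lines.
default = re.search(r"first.*second", text)

# With re.DOTALL (or re.S) . also matches \n.
dotall = re.search(r"first.*second", text, re.DOTALL)

print(default, repr(dotall.group()))
```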

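re.MULTILINE

re.MULTILINE, abbreviated re.M, is multi-line mode. When a string contains newlines \n, the default mode anchors ^ only at the start of the whole string and $ only at its end (or just before a final newline). In multi-line mode, ^ and $ also match at the start and end of each line.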
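The following sketch compares the two modes on an invented three-line string:

```python
import re

text = "Frodo left\nSam stayed\nFrodo returned"  # made-up sample

# Default mode: ^ and $ refer to the whole string.
default_starts = re.findall(r"^\w+", text)
default_ends = re.findall(r"\w+$", text)

# Multi-line mode: ^ and $ refer to each line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts, default_ends)
print(line_starts, line_ends)
```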

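re.VERBOSE

A regular expression is usually written on a single line, which can be hard to read. Verbose mode lets you spread the pattern over several lines and add comments. It differs from normal mode in two ways:

Whitespace is ignored:
● Spaces, tabs, and newlines in the pattern are ignored.
● To match a literal space, escape it with a backslash (or put it inside a character class).

Comments are ignored:
● In verbose mode a comment works as in Python code: it starts with # and runs to the end of the line.

Code: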

import re
pattern = r'''
(\(?\d{3}\)?)? # optional area code, parentheses optional
[-\s.]?        # optional separator, dash, space, or period
\d{3}          # 3-digit prefix
[-\s.]         # separator: dash, space or period
\d{4}          # final 4-digits
'''
phones = ['123-456-7890', '123 456 7890', '(123) 456-7890',
          '123.456,7890', '123-4567', 'abc-dfg-7789']
valid = [ph for ph in phones if re.match(pattern, ph, re.VERBOSE)]
print('VERBOSE: Valid phones: {0}'.format(valid))

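String operations: re.sub and re.split

Two functions process a string based on where the pattern matches:

newstr = re.sub(pattern, replacement, sourcestring) replaces the text matched by the pattern
re.split(pattern, sourcestring) splits the string, using the text matched by the pattern as the separator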
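Both functions can be sketched on a few invented strings. The names and sample data below are assumptions for illustration only.

```python
import re

# re.split: any run of commas, semicolons, or whitespace is one separator.
names = re.split(r"[,;\s]+", "Frodo,  Sam;Merry   Pippin")

# re.sub: collapse runs of whitespace into a single space.
collapsed = re.sub(r"\s+", " ", "too    many   spaces")

# count limits how many replacements are made.
masked = re.sub(r"\d", "#", "call 555-1234", count=3)

# The replacement may refer back to captured groups with \1, \2, ...
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Bilbo Baggins")

print(names, collapsed, masked, swapped, sep="\n")
```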

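re.compile for reusable patterns

If a pattern is used repeatedly, it is more convenient to compile it once with re.compile(pattern) and reuse the resulting pattern object:

pattern_obj = re.compile(pattern, re.VERBOSE)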
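A minimal sketch of reusing a compiled pattern follows. The simplified phone pattern, the name phone_re, and the candidate strings are assumptions made for this illustration.

```python
import re

# Compile once; flags are fixed at compile time.
phone_re = re.compile(r"""
    \d{3}     # 3-digit prefix
    [-\s.]    # separator: dash, whitespace, or period
    \d{4}     # final 4 digits
    """, re.VERBOSE)

candidates = ["456-7890", "456 7890", "4567890", "456.7890 ext 2"]

# The pattern object offers the same methods as the re module functions.
valid = [c for c in candidates if phone_re.match(c)]
found = phone_re.findall("call 456-7890 or 123.4567")

print(valid)
print(found)
```

Note that match() only anchors the start, so "456.7890 ext 2" is accepted; use fullmatch() on the pattern object if trailing text should be rejected.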

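String methods vs. regex matching

regex.search() plays a role similar to str.index() or str.find(): each locates a substring, but search() takes a pattern and returns a MatchObject or None instead of an index.

regex.sub() is the pattern-based counterpart of str.replace().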
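The correspondence, and where it ends, can be sketched like this (the sample sentence is invented):

```python
import re

text = "Gandalf the Grey met Gandalf the White"  # made-up sample

# Locating a fixed substring: both approaches report index 12.
idx_find = text.find("Grey")
idx_search = re.search("Grey", text).start()

# A missing substring: find() gives -1, search() gives None,
# and index() would raise ValueError.
missing_find = text.find("Sauron")
missing_search = re.search("Sauron", text)

# Replacing a fixed substring gives the same result either way.
same_result = text.replace("Gandalf", "Mithrandir") == re.sub("Gandalf", "Mithrandir", text)

# Only the regex version can replace by pattern.
by_pattern = re.sub(r"Grey|White", "Wise", text)

print(idx_find, idx_search, missing_find, missing_search, same_result)
print(by_pattern)
```

For fixed substrings the string methods are usually simpler; the regex functions become worthwhile once the target is a pattern rather than literal text.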

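Email matching example

'\w+@\w+\.[a-z]{3}': \w matches only letters, digits, and underscores, so a . in the user part of the address is not matched.

We can use \S to match a non-whitespace character:
'\w+\S\w+@\w+\.[a-z]{3}'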
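The behaviour of the two patterns can be compared in a short sketch. The addresses are invented, and the dot before the suffix is written as \. here so that it matches only a literal period.

```python
import re

text = "Contact peter.zhang@gmail.com or bob@yahoo.com today"  # made-up sample

# \w+ stops at the dot, so the user part is cut to 'zhang'.
simple = re.findall(r"\w+@\w+\.[a-z]{3}", text)

# \S lets one non-whitespace character (here the dot) sit inside the user part.
with_nonspace = re.findall(r"\w+\S\w+@\w+\.[a-z]{3}", text)

# Limitation: \w+\S\w+ needs at least three characters before the @.
too_short = re.findall(r"\w+\S\w+@\w+\.[a-z]{3}", "a@b.com")

print(simple)
print(with_nonspace)
print(too_short)
```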

Basic regular expression syntax in practice

Simple character matching

import re

lor = open('../Python_data_wrangling/Python_data_wrangling_data_raw/data_raw/LordOfTheRings.txt',
           encoding='utf-8').read()

# Count literal occurrences of each name in the text
frodo = re.compile('Frodo')
frodos = frodo.findall(lor)

gandalf = re.compile('Gandalf')
gandalfs = gandalf.findall(lor)

sauron = re.compile('Sauron')
saurons = sauron.findall(lor)

gollum = re.compile('Gollum')
gollums = gollum.findall(lor)

print("Frodo is mentioned {} times\nGandalf is mentioned {} times\nSauron is mentioned: {} times\nGollum is mentioned {} times".format(
    len(frodos), len(gandalfs), len(saurons), len(gollums)))
print("So who is the Lord of the Rings?")

Output:

Frodo is mentioned 1100 times
Gandalf is mentioned 466 times
Sauron is mentioned: 60 times
Gollum is mentioned 72 times
So who is the Lord of the Rings?

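Characters with special meaning

For example: ^ $ * + ? { } [ ] \ | ( )
To match one of these characters literally, you cannot use it as-is; put the escape character \ in front of it.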
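A small sketch of what goes wrong without escaping, and of re.escape(), which adds the backslashes automatically. The sample strings are invented.

```python
import re

price = "Total: $5.00 (approx.)"  # made-up sample string

# Unescaped, $ means end-of-string, so nothing can follow it.
unescaped = re.search(r"$5.00", price)

# Escaped, $ and . are matched literally.
escaped = re.search(r"\$5\.00", price)

# An unescaped . is a wildcard and accepts any character.
loose = re.search(r"5.00", "5x00")

# re.escape() escapes every special character in a literal string.
literal = re.escape("(approx.)")
found = re.search(literal, price)

print(unescaped, escaped.group(), loose is not None, literal, found.group())
```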

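[ ] Square brackets

If the built-in character classes are not enough, you can define your own with square brackets. A hyphen denotes a range,
e.g. "[a-m]" matches any single lowercase letter from a to m.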
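Custom character sets can be sketched as follows, with invented sample words:

```python
import re

# A range: single lowercase letters from a through m.
first_half = re.findall(r"[a-m]", "hobbit")

# An explicit set of characters.
vowels = re.findall(r"[aeiou]", "hobbit")

# A leading ^ negates the set.
not_vowels = re.findall(r"[^aeiou]", "ring")

# A hyphen placed last in the set is a literal hyphen, not a range.
with_hyphen = re.findall(r"[a-c-]", "a-b-z")

print(first_half, vowels, not_vowels, with_hyphen)
```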

Wildcards match repeated characters

To match 5 word characters, use "\w\w\w\w\w" or "\w{5}".

Character	Description	Example
?	Match 0 or 1 repetitions of preceding	"ab?" matches "a" or "ab"
*	Match 0 or more repetitions of preceding	"ab*" matches "a", "ab", "abb", "abbb"...
+	Match 1 or more repetitions of preceding	"ab+" matches "ab", "abb", "abbb"... but not "a"
{n}	Match n repetitions of preceding	"ab{2}" matches "abb"
{m,n}	Match between m and n repetitions of preceding	"ab{2,3}" matches "abb" or "abbb"

For example:
[\w.]+ means \w or . occurring one or more times, so it matches a run of letters, digits, underscores, or . of any length.
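The rows of the table can be checked with fullmatch(), which requires the whole string to fit the pattern. The helper full_matches and the sample list are hypothetical names introduced for this sketch.

```python
import re

samples = ["a", "ab", "abb", "abbb"]

def full_matches(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

optional = full_matches(r"ab?")     # 0 or 1 'b'
star = full_matches(r"ab*")         # 0 or more 'b'
plus = full_matches(r"ab+")         # 1 or more 'b'
exact = full_matches(r"ab{2}")      # exactly 2 'b'
ranged = full_matches(r"ab{2,3}")   # 2 to 3 'b'

# \w{5} is shorthand for five \w in a row.
five = re.fullmatch(r"\w{5}", "Frodo") is not None

# A quantifier applies to a whole character set as well.
dotted = re.findall(r"[\w.]+", "peter.zhang and j.r.r")

print(optional, star, plus, exact, ranged, five, dotted, sep="\n")
```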

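Naming extracted components

Use (?P<name>...) to give a group a name; groupdict() then returns the named groups as a dictionary.

Code:

import re
email4 = re.compile(r'(?P<user>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('peter.zhang@gmail.com')
print(match.groupdict())
# {'user': 'peter.zhang', 'domain': 'gmail', 'suffix': 'com'}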
