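Chapter 8: Text Data

1. The str object
2. Regular expression basics
3. Five kinds of text operations
- 3.1 Splitting
- 3.2 Joining
- 3.3 Matching
- 3.4 Replacing
- 3.5 Extracting
4. Common string functions
- 4.1 Letter-case functions
- 4.2 Numeric functions
- 4.3 Statistical functions
- 4.4 Formatting functions
5. Exercises
- Ex1: Housing information dataset
- Ex2: Game of Thrones script dataset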

1. The str object

import numpy as np
import pandas as pd

var = 'abcd'
str.upper(var)

'ABCD'

s = pd.Series(['0abcd', 'efg', 'hi'])
s.str.upper()

0    0ABCD
1      EFG
2       HI
dtype: object

var[-1:0:-2]

'db'

s.str[-1:0:-2]  # applies var[-1:0:-2] to each string in turn

0    db
1     g
2     i
dtype: object

s.str[2]  # returns NaN where the index is out of range

0      b
1      g
2    NaN
dtype: object
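
As a rough cross-check of the slicing and indexing shown above, the sketch below compares `.str[...]` with a plain Python loop over the same strings. The helper name `safe_get` is hypothetical and exists only for this illustration.

```python
import pandas as pd

s = pd.Series(['0abcd', 'efg', 'hi'])

# .str[slice] applies the same slice to every element
sliced = s.str[-1:0:-2].tolist()
expected_sliced = [x[-1:0:-2] for x in s]

def safe_get(text, i):
    # Hypothetical helper: mimic .str[i], which yields a missing value
    # instead of raising IndexError when i is out of range
    return text[i] if -len(text) <= i < len(text) else None

picked = s.str[2]
expected_picked = [safe_get(x, 2) for x in s]
```

The point is only that `.str` behaves like element-wise indexing with missing values in place of errors.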

# The string dtype
s = pd.Series([{1:'temp_1',2:'temp_2'}, ['a','b'],0.5,'my_string'])
s

0    {1: 'temp_1', 2: 'temp_2'}
1                        [a, b]
2                           0.5
3                     my_string
dtype: object

s.str[1]

0    temp_1
1         b
2       NaN
3         y
dtype: object

s.astype('string').str[1]

0    1
1    '
2    .
3    y
dtype: string

s.astype('string')

0    {1: 'temp_1', 2: 'temp_2'}
1                    ['a', 'b']
2                           0.5
3                     my_string
dtype: string
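
One way to read the two outputs above: converting to the string dtype appears to turn every element into its text form first, so `.str[1]` then indexes into that text rather than into the original dict or list. A minimal sketch that redoes the conversion by hand with Python's built-in `str()` (the variable names are my own):

```python
import pandas as pd

values = [{1: 'temp_1', 2: 'temp_2'}, ['a', 'b'], 0.5, 'my_string']
s = pd.Series(values)

# string dtype: each element is first turned into its text form
as_text = s.astype('string').tolist()
second_chars = s.astype('string').str[1].tolist()

# The same thing done by hand with Python's str()
by_hand_text = [str(v) for v in values]
by_hand_chars = [str(v)[1] for v in values]
```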

2. Regular expression basics

import re
re.findall('Apple','Apple! This Is an Apple!')

['Apple', 'Apple']

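# 'abc' ---> 'a', 'b', 'c'
re.findall(r'.', 'abc')  # . matches any character except a newline
# 'abc' ---> 'a', 'c'
re.findall(r'[ac]', 'abc')  # [ ] character class: matches any character listed in the brackets
# 'abc' ---> 'b'
re.findall(r'[^ac]', 'abc')  # [^ ] negated character class: matches any character not listed in the brackets
# 'aaaabbbb' ---> 'aa', 'aa', 'bb', 'bb'
re.findall(r'[ab]{2}', 'aaaabbbb')  # {n} repeats the preceding item exactly n times; {n,m} at least n and at most m times
# 'aaaabbbb' ---> 'aaa', 'bbb'
# re.findall('[ab]{3}', 'aaaabbbb') would give ['aaa', 'abb'] instead
re.findall(r'aaa|bbb', 'aaaabbbb')  # | matches either the pattern before it or the pattern after it
# 'aa?a*a' ---> 'a?', 'a*'
re.findall(r'a\?|a\*', 'aa?a*a')  # \ escapes a metacharacter so that it is matched literally
# 'abaacadaae' ---> 'ab', 'aa', 'c', 'ad', 'aa', 'e'
re.findall(r'a?.', 'abaacadaae')  # ? matches the preceding item 0 or 1 times; . matches any character except a newline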

['ab', 'aa', 'c', 'ad', 'aa', 'e']

# 'Apple! This Is an Apple!' ---> 'is', 'Is'
re.findall(r'.s', 'Apple! This Is an Apple!')  # . matches any character except a newline
# '09 8? 7w c_ 9q p@' ---> '09', '7w', 'c_', '9q'
re.findall(r'\w{2}', '09 8? 7w c_ 9q p@')  # \w matches any letter, digit, or underscore; {2} repeats it exactly twice
# '09 8? 7w c_ 9q p@' ---> '8?', 'p@'
re.findall(r'\w\W\B', '09 8? 7w c_ 9q p@')  # \W matches any character that \w does not; \B matches a position that is not a word boundary and consumes no characters
# 'Constant dropping wears the stone.' ---> 't d', 'g w', 's t', 'e s'
re.findall(r'\w\s\w', 'Constant dropping wears the stone.')  # \s matches a whitespace character
re.findall(r'.\s.', 'Constant dropping wears the stone.')
# '上海市黄浦区方浜中路249号 上海市宝山区密山路5号' ---> ('黄浦区', '方浜中路', '249号'), ('宝山区', '密山路', '5号')
re.findall(r'上海市(.{2,3}区)(.{2,3}路)(\d+号)', '上海市黄浦区方浜中路249号 上海市宝山区密山路5号')  # \d matches a digit; + matches the preceding item one or more times

[('黄浦区', '方浜中路', '249号'), ('宝山区', '密山路', '5号')]
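
Of the metacharacters above, `\B` is probably the least intuitive, so here is a small side-by-side with its counterpart `\b` on the same input. This is only a sketch using the standard `re` module; the variable names are my own.

```python
import re

text = '09 8? 7w c_ 9q p@'

# \B: the position after the two matched characters is NOT a word boundary
not_boundary = re.findall(r'\w\W\B', text)

# \b: the position after the two matched characters IS a word boundary
boundary = re.findall(r'\w\W\b', text)
```

With `\B` only '8?' and 'p@' survive, because what follows them is a space or the end of the string rather than the start of a new word. With `\b` the result is the complementary set of pairs that are followed by a word character.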

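3. Five kinds of text operations

3.1 Splitting

# Parameters: the regular expression, the maximum number of splits n, and expand (whether to spread the pieces across columns)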
s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])
s.str.split('[市区路]')

0    [上海, 黄浦, 方浜中, 249号]
1       [上海, 宝山, 密山, 5号]
dtype: object

s.str.split('[市区路]', n=2, expand=True)

	0	1	2
0	上海	黄浦	方浜中路249号
1	上海	宝山	密山路5号
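
To make the role of `n` concrete, the sketch below compares the pandas call with `re.split` from the standard library, where `maxsplit` seems to play the same role. It reuses the two addresses from above and is a comparison on this one input, not a statement about how pandas is implemented.

```python
import re
import pandas as pd

s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号'])

# pandas: at most two splits per string, pieces spread across columns
pandas_pieces = s.str.split('[市区路]', n=2, expand=True).values.tolist()

# standard library: maxsplit limits the number of splits in the same way
re_pieces = [re.split('[市区路]', x, maxsplit=2) for x in s]
```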

3.2 Joining

s = pd.Series([['a','b'], [1, 'a'], [['a', 'b'], 'c']])
s

0         [a, b]
1         [1, a]
2    [[a, b], c]
dtype: object

s.str.join('-')  # joins each list of strings in the Series with the separator -; lists containing non-string elements become NaN

0    a-b
1    NaN
2    NaN
dtype: object

s1 = pd.Series(['a','b'])
s2 = pd.Series(['cat','dog'])
s1.str.cat(s2, sep='-')  # sep is the separator

0    a-cat
1    b-dog
dtype: object

s2.index = [1,2]
s1.str.cat(s2, sep='-', na_rep='?', join='outer')  # na_rep is the placeholder for missing values; join is how the two indexes are aligned

0      a-?
1    b-cat
2    ?-dog
dtype: object
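
The two joining methods are easy to mix up, so here is a short sketch of the difference in direction: `str.join` works inside each row, while `str.cat` called without a second Series collapses the whole Series into a single string. The sample values are made up for the illustration.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Without a second Series, str.cat collapses the whole Series into one string
joined_all = s.str.cat(sep='-')

# str.join works inside each element instead, one list per row
joined_rows = pd.Series([['a', 'b'], ['c', 'd', 'e']]).str.join('-').tolist()
```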

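3.3 Matching

# str.contains returns a boolean Series indicating whether each string contains the regular-expression pattern
s = pd.Series(['my cat', 'he is fat', 'railway station'])
s.str.contains(r'\s\wat')  # \s matches whitespace; \w matches any letter, digit, or underscore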

0     True
1     True
2    False
dtype: bool

# str.startswith and str.endswith return boolean Series indicating whether each string starts or ends with the given pattern
# Neither supports regular expressions
s.str.startswith('my')

0     True
1    False
2    False
dtype: bool

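# To test the start or end of a string against a regular expression, use str.match (for the end, reverse the string first)
s.str.match('m|h')  # matches strings that start with m or h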

0     True
1     True
2    False
dtype: bool

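s.str[::-1]  # reverse each string
# 0             tac ym
# 1          taf si eh
# 2    noitats yawliar
# dtype: object
s.str[::-1].str.match('ta[fg]|n')  # on the reversed strings: the original ends in fat or gat, or in n (inside [ ] a | would be a literal character, so [fg] rather than [f|g])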

0    False
1     True
2     True
dtype: bool

s.str.contains('^[mh]')
# Note: do not confuse this with [^ ], which matches any character not listed in the brackets
# ^ on its own matches the start of the string

0     True
1     True
2    False
dtype: bool

s.str.contains('[fg]at|n$')  # $ matches the end of the string

0    False
1     True
2     True
dtype: bool

# str.find returns the index of the first occurrence, searching from left to right
s = pd.Series(['This is an apple. That is not an apple.'])
s.str.find('apple')  # indices start at 0, and spaces count as characters

0    11
dtype: int64

# str.rfind returns the index of the first occurrence, searching from right to left (that is, the last occurrence)
s.str.rfind('apple')

0    33
dtype: int64
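
The matching methods in this section differ mainly in where the pattern is allowed to sit. The sketch below runs one pattern through `str.contains`, `str.match`, and `str.fullmatch` (a related method not used above) to show the difference. The pattern and variable names are my own choices.

```python
import pandas as pd

s = pd.Series(['my cat', 'he is fat', 'railway station'])
pat = r'\w+ \w+'  # two words separated by one space

anywhere = s.str.contains(pat).tolist()    # pattern may start anywhere
from_start = s.str.match(pat).tolist()     # pattern must start at position 0
whole = s.str.fullmatch(pat).tolist()      # pattern must cover the entire string
anchored = s.str.contains('^' + pat + '$').tolist()  # contains with explicit anchors
```

Only 'he is fat' fails the whole-string test, since it has three words. The explicitly anchored `str.contains` call gives the same answer as `str.fullmatch`.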

3.4 Replacing

str.replace and replace are not the same function; for substitution inside strings, use the former.

s = pd.Series(['a_1_b','c_?'])
s.str.replace(r'\d|\?', 'new', regex=True)  # \d matches a digit; \ escapes the ?

0    a_new_b
1      c_new
dtype: object
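
The remark at the top of this section about `str.replace` versus `replace` is easier to see on a small example. In this sketch (sample data made up for the purpose), `Series.replace` with a plain string only touches elements that equal the target as a whole, while `Series.str.replace` substitutes inside each string.

```python
import pandas as pd

s = pd.Series(['a_1_b', 'c_?', '1'])

# Series.replace compares whole values: only the element equal to '1' changes
whole_value = s.replace('1', 'new').tolist()

# Series.str.replace substitutes inside each string
substring = s.str.replace('1', 'new', regex=False).tolist()
```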

# Replace different parts in different ways
s = pd.Series(['上海市黄浦区方浜中路249号', '上海市宝山区密山路5号', '北京市昌平区北农路2号'])
pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'  # \w matches any letter, digit, or underscore; \d matches a digit
city = {'上海市':'Shanghai', '北京市':'Beijing'}
district = {'昌平区': 'CP District',  '黄浦区': 'HP District', '宝山区': 'BS District'}
road = {'方浜中路': 'Mid Fangbin Road', '密山路': 'Mishan Road', '北农路': 'Beinong Road'}

def my_func(m):
    str_city = city[m.group(1)]  # group(k) is the k-th captured subgroup
    str_district = district[m.group(2)]
    str_road = road[m.group(3)]
    str_no = 'No. ' + m.group(4)[:-1]
    return ' '.join([str_city, str_district, str_road, str_no])

s.str.replace(pat, my_func, regex=True)

0    Shanghai HP District Mid Fangbin Road No. 249
1           Shanghai BS District Mishan Road No. 5
2           Beijing CP District Beinong Road No. 2
dtype: object

# Named subgroups make the meaning of each group explicit
# (?P<value>...) defines a group named value
pat = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<编号>\d+号)'

def my_func(m):
    str_city = city[m.group('市名')]
    str_district = district[m.group('区名')]
    str_road = road[m.group('路名')]
    str_no = 'No. ' + m.group('编号')[:-1]
    return ' '.join([str_city, str_district, str_road, str_no])

s.str.replace(pat, my_func, regex=True)

0    Shanghai HP District Mid Fangbin Road No. 249
1           Shanghai BS District Mishan Road No. 5
2           Beijing CP District Beinong Road No. 2
dtype: object

3.5 Extracting

pat = r'(\w+市)(\w+区)(\w+路)(\d+号)'
s.str.extract(pat)

	0	1	2	3
0	上海市	黄浦区	方浜中路	249号
1	上海市	宝山区	密山路	5号
2	北京市	昌平区	北农路	2号

pat = r'(?P<市名>\w+市)(?P<区名>\w+区)(?P<路名>\w+路)(?P<编号>\d+号)'
s.str.extract(pat)

	市名	区名	路名	编号
0	上海市	黄浦区	方浜中路	249号
1	上海市	宝山区	密山路	5号
2	北京市	昌平区	北农路	2号

# str.extractall returns every match of the pattern, not just the first
# The matches are stored under a multi-level index
s = pd.Series(['A135T15,A26S5','B674S2,B25T6'], index = ['my_A','my_B'])
s

my_A    A135T15,A26S5
my_B     B674S2,B25T6
dtype: object

pat = r'[AB](\d+)[TS](\d+)'  # inside [ ] a | would be a literal character, so [AB] rather than [A|B]
s.str.extractall(pat)

		0	1
	match
my_A	0	135	15
my_A	1	26	5
my_B	0	674	2
my_B	1	25	6

pat_with_name = r'[AB](?P<name1>\d+)[TS](?P<name2>\d+)'
s.str.extractall(pat_with_name)

		name1	name2
	match
my_A	0	135	15
my_A	1	26	5
my_B	0	674	2
my_B	1	25	6

s.str.findall(pat)

my_A    [(135, 15), (26, 5)]
my_B     [(674, 2), (25, 6)]
dtype: object
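
To see the difference between `str.extract` and `str.extractall` directly, the sketch below runs both on the same Series. It reuses the data from above with the pattern written as `[AB]` and `[TS]`; the variable names are my own.

```python
import pandas as pd

s = pd.Series(['A135T15,A26S5', 'B674S2,B25T6'], index=['my_A', 'my_B'])
pat = r'[AB](\d+)[TS](\d+)'

first_only = s.str.extract(pat)      # one row per string: the first match
every_match = s.str.extractall(pat)  # one row per match, indexed by (label, match)

first_row = first_only.loc['my_A'].tolist()
second_match = every_match.loc[('my_A', 1)].tolist()
```

`str.extract` keeps the original index and drops the second match in each string. `str.extractall` adds a `match` level, so the second match of `my_A` is reachable as `('my_A', 1)`.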

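4. Common string functions

4.1 Letter-case functions

s = pd.Series(['lower', 'CAPITALS', 'this is a sentence', 'SwApCaSe'])
# All uppercase
s.str.upper()
# All lowercase
s.str.lower()
# Title case
s.str.title()
# Capitalize the first letter only
s.str.capitalize()
# Swap upper and lower case
s.str.swapcase()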

0                 LOWER
1              capitals
2    THIS IS A SENTENCE
3              sWaPcAsE
dtype: object

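4.2 Numeric functions

pd.to_numeric quickly converts and filters numeric values stored as strings.

errors: how non-numeric values are handled. raise raises an error, coerce sets them to missing, ignore keeps the original strings (ignore is deprecated in recent pandas releases)
downcast: the smallest numeric dtype to which the converted result is downcast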

s = pd.Series(['1', '2.2', '2e', '??', '-2.1', '0'])
pd.to_numeric(s, errors='ignore')

0       1
1     2.2
2      2e
3      ??
4    -2.1
5       0
dtype: object

pd.to_numeric(s, errors='coerce')

0    1.0
1    2.2
2    NaN
3    NaN
4   -2.1
5    0.0
dtype: float64

# Quickly inspect the rows that are not numeric
s[pd.to_numeric(s, errors='coerce').isna()]

2    2e
3    ??
dtype: object
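
The `downcast` parameter is described above but not demonstrated, so here is a minimal sketch of what it does to the result dtype. The sample values are made up, and the dtypes noted in the comments are what I would expect for values this small.

```python
import pandas as pd

s = pd.Series(['1', '2', '3'])

default_ints = pd.to_numeric(s)                    # 64-bit integers by default
small_ints = pd.to_numeric(s, downcast='integer')  # smallest signed integer dtype that fits
small_floats = pd.to_numeric(pd.Series(['1.5', '2']), downcast='float')  # float32 if the values allow it
```

Downcasting only changes how the numbers are stored. It has no effect on which strings count as numeric; that is still governed by `errors`.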

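4.3 Statistical functions

count returns the number of times a regular-expression pattern occurs in each string
len returns the length of each string

s = pd.Series(['cat rat fat at', 'get feed sheet heat'])

s.str.count('[rf]at|ee')  # the top-level | separates the whole alternative on its left from the whole alternative on its right; inside [ ] a | would be a literal character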

0    2
1    2
dtype: int64

s.str.len()

0    14
1    19
dtype: int64
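
Since `str.count` takes a regular expression, it can also serve as a word counter. The sketch below counts words in the same two strings in two ways, once with `str.count` and once by splitting on whitespace and measuring the resulting lists. It is a consistency check on this input, not a recommendation of one over the other.

```python
import pandas as pd

s = pd.Series(['cat rat fat at', 'get feed sheet heat'])

# Count runs of non-whitespace characters
words_by_count = s.str.count(r'\S+').tolist()

# Split on whitespace, then take the length of each resulting list
words_by_split = s.str.split().str.len().tolist()
```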

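4.4 Formatting functions

Stripping: strip removes whitespace on both sides, rstrip on the right only, lstrip on the left only
Padding: pad

my_index = pd.Index([' col1', 'col2 ', ' col3 '])
# Strip both sides
my_index.str.strip().str.len()
# Strip the right side
my_index.str.rstrip().str.len()
# Strip the left side
my_index.str.lstrip().str.len()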

Int64Index([4, 5, 5], dtype='int64')

s = pd.Series(['a','b','c'])
# Pad on the left
s.str.pad(5, 'left', '*')
# Pad on the right
s.str.pad(5, 'right', '*')
# Pad on both sides
s.str.pad(5, 'both', '*')

0    **a**
1    **b**
2    **c**
dtype: object

# Three ways to left-pad numbers with zeros
s = pd.Series([7, 155, 303000]).astype('string')
# Option 1
s.str.pad(5, 'left', '0')  # strings already longer than the width are left unchanged
s.str.pad(6, 'left', '0')
# Option 2
s.str.rjust(6,'0')
# Option 3
s.str.zfill(6)

0    000007
1    000155
2    303000
dtype: string
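
The padding methods overlap quite a bit. As a sketch of how they relate on a simple input, each `str.pad` call below is compared with the justification method that I would expect to behave the same way (`rjust`, `ljust`, `center`). The equalities are checked on this sample only.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Padding on the left pushes the text to the right, and so on
left_same = s.str.pad(5, 'left', '*').tolist() == s.str.rjust(5, '*').tolist()
right_same = s.str.pad(5, 'right', '*').tolist() == s.str.ljust(5, '*').tolist()
both_same = s.str.pad(5, 'both', '*').tolist() == s.str.center(5, '*').tolist()

left_padded = s.str.rjust(5, '*').tolist()
```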

5. Exercises

Ex1: Housing information dataset

My answer

df = pd.read_excel('data/house_info.xls', usecols=['floor','year','area','price'])
df.head()

	floor	year	area	price
0	高层（共6层）	1986年建	58.23㎡	155万
1	中层（共20层）	2020年建	88㎡	155万
2	低层（共28层）	2010年建	89.33㎡	365万
3	低层（共20层）	2014年建	82㎡	308万
4	高层（共1层）	2015年建	98㎡	117万

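# Store the year column as integer years
df['year'] = pd.to_numeric(df['year'].str.replace('年建', ''), errors='coerce').astype('Int64')

# Replace the floor column with two columns, Level and Highest: Level holds the floor category as a string (高层, 中层, 低层) and Highest holds the total number of floors as an integer
df[['Level','Highest']] = df['floor'].str.split('（共', expand=True)
df['Highest'] = pd.to_numeric(df['Highest'].str.replace('层）', ''), errors='coerce').astype('Int64')
df = df.drop(columns='floor')
df.head()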

	year	area	price	Level	Highest
0	1986	58.23㎡	155万	高层	6
1	2020	88㎡	155万	中层	20
2	2010	89.33㎡	365万	低层	28
3	2014	82㎡	308万	低层	20
4	2015	98㎡	117万	高层	1

# Compute the average price per square metre, avg_price, and store it in the table in the format ***元/平米, where *** is an integer
df['price'] = pd.to_numeric(df['price'].str.replace('万',''),errors='coerce')
df['price'] = df['price']*10000
area_num = pd.to_numeric(df['area'].str.replace('㎡',''),errors='coerce')
df['avg_price'] = df['price']/area_num
df['avg_price'] = df['avg_price'].apply(lambda x:str(int(x))+'元/平米')
df

	year	area	price	Level	Highest	avg_price
0	1986	58.23㎡	1550000.0	高层	6	26618元/平米
1	2020	88㎡	1550000.0	中层	20	17613元/平米
2	2010	89.33㎡	3650000.0	低层	28	40859元/平米
3	2014	82㎡	3080000.0	低层	20	37560元/平米
4	2015	98㎡	1170000.0	高层	1	11938元/平米
...	...	...	...	...	...	...
31563	2010	391.13㎡	100000000.0	中层	39	255669元/平米
31564	2006	283㎡	26000000.0	高层	54	91872元/平米
31565	2011	245㎡	25000000.0	高层	16	102040元/平米
31566	2006	284㎡	35000000.0	高层	62	123239元/平米
31567	2008	224㎡	23000000.0	低层	22	102678元/平米

31568 rows × 6 columns

Reference answer

# Run on a freshly loaded df, since my answer above has already modified these columns
# Question 1
df.year = pd.to_numeric(df.year.str[:-2]).astype('Int64')
# Question 2
# I was not familiar enough with the extract approach
pat = r'(\w层)（共(\d+)层）'
new_cols = df.floor.str.extract(pat).rename(columns={0:'Level', 1:'Highest'})
df = pd.concat([df.drop(columns=['floor']), new_cols], axis=1)
# Question 3
s_area = pd.to_numeric(df.area.str[:-1])  # much more concise this way
s_price = pd.to_numeric(df.price.str[:-1])
df['avg_price'] = ((s_price/s_area)*10000).astype('int').astype('string') + '元/平米'
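
For readers without the Excel file at hand, the three steps can be tried on a tiny frame. The sketch below is a hypothetical miniature typed in from the first two rows of the `df.head()` output above. It follows the reference approach but also converts Highest to an integer, which is my own addition.

```python
import pandas as pd

# Hypothetical miniature of the data, typed in from the first rows shown above
df = pd.DataFrame({
    'floor': ['高层（共6层）', '中层（共20层）'],
    'year': ['1986年建', '2020年建'],
    'area': ['58.23㎡', '88㎡'],
    'price': ['155万', '155万'],
})

# Question 1: drop the two-character suffix and convert to a nullable integer
df['year'] = pd.to_numeric(df['year'].str[:-2]).astype('Int64')

# Question 2: capture the floor category and the total number of floors
parts = df['floor'].str.extract(r'(\w层)（共(\d+)层）')
df['Level'] = parts[0]
df['Highest'] = pd.to_numeric(parts[1]).astype('Int64')
df = df.drop(columns='floor')

# Question 3: price is in units of 10,000 yuan, area in square metres
s_area = pd.to_numeric(df['area'].str[:-1])
s_price = pd.to_numeric(df['price'].str[:-1])
df['avg_price'] = (s_price / s_area * 10000).astype('int').astype('string') + '元/平米'
```

The resulting avg_price values agree with the first two rows of the full output shown earlier.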

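Ex2: Game of Thrones script dataset

Reference answer
I really botched this one…

I assumed that the line count in questions 1 and 2 meant counting the sentences inside each row of Sentence and then summing them to get the total number of lines. I overthought it.
Question 3 was not affected.

# Question 1
df = pd.read_csv('data/script.csv')
df.columns = df.columns.str.strip()
df.groupby(['Season', 'Episode'])['Sentence'].count().head()
# Question 2
df.set_index('Name').Sentence.str.split().str.len().groupby('Name').mean().sort_values(ascending=False).head()
# Question 3
s = pd.Series(df.Sentence.values, index=df.Name.shift(-1))  # so neat and concise; my own version was far clumsier
s.str.count(r'\?').groupby('Name').sum().sort_values(ascending=False).head()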
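
The `shift(-1)` trick in question 3 took me a moment to digest, so here is a sketch of it on a four-line dialogue invented for the illustration. Each sentence is labelled with the name of the next speaker, so summing question marks per label gives the number of questions each person answered. I group with `level=0` here instead of by the index name, which should be equivalent for this purpose.

```python
import pandas as pd

# Hypothetical four-line dialogue, invented for this illustration
df = pd.DataFrame({
    'Name': ['a', 'b', 'a', 'c'],
    'Sentence': ['How? Why?', 'Because.', 'Really?', 'Yes.'],
})

# shift(-1) labels each line with the NEXT speaker, i.e. the answerer
s = pd.Series(df['Sentence'].values, index=df['Name'].shift(-1))

# Question marks per line, summed per answerer; the last line has no
# next speaker (missing label) and is dropped by groupby
answers = s.str.count(r'\?').groupby(level=0).sum().to_dict()
```

Here `b` is credited with the two questions asked by `a` in the first line, and `c` with the one in the third line.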

My answer

df = pd.read_csv('data/script.csv')
df.head(3)

	Release Date	Season	Episode	Episode Title	Name	Sentence
0	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	What do you expect? They're savages. One lot s...
1	2011-04-17	Season 1	Episode 1	Winter is Coming	will	I've never seen wildlings do a thing like this...
2	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	How close did you get?

df['Sentence'] = df['Sentence'].str.capitalize()
df[df['Name']=='slave buyer']  # this character's line starts with a digit...

	Release Date	Season	Episode	Episode Title	Name	Sentence	count	word_count
16187	2015-05-24	Season 5	Episode 7	The Gift	slave buyer	20.	0	1

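# Count the number of lines of dialogue in each Episode
df['count'] = df['Sentence'].str.count('[A-Z]')  # count the uppercase letters
df.columns
res = df.groupby('Episode ')['count'].count().to_frame()
# No wonder it raised an error... the column name Episode has a trailing space
res.head()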

	count
Episode
Episode 1	2637
Episode 10	1846
Episode 2	2957
Episode 3	2648
Episode 4	2356

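# Using spaces as word separators, find the five people with the highest average number of words per line
# Group by Name: total word count / total number of lines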
df.head()

	Release Date	Season	Episode	Episode Title	Name	Sentence	count	word_count
0	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	What do you expect? they're savages. one lot s...	1	25
1	2011-04-17	Season 1	Episode 1	Winter is Coming	will	I've never seen wildlings do a thing like this...	1	21
2	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	How close did you get?	1	5
3	2011-04-17	Season 1	Episode 1	Winter is Coming	will	Close as any man would.	1	5
4	2011-04-17	Season 1	Episode 1	Winter is Coming	gared	We should head back to the wall.	1	7

# Count the words: every word is followed by a space except the last word of the line, so add 1
df['word_count'] = df['Sentence'].str.count(r'\s') + 1

df.head()

	Release Date	Season	Episode	Episode Title	Name	Sentence	count	word_count
0	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	What do you expect? They're savages. One lot s...	3	25
1	2011-04-17	Season 1	Episode 1	Winter is Coming	will	I've never seen wildlings do a thing like this...	2	21
2	2011-04-17	Season 1	Episode 1	Winter is Coming	waymar royce	How close did you get?	1	5
3	2011-04-17	Season 1	Episode 1	Winter is Coming	will	Close as any man would.	1	5
4	2011-04-17	Season 1	Episode 1	Winter is Coming	gared	We should head back to the wall.	1	7

res = df.groupby('Name')[['count','word_count']].sum()

res['word_average'] = res['word_count']/res['count']

res.sort_values('word_average',ascending=False)

	count	word_count	word_average
Name
slave buyer	0	1	inf
male singer	1	109	109.0
slave owner	1	77	77.0
lollys stokeworth	1	62	62.0
manderly	1	62	62.0
...	...	...	...
dothraki woman	1	1	1.0
dornish prince	1	1	1.0
cold	1	1	1.0
little sam	1	1	1.0
doloroud edd	1	1	1.0

564 rows × 3 columns

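# If a line contains question marks, the next speaker is the answerer. If the previous line contains n question marks, the answerer is credited with answering n questions. Find the five people who answered the most questions.
answer_count = df['Sentence'].str.count(r'\?').tolist()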
answer_count = [0] + answer_count
df['answer_count']=answer_count[:-1]
res = df.groupby('Name')['answer_count'].sum().to_frame()
res.sort_values('answer_count',ascending=False).head()

	answer_count
Name
tyrion lannister	527
jon snow	374
jaime lannister	283
arya stark	265
cersei lannister	246

Datawhale Team Study (Pandas), Task 8: Text Data
