R for Data Science, Chapter 7: Working with Strings Using stringr

I. String Basics

1. String length: str_length

> library(stringr)
> str_length(c("a","R for data science",NA))
[1]  1 18 NA

2. Combining strings: str_c

> str_c("x","y")
[1] "xy"
> str_c("x","y","z",sep=",") # use the sep argument to control how the strings are separated
[1] "x,y,z"
> str_c("abc",NA) # NA is contagious
[1] NA
> str_c("abc",str_replace_na(NA)) # wrapping the value in str_replace_na() stops the NA from propagating
[1] "abcNA"

str_c is vectorised: shorter vectors (here, the length-1 strings) are recycled to the length of the longest one:

> str_c("Hello ",c("Dancy","Nancy"),"!")
[1] "Hello Dancy!" "Hello Nancy!"

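Zero-length arguments such as NULL are silently dropped, which is particularly useful together with if (an if whose condition is FALSE, with no else branch, returns NULL):

> name<-c("Dancy","Nancy")
> time_of_day<-"morning"
> birthday<-FALSE
> str_c("Good ",time_of_day," ",name,if(birthday)" and happy birthday!",".")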
[1] "Good morning Dancy." "Good morning Nancy."

To collapse a character vector into a single string, use the collapse argument:

> str_c(c("x","y"),collapse = ",")
[1] "x,y"
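As a small sketch of how sep and collapse interact (the vector name parts is made up for illustration): sep is placed between the arguments element by element, and collapse then joins the resulting vector into one string.

```r
library(stringr)

parts <- c("x", "y", "z")  # hypothetical example vector

# sep goes between the arguments; the result keeps one element per input element
by_element <- str_c(parts, "1", sep = "-")
by_element
# [1] "x-1" "y-1" "z-1"

# collapse then glues those elements into a single string
one_string <- str_c(parts, "1", sep = "-", collapse = "+")
one_string
# [1] "x-1+y-1+z-1"
```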

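3. Subsetting strings: str_sub

str_sub(string, start = , end = ) extracts part of each string; start and end give the (inclusive) positions where the substring begins and ends. Negative values count backwards from the end of the string.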

> str_sub(c("apple","pear","orange"),1,3)
[1] "app" "pea" "ora"
> str_sub(c("apple","pear","orange"),-3,-1)
[1] "ple" "ear" "nge"

You can also use the assignment form of str_sub to modify strings:

> x<-c("apple","pear")
> str_sub(x,1,1)<-str_to_upper(str_sub(x,1,1))
> x
[1] "Apple" "Pear"
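One detail worth a quick sketch: str_sub does not fail when a string is shorter than the requested range; it simply returns as much as it can. The variable names below are illustrative only.

```r
library(stringr)

# Asking for positions 1 to 5 of a one-character string returns that one character
short <- str_sub("a", 1, 5)
short
# [1] "a"

# end defaults to -1 (the last character), so a single negative start takes a suffix
last_three <- str_sub("orange", -3)
last_three
# [1] "nge"
```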

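4. Locales

Some functions, such as those that change case, follow rules that differ between languages and regions; use the locale argument to choose which rules apply.

A locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation.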

> str_sort(x,locale="en")
[1] "Apple" "Pear"
> str_sort(x,locale="haw")
[1] "Apple" "Pear"
> str_sort(x,locale="tr")
[1] "Apple" "Pear"
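The vector x above sorts the same way in all three locales, so here is a hedged sketch with inputs where the locale does matter. Turkish has two letters i (dotted and dotless), which changes how case conversion works; the Hawaiian sort is shown without an expected result because collation output can depend on the installed ICU version.

```r
library(stringr)

# "\u0131" is the dotless lowercase i; "\u0130" is the dotted capital I
default_upper <- str_to_upper(c("i", "\u0131"))
turkish_upper <- str_to_upper(c("i", "\u0131"), locale = "tr")

default_upper   # both become a plain "I" under the default English rules
turkish_upper   # under Turkish rules "i" becomes the dotted capital "\u0130"

# In the Hawaiian alphabet vowels come before consonants,
# so the order can differ from the English one
str_sort(c("apple", "eggplant", "banana"), locale = "haw")
```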

II. Pattern Matching with Regular Expressions: str_view / str_view_all

1. Basic matches

The simplest pattern matches an exact string:

> str_view(x,"a")

A step up in complexity is ., which matches any character except a newline:

> str_view(x,".a.")

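But what if you want to match a character that has a special meaning, such as a literal "."? You escape it with a backslash, \. Because regular expressions are written as R strings, and \ is also the escape character inside strings, the regular expression \. has to be written as the string "\\.".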

> str_view(c("abb","a.c","bef"),"a\\.c")

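Inconveniently, matching a single literal \ takes four backslashes, "\\\\":

> str_view("a\\b","\\\\")

The first \ in the string "a\\b" is a string-level escape, so the string actually contains a\b.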
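The two layers of escaping are easier to see by printing the raw contents of the string. The following is a minimal sketch; fixed() is used as an alternative that compares literally and so skips the regex layer of escaping.

```r
library(stringr)

s <- "a\\b"              # typed with two backslashes, stored with one

writeLines(s)            # prints the raw contents: a\b
n_chars <- str_length(s) # 3 characters: a, \, b

# Regex layer: \\ (one escaped backslash), written as "\\\\" in an R string
has_backslash <- str_detect(s, "\\\\")

# fixed() matches the string literally, so only the string-level escape is needed
has_backslash_fixed <- str_detect(s, fixed("\\"))

# An escaped dot matches only a literal "."
dot_only <- str_detect(c("abc", "a.c"), "a\\.c")
dot_only
# [1] FALSE  TRUE
```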

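2. Anchors

By default, a regular expression matches at any position in a string. It is often useful to anchor the pattern so that it matches from the start or the end of the string.

^ matches the start of the string.

$ matches the end of the string.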

> x<-c("apple","banana","orange")
> str_view(x,"^a")

> str_view(x,"a$")

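You can also match a word boundary with \b. For example, to avoid matching summarize, summary, rowsum and so on, search for \bsum\b (written "\\bsum\\b" as an R string).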
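A short sketch comparing the anchors and the word boundary on a made-up vector (candidates is not part of the original notes):

```r
library(stringr)

candidates <- c("sum", "summary", "rowsum", "the sum of squares")

# \b on both sides: "sum" must stand alone as a word
whole_word <- str_detect(candidates, "\\bsum\\b")
whole_word
# [1]  TRUE FALSE FALSE  TRUE

# ^ anchors to the start, $ anchors to the end
starts_with_sum <- str_detect(candidates, "^sum")
starts_with_sum
# [1]  TRUE  TRUE FALSE FALSE

ends_with_sum <- str_detect(candidates, "sum$")
ends_with_sum
# [1]  TRUE FALSE  TRUE FALSE
```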

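5. Character classes and alternatives

\d: matches any digit.

\s: matches any whitespace character.

[abc]: matches a, b, or c.

[^abc]: matches any character except a, b, or c.

Note that in an R string \d and \s must be written as \\d and \\s.

You can also use the alternation operator | to choose between alternatives, for example: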

> str_view(c("grey","gray"),"gr(e|a)y")
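The classes above can be tried out with str_extract and str_detect instead of the viewer. This is a minimal sketch on made-up inputs:

```r
library(stringr)

# \d+ : runs of one or more digits
digits <- str_extract_all("tel: 555 1234", "\\d+")[[1]]
digits
# [1] "555"  "1234"

# Alternation inside a group, and the equivalent character class
spellings <- c("grey", "gray", "groy")
with_group <- str_detect(spellings, "gr(e|a)y")
with_class <- str_detect(spellings, "gr[ea]y")
with_group
# [1]  TRUE  TRUE FALSE

# [^abc] : the first character that is not a, b or c
first_other <- str_extract("cabbage", "[^abc]")
first_other
# [1] "g"
```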

6. Repetition

?: 0 or 1 time.

+: 1 or more times.

*: 0 or more times.

> x<-"1888 is the longest year in Roman numerals:MDCCCLXXXVIII"
> str_view(x,"CC?")

> str_view(x,"CC+")

> str_view(x,"C[LX]+")

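For example, to match both spellings, colour and color, use colou?r (zero or one u).

You can also specify the number of matches exactly:

{n}: exactly n times.

{n,}: n or more times.

{,m}: at most m times (writing it as {0,m} is the safer spelling, since not every regex engine accepts an omitted lower bound).

{n,m}: between n and m times.

By default these matches are greedy: the regular expression matches the longest string possible: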

> str_view(x,"C{2}")

> str_view(x,"C{2,}")

> str_view(x,"C{2,3}")

Adding a ? after the quantifier makes the match "lazy", so that it matches the shortest string possible.

> str_view(x,"C{2,3}?")
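Since str_view only renders in a viewer, the same patterns can be checked in the console with str_extract, which returns the first match. A sketch using just the Roman numeral part of the string:

```r
library(stringr)

roman <- "MDCCCLXXXVIII"

str_extract(roman, "CC?")      # "CC"    : one C plus an optional second C
str_extract(roman, "CC+")      # "CCC"   : one C plus as many more as possible
str_extract(roman, "C[LX]+")   # "CLXXX" : a C followed by a run of L or X

greedy <- str_extract(roman, "C{2,3}")    # takes the longest allowed run
lazy   <- str_extract(roman, "C{2,3}?")   # takes the shortest allowed run
greedy
# [1] "CCC"
lazy
# [1] "CC"
```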

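7. Grouping and backreferences

Parentheses define groups, which you can refer back to with backreferences such as \1 and \2.

For example, the following finds all fruits whose names contain a repeated pair of letters:

In this regular expression the two dots stand for any two characters, and the parentheses make them the first group. The \\1 that follows refers back to whatever the first group matched. So the pattern matches wherever a pair of adjacent characters is immediately repeated.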

> fruit<-c("banana","coconut","cucumber","jujube","papaya","apple","salal berry")
> str_view(fruit,"(..)\\1")

To show only the strings that matched, add the argument match = TRUE.
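To inspect the backreference without the viewer, str_subset keeps the strings that match and str_extract shows which repeated pair was found. This sketch reuses the fruit vector defined above:

```r
library(stringr)

fruit <- c("banana", "coconut", "cucumber", "jujube", "papaya", "apple", "salal berry")

# Strings containing some two-character pair immediately followed by itself
repeated <- str_subset(fruit, "(..)\\1")
repeated
# [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"      "salal berry"

# The matched text itself; NA where there is no match
pairs <- str_extract(fruit, "(..)\\1")
pairs
# [1] "anan" "coco" "cucu" "juju" "papa" NA     "alal"
```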

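III. Tools

This part covers other functions in stringr.

1. Detecting matches

1) str_detect

str_detect tells you whether each element of a character vector matches a pattern. It returns a logical vector of the same length as the input: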

> str_detect(fruit,"e")
[1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

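In a numeric context FALSE becomes 0 and TRUE becomes 1, which makes sum and mean useful for summarising matches across a large vector:

How many common words start with t?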

> sum(str_detect(words,"^t"))
[1] 65

What proportion of common words end with a vowel?

> mean(str_detect(words,"[aeiou]$"))
[1] 0.2765306

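When the logical condition is complicated, it is often easier to combine several str_detect calls with logical operators than to build one complicated regular expression.

The number of words that contain no vowels: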

> sum(!str_detect(words,"[aeiou]"))
[1] 6

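Another common use of str_detect is to select the elements that match a pattern:

Select all common words that end with x (both approaches give the same result):

> words[str_detect(words,"x$")]
> str_subset(words,"x$")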
[1] "box" "sex" "six" "tax"
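As a sketch of the trade-off between negating a simple pattern and writing a single, more complex one, the two approaches below find the words without vowels in stringr's built-in words vector and can be compared directly:

```r
library(stringr)

# Approach 1: find words with at least one vowel, then negate
no_vowels_1 <- !str_detect(stringr::words, "[aeiou]")

# Approach 2: one pattern that requires every character to be a non-vowel
no_vowels_2 <- str_detect(stringr::words, "^[^aeiou]+$")

# Both select the same words
same_result <- identical(no_vowels_1, no_vowels_2)
same_result
# [1] TRUE
```

The first form is usually easier to read and to get right.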

2) str_count

Rather than a simple TRUE/FALSE, str_count tells you how many matches there are in each string:

> str_count(fruit,"e")
[1] 0 0 1 1 0 1 1

On average, how many vowels does each common word contain?

> mean(str_count(words,"[aeiou]"))
[1] 1.991837
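One property of str_count that is easy to miss: matches never overlap. A minimal sketch:

```r
library(stringr)

# "aba" could start at positions 1, 3 and 5, but a match consumes its
# characters, so only the non-overlapping matches at 1 and 5 are counted
n_matches <- str_count("abababa", "aba")
n_matches
# [1] 2

found <- str_extract_all("abababa", "aba")[[1]]
found
# [1] "aba" "aba"
```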

It works naturally with mutate:

Count the vowels and consonants in each word:

> library(dplyr) # provides tibble(), mutate() and %>%
> word<-tibble(word=words)
> word %>% mutate(vowels=str_count(word,"[aeiou]"),consonants=str_count(word,"[^aeiou]"))
# A tibble: 980 x 3
   word     vowels consonants
   <chr>     <int>      <int>
 1 a             1          0
 2 able          2          2
 3 about         3          2
 4 absolute      4          4
 5 accept        2          4
 6 account       3          4
 7 achieve       4          3
 8 across        2          4
 9 act           1          2
10 active        3          3
# ... with 970 more rows

2. Extracting matches: str_extract

We will use the sentences dataset.

> length(sentences)
[1] 720
> head(sentences)
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[6] "The juice of lemons makes fine punch."

We want to find all sentences that contain a colour.

First build the regular expression:

> color<-c("yellow","blue","black","red","pink","purple")
> color_match<-str_c(color,collapse = "|")
> color_match
[1] "yellow|blue|black|red|pink|purple"

Then select the sentences that contain a colour, and extract the colour from each.

> has_color<-str_subset(sentences,color_match)
> matches<-str_extract(has_color,color_match)
> head(matches)
[1] "blue" "pink" "blue" "red"  "red"  "red"

str_extract only extracts the first match.

To see this, first select all the sentences that have more than one match, then work on those:

> more<-sentences[str_count(sentences,color_match)>1]
> str_view_all(more,color_match)

> str_extract(more,color_match) # returns only the first match
[1] "blue"  "black"
> str_extract_all(more,color_match) # returns all matches, as a list
[[1]]
[1] "blue" "red"

[[2]]
[1] "black" "red"

> str_extract_all(more,color_match,simplify = TRUE) # returns a matrix
     [,1]    [,2]
[1,] "blue"  "red"
[2,] "black" "red"

3. Grouped matches

For example, suppose we want to extract nouns from the sentences. As a heuristic, define a word as a sequence of at least one non-space character, and look for any word that follows a or the.

> noun<-"(a|the) ([^ ]+)" # group 1: a or the; one space; group 2: one or more non-space characters
> has_noun<-sentences %>% str_subset(noun) %>% head(10)
> str_extract(has_noun,noun) # gives the complete match
 [1] "the smooth" "the sheet"  "the depth"
 [4] "a chicken"  "the parked" "the sun"
 [7] "the huge"   "the ball"   "the woman"
[10] "a helps"
> str_match(has_noun,noun) # returns a matrix: the first column is the complete match, the remaining columns are the matches for each group
      [,1]         [,2]  [,3]
 [1,] "the smooth" "the" "smooth"
 [2,] "the sheet"  "the" "sheet"
 [3,] "the depth"  "the" "depth"
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked"
 [6,] "the sun"    "the" "sun"
 [7,] "the huge"   "the" "huge"
 [8,] "the ball"   "the" "ball"
 [9,] "the woman"  "the" "woman"
[10,] "a helps"    "a"   "helps"

To get all the matches for each string, use str_match_all:

> str_match_all(has_noun,noun)
[[1]]
     [,1]         [,2]  [,3]
[1,] "the smooth" "the" "smooth"

[[2]]
     [,1]        [,2]  [,3]
[1,] "the sheet" "the" "sheet"
[2,] "the dark"  "the" "dark"

[[3]]
     [,1]        [,2]  [,3]
[1,] "the depth" "the" "depth"
[2,] "a well."   "a"   "well."

[[4]]
     [,1]        [,2] [,3]
[1,] "a chicken" "a"  "chicken"
[2,] "a rare"    "a"  "rare"

[[5]]
     [,1]         [,2]  [,3]
[1,] "the parked" "the" "parked"

[[6]]
     [,1]      [,2]  [,3]
[1,] "the sun" "the" "sun"

[[7]]
     [,1]        [,2]  [,3]
[1,] "the huge"  "the" "huge"
[2,] "the clear" "the" "clear"

[[8]]
     [,1]       [,2]  [,3]
[1,] "the ball" "the" "ball"

[[9]]
     [,1]        [,2]  [,3]
[1,] "the woman" "the" "woman"

[[10]]
     [,1]           [,2]  [,3]
[1,] "a helps"      "a"   "helps"
[2,] "the evening." "the" "evening."
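In practice you usually want a single group rather than the whole matrix. The sketch below runs the same noun pattern on one made-up sentence and pulls out the third column (the second group) by position:

```r
library(stringr)

s <- "the cat sat on a mat"       # hypothetical example sentence
noun <- "(a|the) ([^ ]+)"

# str_match: first match only, as a 1 x 3 matrix
first <- str_match(s, noun)
first_word <- first[, 3]
first_word
# [1] "cat"

# str_match_all: one matrix per input string, one row per match
all_matches <- str_match_all(s, noun)[[1]]
all_words <- all_matches[, 3]
all_words
# [1] "cat" "mat"
```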

4. Replacing matches

str_replace and str_replace_all replace matches with new strings.

The simplest use is to replace a pattern with a fixed string:

> fruit<-c("apple","banana","pear","melon")
> str_replace(fruit,"[aeiou]","-")
[1] "-pple"  "b-nana" "p-ar"   "m-lon"
> str_replace_all(fruit,"[aeiou]","-")
[1] "-ppl-"  "b-n-n-" "p--r"   "m-l-n"

By supplying a named vector, str_replace_all can perform several replacements at once.

> fruit<-c("1 apple","2 banana","3 pear","4 melon")
> str_replace_all(fruit,c("1"="one","2"="two","3"="three","4"="four"))
[1] "one apple"  "two banana" "three pear" "four melon"

You can also use backreferences to insert components of the match. The following swaps the order of the second and third words:

> sentences %>%str_replace("([^ ]+) ([^ ]+) ([^ ]+)","\\1 \\3 \\2") %>% head(10)
 [1] "The canoe birch slid on the smooth planks."
 [2] "Glue sheet the to the dark blue background."
 [3] "It's to easy tell the depth of a well."
 [4] "These a days chicken leg is a rare dish."
 [5] "Rice often is served in round bowls."
 [6] "The of juice lemons makes fine punch."
 [7] "The was box thrown beside the parked truck."
 [8] "The were hogs fed chopped corn and garbage."
 [9] "Four of hours steady work faced us."
[10] "Large in size stockings is hard to sell."
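The same two techniques on small made-up inputs, where the result is easy to verify by eye:

```r
library(stringr)

# Backreferences in the replacement: \\1, \\2, \\3 are the three captured words
swapped <- str_replace("one two three four",
                       "([^ ]+) ([^ ]+) ([^ ]+)",
                       "\\1 \\3 \\2")
swapped
# [1] "one three two four"

# Named vector: names are patterns, values are replacements
spelled <- str_replace_all("1 apple and 2 pears", c("1" = "one", "2" = "two"))
spelled
# [1] "one apple and two pears"
```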

5. Splitting

str_split splits a string into pieces.

Split sentences into words:

> sentences %>% str_split(" ") %>% head(5)
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"
[6] "the"     "smooth"  "planks."

[[2]]
[1] "Glue"        "the"         "sheet"
[4] "to"          "the"         "dark"
[7] "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth"
[7] "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"
[6] "is"      "a"       "rare"    "dish."

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"
[6] "round"  "bowls."

Setting simplify = TRUE returns a matrix:

> sentences %>% str_split(" ",simplify = TRUE) %>% head(5)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth"
[4,] "These" "days"  "a"     "chicken" "leg" "is"
[5,] "Rice"  "is"    "often" "served"  "in"  "round"
     [,7]     [,8]          [,9]    [,10] [,11] [,12]
[1,] "smooth" "planks."     ""      ""    ""    ""
[2,] "blue"   "background." ""      ""    ""    ""
[3,] "of"     "a"           "well." ""    ""    ""
[4,] "a"      "rare"        "dish." ""    ""    ""
[5,] "bowls." ""            ""      ""    ""    ""

You can also limit the number of pieces:

> sentences %>% str_split(" ",n=3,simplify = TRUE) %>% head(5)
     [,1]    [,2]
[1,] "The"   "birch"
[2,] "Glue"  "the"
[3,] "It's"  "easy"
[4,] "These" "days"
[5,] "Rice"  "is"
     [,3]
[1,] "canoe slid on the smooth planks."
[2,] "sheet to the dark blue background."
[3,] "to tell the depth of a well."
[4,] "a chicken leg is a rare dish."
[5,] "often served in round bowls."

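Instead of splitting by a pattern, you can split by character, line, sentence, or word boundaries using boundary():

> str_view_all(sentences %>% head(3),boundary("word")) # split on word boundaries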
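The difference between splitting on a space and splitting on word boundaries shows up as soon as the text has punctuation or double spaces. A minimal sketch on a made-up string:

```r
library(stringr)

x <- "This is a sentence.  This is another sentence."

# Splitting on a single space keeps punctuation attached and
# produces an empty piece where there are two spaces in a row
by_space <- str_split(x, " ")[[1]]
by_space

# Splitting on word boundaries returns just the words
by_word <- str_split(x, boundary("word"))[[1]]
by_word
# [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another"  "sentence"
```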

6. Locating matches

str_locate() and str_locate_all() give the start and end position of each match.

> str_locate_all(sentences %>% head(3),"[aeiou]")
[[1]]
      start end
 [1,]     3   3
 [2,]     6   6
 [3,]    12  12
 [4,]    14  14
 [5,]    15  15
 [6,]    19  19
 [7,]    22  22
 [8,]    27  27
 [9,]    31  31
[10,]    32  32
[11,]    38  38

[[2]]
      start end
 [1,]     3   3
 [2,]     4   4
 [3,]     8   8
 [4,]    12  12
 [5,]    13  13
 [6,]    17  17
 [7,]    21  21
 [8,]    24  24
 [9,]    30  30
[10,]    31  31
[11,]    34  34
[12,]    39  39
[13,]    40  40

[[3]]
      start end
 [1,]     6   6
 [2,]     7   7
 [3,]    12  12
 [4,]    15  15
 [5,]    21  21
 [6,]    24  24
 [7,]    29  29
 [8,]    32  32
 [9,]    35  35

You can then use str_sub to extract or modify the matches as needed.
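A sketch of that locate-then-subset workflow on a short made-up string: str_locate returns a matrix with start and end columns, which can be passed to str_sub for reading or for assignment.

```r
library(stringr)

s <- "The birch canoe"

# Position of the first vowel: a 1 x 2 integer matrix with columns start and end
loc <- str_locate(s, "[aeiou]")
loc

# Extract the matched text using those positions
first_vowel <- str_sub(s, loc[, "start"], loc[, "end"])
first_vowel
# [1] "e"

# Modify the same span in place with the assignment form
str_sub(s, loc[, "start"], loc[, "end"]) <- "E"
s
# [1] "ThE birch canoe"
```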

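IV. Other Types of Pattern

The regex() function wraps a pattern so that you can control the details of the match.

With ignore_case = TRUE, the pattern matches both uppercase and lowercase letters.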

> banana<-c("banana","Banana","BANANA")
> str_view(banana,regex("banana",ignore_case = TRUE))
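The effect is easiest to see by comparing the plain pattern with the wrapped one, sketched here with str_detect:

```r
library(stringr)

banana <- c("banana", "Banana", "BANANA")

# A plain string pattern is case sensitive
case_sensitive <- str_detect(banana, "banana")
case_sensitive
# [1]  TRUE FALSE FALSE

# Wrapping it in regex() with ignore_case = TRUE matches every capitalisation
case_insensitive <- str_detect(banana, regex("banana", ignore_case = TRUE))
case_insensitive
# [1] TRUE TRUE TRUE
```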

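V. Other Uses of Regular Expressions

apropos searches all objects available from the global environment. It is useful when you cannot quite remember the name of a function: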

> apropos("replace")
 [1] "%+replace%"
 [2] ".rs.registerReplaceHook"
 [3] ".rs.replaceBinding"
 [4] ".rs.rpc.replace_comment_header"
 [5] "replace"
 [6] "replace_na"
 [7] "setReplaceMethod"
 [8] "str_replace"
 [9] "str_replace_all"
[10] "str_replace_na"
[11] "theme_replace"

dir lists all the files in a directory. The pattern argument restricts the result to file names that match the regular expression:

> dir(pattern="\\.Rmd$")
character(0)
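The result above is empty only because the working directory held no .Rmd files. To see what the pattern selects without depending on the directory contents, the sketch below applies the same regular expression to a made-up vector of file names using base R's grep:

```r
# Hypothetical file names, standing in for the output of dir()
files <- c("report.Rmd", "analysis.R", "Rmd-notes.txt", "slides.Rmd")

# \\. is a literal dot and $ anchors the match to the end of the name,
# so "Rmd-notes.txt" is not selected even though it contains "Rmd"
rmd_files <- grep("\\.Rmd$", files, value = TRUE)
rmd_files
# [1] "report.Rmd" "slides.Rmd"
```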