文章目录

  • dplyr介绍
    • 安装
    • 数据集:starwars
    • 针对单个数据集的操作
      • filter()根据条件筛选行
      • arrange()进行排序
      • slice()根据位置选择行
      • select()选择列
      • mutate()新建列
      • relocate()重排列的位置
      • summarise()汇总
  • grouped data
    • group_by()
    • 查看分组信息
      • 增加或改变用于聚合的变量
      • 移除聚合的变量
    • 联合使用
      • summarise()
      • `select()`/`rename()`/`relocate()`
      • arrange()
      • `muatate()` and `transmutate()`
      • filter()
    • Computing on grouping information
      • cur_data()
      • cur_group() and cur_group_id()
  • two-table verbs
    • 合并连接
    • 筛选连接
    • 集合操作
    • 合并连接
    • 筛选连接
    • 集合操作
  • column-wise operations
      • 陷阱
    • across其他连用
    • 和filter()连用
  • row-wide operations
    • 简介
    • 对行进行汇总统计
    • list columns
      • motivation
      • subsetting
      • modeling
    • repeated function calls
      • simulations
      • multiple combinations
      • varying functions

dplyr介绍

tidyverse系列应该算是R语言数据分析中的瑞士军刀了,统一的格式,简洁的代码,管道符便于阅读的形式,都能让大家快速上手。R数据科学就是专门讲这个系列的,但是对于很多函数的用法和细节问题,都没有说,所以在使用时还是会经常遇到各种问题。

我根据R数据科学和tidyverse官网的教程,整理了几篇笔记,主要是对tidyverse的各种函数的用法进行详细的演示。

前面已经介绍过了forcats包处理因子型数据,lubridate包处理日期时间格式数据。

下面介绍dplyr包。

在处理数据时,要明确以下几个问题:

  • 明确你的目的
  • 用计算机程序的方式描述你的任务
  • 执行程序

dplyr包可以帮你又快又简单地处理这些问题。tidyr包主要聚焦于把数据变成整洁数据,dplyr包主要功能在于对整洁数据进行各种操作,比如新增、筛选、汇总、合并等。

安装

install.packages("tidyverse")

数据集:starwars

下面使用*星战(starwars)*数据集演示基本的dplyr用法。

starwars数据集共有87行,14列,记录了星战里面的87个人物(机器人、外星人等等)的14个特点,比如姓名、身高、体重、头发颜色、眼睛颜色、种族等。

library(dplyr)
##
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
dim(starwars)
## [1] 87 14
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or~
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2~
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.~
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N~
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "~
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",~
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, ~
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",~
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini~
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T~
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma~
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return~
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp~
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",~

针对单个数据集的操作

今天介绍的都是主要针对单个数据集进行操作的各种函数,也是最常见的类型。

根据作用方式不同,大致可以分为以下3类:

  • 作用于行

    • filter()
    • slice()
    • arrange()
  • 作用于列
    • select()
    • rename()
    • mutate()
    • relocate()
  • 作用于一组数据
    • summarise()

filter()根据条件筛选行

filter()函数用于筛选符合条件的行,可以用各种表达式进行筛选,比如筛选眼睛颜色是brown并且皮肤颜色是light的行,注意这里不需要使用 & 符号:

starwars %>% filter(skin_color == "light", eye_color == "brown")
## # A tibble: 7 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Leia Or~    150    49 brown      light      brown             19 female femin~
## 2 Biggs D~    183    84 black      light      brown             24 male   mascu~
## 3 Cordé       157    NA brown      light      brown             NA female femin~
## 4 Dormé       165    NA brown      light      brown             NA female femin~
## 5 Raymus ~    188    79 brown      light      brown             NA male   mascu~
## 6 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 7 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

但是需要注意,filter()函数不支持直接使用行号进行筛选,比如说你想选择第1行到第3行,下面这种写法是错误的:

starwars %>% filter(1:3)

这种情况应该使用slice()函数:

starwars %>% slice(1:3)
## # A tibble: 3 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>
## 1 Luke Sk~    172    77 blond      fair        blue              19 male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow           112 none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red               33 none  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

arrange()进行排序

arrange()函数是用来排序的,根据某一列进行排序。

starwars %>% arrange(height, mass)
## # A tibble: 87 x 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Yoda         66    17 white      green      brown            896 male  mascu~
##  2 Ratts T~     79    15 none       grey, blue unknown           NA male  mascu~
##  3 Wicket ~     88    20 brown      brown      brown              8 male  mascu~
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu~
##  5 R2-D2        96    32 <NA>       white, bl~ red               33 none  mascu~
##  6 R4-P17       96    NA none       silver, r~ red, blue         NA none  femin~
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu~
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu~
##  9 Gasgano     122    NA none       white, bl~ black             NA male  mascu~
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

desc()函数可以进行倒序:

starwars %>% arrange(desc(height))
## # A tibble: 87 x 14
##    name    height  mass hair_color skin_color  eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>
##  1 Yarael~    264    NA none       white       yellow          NA   male  mascu~
##  2 Tarfful    234   136 brown      brown       blue            NA   male  mascu~
##  3 Lama Su    229    88 none       grey        black           NA   male  mascu~
##  4 Chewba~    228   112 brown      unknown     blue           200   male  mascu~
##  5 Roos T~    224    82 none       grey        orange          NA   male  mascu~
##  6 Grievo~    216   159 none       brown, whi~ green, y~       NA   male  mascu~
##  7 Taun We    213    NA none       grey        black           NA   fema~ femin~
##  8 Rugor ~    206    NA none       green       orange          NA   male  mascu~
##  9 Tion M~    206    80 none       grey        black           NA   male  mascu~
## 10 Darth ~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

slice()根据位置选择行

选择第5-10行的数据:

starwars %>% slice(5:10)
## # A tibble: 6 x 14
##   name     height  mass hair_color  skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
## 1 Leia Or~    150    49 brown       light      brown             19 fema~ femin~
## 2 Owen La~    178   120 brown, grey light      blue              52 male  mascu~
## 3 Beru Wh~    165    75 brown       light      blue              47 fema~ femin~
## 4 R5-D4        97    32 <NA>        white, red red               NA none  mascu~
## 5 Biggs D~    183    84 black       light      brown             24 male  mascu~
## 6 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

这其实是一组函数,还有各种变体,比如下面这个,选择前4行:

starwars %>% slice_head(n = 4)
## # A tibble: 4 x 14
##   name     height  mass hair_color skin_color  eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>
## 1 Luke Sk~    172    77 blond      fair        blue            19   male  mascu~
## 2 C-3PO       167    75 <NA>       gold        yellow         112   none  mascu~
## 3 R2-D2        96    32 <NA>       white, blue red             33   none  mascu~
## 4 Darth V~    202   136 none       white       yellow          41.9 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择最后面的10%的行:

starwars %>% slice_tail(prop = 0.1)
## # A tibble: 8 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex    gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Sly Moo~    178    48 none       pale       white             NA <NA>   <NA>
## 2 Tion Me~    206    80 none       grey       black             NA male   mascu~
## 3 Finn         NA    NA black      dark       dark              NA male   mascu~
## 4 Rey          NA    NA brown      light      hazel             NA female femin~
## 5 Poe Dam~     NA    NA brown      light      brown             NA male   mascu~
## 6 BB8          NA    NA none       none       black             NA none   mascu~
## 7 Captain~     NA    NA unknown    unknown    unknown           NA <NA>   <NA>
## 8 Padmé A~    165    45 brown      light      brown             46 female femin~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行:

starwars %>% slice_sample(prop = 0.1) # n=2
## # A tibble: 8 x 14
##   name    height  mass hair_color skin_color   eye_color birth_year sex   gender
##   <chr>    <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr>
## 1 Lobot      175    79 none       light        blue              37 male  mascu~
## 2 Zam We~    168    55 blonde     fair, green~ yellow            NA fema~ femin~
## 3 Ric Ol~    183    NA brown      fair         blue              NA <NA>  <NA>
## 4 R4-P17      96    NA none       silver, red  red, blue         NA none  femin~
## 5 Lando ~    177    79 black      dark         brown             31 male  mascu~
## 6 Greedo     173    74 <NA>       green        black             44 male  mascu~
## 7 Ackbar     180    83 none       brown mottle orange            41 male  mascu~
## 8 Rugor ~    206    NA none       green        orange            NA male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

随机选择10%的行,可以重复:

starwars %>% slice_sample(n=10, replace = T)
## # A tibble: 10 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Jek To~    180   110 brown       fair       blue              NA male  mascu~
##  2 Quarsh~    183    NA black       dark       brown             62 <NA>  <NA>
##  3 Arvel ~     NA    NA brown       fair       brown             NA male  mascu~
##  4 Darth ~    175    80 none        red        yellow            54 male  mascu~
##  5 Finn        NA    NA black       dark       dark              NA male  mascu~
##  6 Ric Ol~    183    NA brown       fair       blue              NA <NA>  <NA>
##  7 Mace W~    188    84 none        dark       brown             72 male  mascu~
##  8 Jango ~    183    79 black       tan        brown             66 male  mascu~
##  9 San Hi~    191    NA none        grey       gold              NA male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray         57 male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

选择某一列中最大或者最小的几个值所在的行,注意不能有NA值:

starwars %>% filter(!is.na(height)) %>% slice_max(height, n = 5) # 选择的这列不能有NA
## # A tibble: 5 x 14
##   name     height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
## 1 Yarael ~    264    NA none       white      yellow            NA male  mascul~
## 2 Tarfful     234   136 brown      brown      blue              NA male  mascul~
## 3 Lama Su     229    88 none       grey       black             NA male  mascul~
## 4 Chewbac~    228   112 brown      unknown    blue             200 male  mascul~
## 5 Roos Ta~    224    82 none       grey       orange            NA male  mascul~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

select()选择列

直接根据列名选择列,列名不需要使用引号:

starwars %>% select(hair_color, skin_color, eye_color)
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>
##  1 blond         fair        blue
##  2 <NA>          gold        yellow
##  3 <NA>          white, blue red
##  4 none          white       yellow
##  5 brown         light       brown
##  6 brown, grey   light       blue
##  7 brown         light       blue
##  8 <NA>          white, red  red
##  9 black         light       brown
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中以color结尾的列:

starwars %>% select(ends_with("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>
##  1 blond         fair        blue
##  2 <NA>          gold        yellow
##  3 <NA>          white, blue red
##  4 none          white       yellow
##  5 brown         light       brown
##  6 brown, grey   light       blue
##  7 brown         light       blue
##  8 <NA>          white, red  red
##  9 black         light       brown
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

选择列名中包含color字样的列:

starwars %>% select(contains("color"))
## # A tibble: 87 x 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>
##  1 blond         fair        blue
##  2 <NA>          gold        yellow
##  3 <NA>          white, blue red
##  4 none          white       yellow
##  5 brown         light       brown
##  6 brown, grey   light       blue
##  7 brown         light       blue
##  8 <NA>          white, red  red
##  9 black         light       brown
## 10 auburn, white fair        blue-gray
## # ... with 77 more rows

重命名列:

starwars %>% rename(home_world = homeworld)
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: home_world <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

mutate()新建列

starwars %>% mutate(height_m = height/100, .before = 1)
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none
##  9     1.83 Biggs~    183    84 black      light      brown           24   male
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

这个效果和上面是一样的,都是把新建的列放在最前面:

starwars %>% mutate(height_m = height/100) %>% select(height_m, everything())
## # A tibble: 87 x 15
##    height_m name   height  mass hair_color skin_color eye_color birth_year sex
##       <dbl> <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
##  1     1.72 Luke ~    172    77 blond      fair       blue            19   male
##  2     1.67 C-3PO     167    75 <NA>       gold       yellow         112   none
##  3     0.96 R2-D2      96    32 <NA>       white, bl~ red             33   none
##  4     2.02 Darth~    202   136 none       white      yellow          41.9 male
##  5     1.5  Leia ~    150    49 brown      light      brown           19   fema~
##  6     1.78 Owen ~    178   120 brown, gr~ light      blue            52   male
##  7     1.65 Beru ~    165    75 brown      light      blue            47   fema~
##  8     0.97 R5-D4      97    32 <NA>       white, red red             NA   none
##  9     1.83 Biggs~    183    84 black      light      brown           24   male
## 10     1.82 Obi-W~    182    77 auburn, w~ fair       blue-gray       57   male
## # ... with 77 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>

新建的列可以直接被使用:

starwars %>%mutate(height_m = height / 100,BMI = mass / (height_m^2)) %>%select(BMI, everything())
## # A tibble: 87 x 16
##      BMI name     height  mass hair_color  skin_color eye_color birth_year sex
##    <dbl> <chr>     <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr>
##  1  26.0 Luke Sk~    172    77 blond       fair       blue            19   male
##  2  26.9 C-3PO       167    75 <NA>        gold       yellow         112   none
##  3  34.7 R2-D2        96    32 <NA>        white, bl~ red             33   none
##  4  33.3 Darth V~    202   136 none        white      yellow          41.9 male
##  5  21.8 Leia Or~    150    49 brown       light      brown           19   fema~
##  6  37.9 Owen La~    178   120 brown, grey light      blue            52   male
##  7  27.5 Beru Wh~    165    75 brown       light      blue            47   fema~
##  8  34.0 R5-D4        97    32 <NA>        white, red red             NA   none
##  9  25.1 Biggs D~    183    84 black       light      brown           24   male
## 10  23.2 Obi-Wan~    182    77 auburn, wh~ fair       blue-gray       57   male
## # ... with 77 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
## #   species <chr>, films <list>, vehicles <list>, starships <list>,
## #   height_m <dbl>

只保留新建的列,其他列不要了:

starwars %>%transmute(height_m = height / 100,BMI = mass / (height_m^2))
## # A tibble: 87 x 2
##    height_m   BMI
##       <dbl> <dbl>
##  1     1.72  26.0
##  2     1.67  26.9
##  3     0.96  34.7
##  4     2.02  33.3
##  5     1.5   21.8
##  6     1.78  37.9
##  7     1.65  27.5
##  8     0.97  34.0
##  9     1.83  25.1
## 10     1.82  23.2
## # ... with 77 more rows

relocate()重排列的位置

主要是使用.before.after参数,控制位置:

starwars %>% relocate(sex:homeworld, .before = height)
## # A tibble: 87 x 14
##    name     sex    gender homeworld height  mass hair_color skin_color eye_color
##    <chr>    <chr>  <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>
##  1 Luke Sk~ male   mascu~ Tatooine     172    77 blond      fair       blue
##  2 C-3PO    none   mascu~ Tatooine     167    75 <NA>       gold       yellow
##  3 R2-D2    none   mascu~ Naboo         96    32 <NA>       white, bl~ red
##  4 Darth V~ male   mascu~ Tatooine     202   136 none       white      yellow
##  5 Leia Or~ female femin~ Alderaan     150    49 brown      light      brown
##  6 Owen La~ male   mascu~ Tatooine     178   120 brown, gr~ light      blue
##  7 Beru Wh~ female femin~ Tatooine     165    75 brown      light      blue
##  8 R5-D4    none   mascu~ Tatooine      97    32 <NA>       white, red red
##  9 Biggs D~ male   mascu~ Tatooine     183    84 black      light      brown
## 10 Obi-Wan~ male   mascu~ Stewjon      182    77 auburn, w~ fair       blue-gray
## # ... with 77 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

summarise()汇总

一般来说和group_by()连用才能发挥威力。

starwars %>% summarise(height = mean(height, na.rm = T))
## # A tibble: 1 x 1
##   height
##    <dbl>
## 1   174.

今天主要是对dplyr有一个大致的认识,熟悉下最常见的操作,后面会根据不同的应用场景继续介绍更多的内容。

grouped data

在现实生活中我们经常会遇到非常多需要分组汇总的情况,单个的汇总价值不大,只有分组之后,才能看出差异,才能表现出数据的价值。

dplyr为我们提供了group_by()函数,主要使用group_by()对数据进行分组,然后再进行各种计算,通过和其他操作进行连接,发挥更加强大的作用。

group_by()

先建立2个分组数据进行演示,还是使用星战数据集。

by_species <- starwars %>% group_by(species)
by_sex_gender <- starwars %>% group_by(sex, gender)

看看这两个对象有什么不同,可以看出和原数据集没什么不同,但是都被分组了

by_species
## # A tibble: 87 x 14
## # Groups:   species [38]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
by_sex_gender
## # A tibble: 87 x 14
## # Groups:   sex, gender [6]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

使用tally()函数进行计数:

by_species %>% tally(sort = T)
## # A tibble: 38 x 2
##    species      n
##    <chr>    <int>
##  1 Human       35
##  2 Droid        6
##  3 <NA>         4
##  4 Gungan       3
##  5 Kaminoan     2
##  6 Mirialan     2
##  7 Twi'lek      2
##  8 Wookiee      2
##  9 Zabrak       2
## 10 Aleena       1
## # ... with 28 more rows

和下面这个操作是一样的效果:

by_species %>% summarise(n=n())
## # A tibble: 38 x 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ... with 28 more rows

除了根据现有的变量进行分组外,还可以根据现有变量的函数进行分组,这样做类似于先mutate()group_by()

bmi_breaks <- c(0,18.5,25,30,Inf)starwars %>% group_by(bmi_cat = cut(mass/(height/100)^2,breaks = bmi_breaks)) %>% tally(sort = T)
## # A tibble: 5 x 2
##   bmi_cat       n
##   <fct>     <int>
## 1 <NA>         28
## 2 (18.5,25]    24
## 3 (25,30]      13
## 4 (30,Inf]     12
## 5 (0,18.5]     10

是不是很神奇?

查看分组信息

group_keys()查看用于分组的组内有哪些类别,可以看到species有38种:

by_species %>% group_keys()
## # A tibble: 38 x 1
##    species
##    <chr>
##  1 Aleena
##  2 Besalisk
##  3 Cerean
##  4 Chagrian
##  5 Clawdite
##  6 Droid
##  7 Dug
##  8 Ewok
##  9 Geonosian
## 10 Gungan
## # ... with 28 more rows
by_sex_gender %>% group_keys()
## # A tibble: 6 x 2
##   sex            gender
##   <chr>          <chr>
## 1 female         feminine
## 2 hermaphroditic masculine
## 3 male           masculine
## 4 none           feminine
## 5 none           masculine
## 6 <NA>           <NA>

group_indices()查看每一行属于哪个组:

by_species %>% group_indices() # which group each row belongs to
##  [1] 11  6  6 11 11 11 11  6 11 11 11 11 34 11 24 12 11 11 36 11 11  6 31 11 11
## [26] 18 11 11  8 26 11 21 11 10 10 10 38 30  7 38 11 37 32 32 33 35 29 11  3 20
## [51] 37 27 13 23 16  4 11 11 11  9 17 17 11 11 11 11  5  2 15 15 11  1  6 25 19
## [76] 28 14 34 11 38 22 11 11 11  6 38 11

group_rows()查看每个组包括哪些行:

by_species %>% group_rows() # which rows each group contains with
## <list_of<integer>[38]>
## [[1]]
## [1] 72
##
## [[2]]
## [1] 68
##
## [[3]]
## [1] 49
##
## [[4]]
## [1] 56
##
## [[5]]
## [1] 67
##
## [[6]]
## [1]  2  3  8 22 73 85
##
## [[7]]
## [1] 39
##
## [[8]]
## [1] 29
##
## [[9]]
## [1] 60
##
## [[10]]
## [1] 34 35 36
##
## [[11]]
##  [1]  1  4  5  6  7  9 10 11 12 14 17 18 20 21 24 25 27 28 31 33 41 48 57 58 59
## [26] 63 64 65 66 71 79 82 83 84 87
##
## [[12]]
## [1] 16
##
## [[13]]
## [1] 53
##
## [[14]]
## [1] 77
##
## [[15]]
## [1] 69 70
##
## [[16]]
## [1] 55
##
## [[17]]
## [1] 61 62
##
## [[18]]
## [1] 26
##
## [[19]]
## [1] 75
##
## [[20]]
## [1] 50
##
## [[21]]
## [1] 32
##
## [[22]]
## [1] 81
##
## [[23]]
## [1] 54
##
## [[24]]
## [1] 15
##
## [[25]]
## [1] 74
##
## [[26]]
## [1] 30
##
## [[27]]
## [1] 52
##
## [[28]]
## [1] 76
##
## [[29]]
## [1] 47
##
## [[30]]
## [1] 38
##
## [[31]]
## [1] 23
##
## [[32]]
## [1] 43 44
##
## [[33]]
## [1] 45
##
## [[34]]
## [1] 13 78
##
## [[35]]
## [1] 46
##
## [[36]]
## [1] 19
##
## [[37]]
## [1] 42 51
##
## [[38]]
## [1] 37 40 80 86

group_vars()查看用于聚合的变量名字:

by_sex_gender %>% group_vars() # the name of the grouping variable
## [1] "sex"    "gender"

增加或改变用于聚合的变量

如果把group_by()作用于已经聚合的变量,那数据会被覆盖,比如下面这个,by_species已经被species聚合了,再通过homeworld聚合,那结果只是homeworld的结果:

by_species %>% group_by(homeworld) %>% tally()
## # A tibble: 49 x 2
##    homeworld          n
##    <chr>          <int>
##  1 Alderaan           3
##  2 Aleen Minor        1
##  3 Bespin             1
##  4 Bestine IV         1
##  5 Cato Neimoidia     1
##  6 Cerea              1
##  7 Champala           1
##  8 Chandrila          1
##  9 Concord Dawn       1
## 10 Corellia           2
## # ... with 39 more rows

是不是之前没注意过这些小问题?

通过使用一个参数可以避免这个问题:

by_species %>% group_by(homeworld, .add = T) %>% tally()
## # A tibble: 58 x 3
## # Groups:   species [38]
##    species  homeworld       n
##    <chr>    <chr>       <int>
##  1 Aleena   Aleen Minor     1
##  2 Besalisk Ojom            1
##  3 Cerean   Cerea           1
##  4 Chagrian Champala        1
##  5 Clawdite Zolan           1
##  6 Droid    Naboo           1
##  7 Droid    Tatooine        2
##  8 Droid    <NA>            3
##  9 Dug      Malastare       1
## 10 Ewok     Endor           1
## # ... with 48 more rows

移除聚合的变量

一个被聚合的数据如果不解除聚合,那么后面的操作都会以聚合后的结果呈现出来,所以聚合之后一定要记得解除聚合!

by_species %>% ungroup() %>% tally()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    87
by_sex_gender %>% ungroup(sex) %>% tally()
## # A tibble: 3 x 2
##   gender        n
##   <chr>     <int>
## 1 feminine     17
## 2 masculine    66
## 3 <NA>          4

联合使用

下面这部分主要介绍group_by和其他函数的联合使用:

summarise()

by_species %>%summarise(n = n(),height = mean(height, na.rm = TRUE))
## # A tibble: 38 x 3
##    species       n height
##    <chr>     <int>  <dbl>
##  1 Aleena        1    79
##  2 Besalisk      1   198
##  3 Cerean        1   198
##  4 Chagrian      1   196
##  5 Clawdite      1   168
##  6 Droid         6   131.
##  7 Dug           1   112
##  8 Ewok          1    88
##  9 Geonosian     1   183
## 10 Gungan        3   209.
## # ... with 28 more rows

control the grouping variables

通过.groups参数控制聚合变量:

by_sex_gender %>% summarise(n = n()) %>% group_vars()
## `summarise()` has grouped output by 'sex'. You can override using the
## `.groups` argument.
## [1] "sex"# 只通过sex进行聚合
by_sex_gender %>% summarise(n = n(), .groups = "drop_last") %>% group_vars()
## [1] "sex"
by_sex_gender %>% summarise(n = n(), .groups = "keep") %>% group_vars()
## [1] "sex"    "gender"# 不聚合了
by_sex_gender %>% summarise(n = n(), .groups = "drop") %>% group_vars()
## character(0)

select()/rename()/relocate()

by_species %>% select(mass) # grouped by species
## Adding missing grouping variables: `species`
## # A tibble: 87 x 2
## # Groups:   species [38]
##    species  mass
##    <chr>   <dbl>
##  1 Human      77
##  2 Droid      75
##  3 Droid      32
##  4 Human     136
##  5 Human      49
##  6 Human     120
##  7 Human      75
##  8 Droid      32
##  9 Human      84
## 10 Human      77
## # ... with 77 more rows
by_species %>% ungroup() %>% select(mass)
## # A tibble: 87 x 1
##     mass
##    <dbl>
##  1    77
##  2    75
##  3    32
##  4   136
##  5    49
##  6   120
##  7    75
##  8    32
##  9    84
## 10    77
## # ... with 77 more rows

arrange()

by_species %>% arrange(desc(mass)) %>% relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Hutt      1358 Jabba~    175 <NA>       green-tan~ orange         600   herm~
##  2 Kaleesh    159 Griev~    216 none       brown, wh~ green, y~       NA   male
##  3 Droid      140 IG-88     200 none       metal      red             15   none
##  4 Human      136 Darth~    202 none       white      yellow          41.9 male
##  5 Wookiee    136 Tarff~    234 brown      brown      blue            NA   male
##  6 Human      120 Owen ~    178 brown, gr~ light      blue            52   male
##  7 Trandos~   113 Bossk     190 none       green      red             53   male
##  8 Wookiee    112 Chewb~    228 brown      unknown    blue           200   male
##  9 Human      110 Jek T~    180 brown      fair       blue            NA   male
## 10 Besalisk   102 Dexte~    198 none       brown      yellow          NA   male
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

通过.by_group参数控制进行排序的先后位置,下面这个例子就是先根据species进行排序,再根据mass进行排序,和上面的不一样哦!

by_species %>% arrange(desc(mass), .by_group = T) %>% relocate(species, mass)
## # A tibble: 87 x 14
## # Groups:   species [38]
##    species   mass name   height hair_color skin_color eye_color birth_year sex
##    <chr>    <dbl> <chr>   <int> <chr>      <chr>      <chr>          <dbl> <chr>
##  1 Aleena      15 Ratts~     79 none       grey, blue unknown           NA male
##  2 Besalisk   102 Dexte~    198 none       brown      yellow            NA male
##  3 Cerean      82 Ki-Ad~    198 white      pale       yellow            92 male
##  4 Chagrian    NA Mas A~    196 none       blue       blue              NA male
##  5 Clawdite    55 Zam W~    168 blonde     fair, gre~ yellow            NA fema~
##  6 Droid      140 IG-88     200 none       metal      red               15 none
##  7 Droid       75 C-3PO     167 <NA>       gold       yellow           112 none
##  8 Droid       32 R2-D2      96 <NA>       white, bl~ red               33 none
##  9 Droid       32 R5-D4      97 <NA>       white, red red               NA none
## 10 Droid       NA R4-P17     96 none       silver, r~ red, blue         NA none
## # ... with 77 more rows, and 5 more variables: gender <chr>, homeworld <chr>,
## #   films <list>, vehicles <list>, starships <list>

muatate() and transmutate()

starwars %>% select(name, homeworld, mass) %>% group_by(homeworld) %>% mutate(means = mean(mass, na.rm = T), standard_mass = mass - mean(mass, na.rm = T))
## # A tibble: 87 x 5
## # Groups:   homeworld [49]
##    name               homeworld  mass means standard_mass
##    <chr>              <chr>     <dbl> <dbl>         <dbl>
##  1 Luke Skywalker     Tatooine     77  85.4         -8.38
##  2 C-3PO              Tatooine     75  85.4        -10.4
##  3 R2-D2              Naboo        32  64.2        -32.2
##  4 Darth Vader        Tatooine    136  85.4         50.6
##  5 Leia Organa        Alderaan     49  64          -15
##  6 Owen Lars          Tatooine    120  85.4         34.6
##  7 Beru Whitesun lars Tatooine     75  85.4        -10.4
##  8 R5-D4              Tatooine     32  85.4        -53.4
##  9 Biggs Darklighter  Tatooine     84  85.4         -1.38
## 10 Obi-Wan Kenobi     Stewjon      77  77            0
## # ... with 77 more rows

min_rank()函数返回顺序(秩次):

# Overall rank
starwars %>% select(name, homeworld, height) %>% mutate(rank = min_rank(height))
## # A tibble: 87 x 4
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172    29
##  2 C-3PO              Tatooine     167    21
##  3 R2-D2              Naboo         96     5
##  4 Darth Vader        Tatooine     202    72
##  5 Leia Organa        Alderaan     150    11
##  6 Owen Lars          Tatooine     178    35
##  7 Beru Whitesun lars Tatooine     165    17
##  8 R5-D4              Tatooine      97     7
##  9 Biggs Darklighter  Tatooine     183    45
## 10 Obi-Wan Kenobi     Stewjon      182    44
## # ... with 77 more rows

先根据homeworld进行分组,再新建列:

# Rank per homeworld
starwars %>% select(name, homeworld, height) %>% group_by(homeworld) %>% mutate(rank = min_rank(height))
## # A tibble: 87 x 4
## # Groups:   homeworld [49]
##    name               homeworld height  rank
##    <chr>              <chr>      <int> <int>
##  1 Luke Skywalker     Tatooine     172     5
##  2 C-3PO              Tatooine     167     4
##  3 R2-D2              Naboo         96     1
##  4 Darth Vader        Tatooine     202    10
##  5 Leia Organa        Alderaan     150     1
##  6 Owen Lars          Tatooine     178     6
##  7 Beru Whitesun lars Tatooine     165     3
##  8 R5-D4              Tatooine      97     1
##  9 Biggs Darklighter  Tatooine     183     7
## 10 Obi-Wan Kenobi     Stewjon      182     1
## # ... with 77 more rows

filter()

筛选每个物种(species)中最高(height)的那一个:

by_species %>%select(name, species, height) %>% filter(height == max(height))
## # A tibble: 35 x 3
## # Groups:   species [35]
##    name                  species        height
##    <chr>                 <chr>           <int>
##  1 Greedo                Rodian            173
##  2 Jabba Desilijic Tiure Hutt              175
##  3 Yoda                  Yoda's species     66
##  4 Bossk                 Trandoshan        190
##  5 Ackbar                Mon Calamari      180
##  6 Wicket Systri Warrick Ewok               88
##  7 Nien Nunb             Sullustan         160
##  8 Nute Gunray           Neimodian         191
##  9 Roos Tarpals          Gungan            224
## 10 Watto                 Toydarian         137
## # ... with 25 more rows

去掉只有1个成员的物种:

by_species %>%filter(n() != 1) %>% tally()
## # A tibble: 9 x 2
##   species      n
##   <chr>    <int>
## 1 Droid        6
## 2 Gungan       3
## 3 Human       35
## 4 Kaminoan     2
## 5 Mirialan     2
## 6 Twi'lek      2
## 7 Wookiee      2
## 8 Zabrak       2
## 9 <NA>         4

Computing on grouping information

在dplyr verbs内部,可以使用带有cur前缀的函数族访问当前组的各种属性。

cur_data()

cur_data() returns the current group, excluding grouping variables. It’s useful to feed to functions that take a whole data frame. For example, the following code fits a linear model of mass ~ height to each species:

by_species %>%filter(n() > 1) %>% mutate(mod = list(lm(mass ~ height, data = cur_data())))
## # A tibble: 58 x 15
## # Groups:   species [9]
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 48 more rows, and 6 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, mod <list>

cur_group() and cur_group_id()

by_species %>%arrange(species) %>% select(name, species, homeworld) %>% mutate(id = cur_group_id())
## # A tibble: 87 x 4
## # Groups:   species [38]
##    name            species  homeworld      id
##    <chr>           <chr>    <chr>       <int>
##  1 Ratts Tyerell   Aleena   Aleen Minor     1
##  2 Dexter Jettster Besalisk Ojom            2
##  3 Ki-Adi-Mundi    Cerean   Cerea           3
##  4 Mas Amedda      Chagrian Champala        4
##  5 Zam Wesell      Clawdite Zolan           5
##  6 C-3PO           Droid    Tatooine        6
##  7 R2-D2           Droid    Naboo           6
##  8 R5-D4           Droid    Tatooine        6
##  9 IG-88           Droid    <NA>            6
## 10 R4-P17          Droid    <NA>            6
## # ... with 77 more rows

two-table verbs

可以参考R数据科学这本书中的介绍,非常详细。

处理两个数据集的函数。

  • 根据另一个表的变量新建数据
  • 根据另一个表筛选

合并连接

  • 内连接

    • inner_join()
  • 外连接

    • 左连接left_join():保留 x 中的所有观测。
    • 右连接right_join():保留 y 中的所有观测
    • 全连接full_join():保留 x 和 y 中的所有观测。

筛选连接

  • semi_join(x, y):保留x表中与y表中的观测相匹配的所有观测
  • anti_join(x, y):丢弃x表中与y表中的观测相匹配的所有观测

集合操作

  • intersect(x, y):返回既在 x 表,又在 y 表中的观测
  • union(x, y):返回 x 表或 y 表中的唯一观测
  • setdiff(x, y):返回在 x 表,但不在 y 表中的观测

合并连接

library(nycflights13)
library(dplyr)
# 选择部分数据方便演示
flights2 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
glimpse(flights2)
## Rows: 336,776
## Columns: 8
## $ year    <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 20~
## $ month   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour    <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6,~
## $ origin  <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK",~
## $ dest    <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO",~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B~
glimpse(airlines)
## Rows: 16
## Columns: 2
## $ carrier <chr> "9E", "AA", "AS", "B6", "DL", "EV", "F9", "FL", "HA", "MQ", "O~
## $ name    <chr> "Endeavor Air Inc.", "American Airlines Inc.", "Alaska Airline~

有一列共有的,carrier

flights2 %>% left_join(airlines)
## Joining, by = "carrier"
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier name
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.
## # ... with 336,766 more rows

by = NULL是默认。

glimpse(weather)
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EW~
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,~
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, ~
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39.~
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28.~
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62.~
## $ wind_dir   <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330,~
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780, ~
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20.~
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 101~
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
## $ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:00~
flights2 %>% left_join(weather)
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 18
##     year month   day  hour origin dest  tailnum carrier  temp  dewp humid
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
## # ... with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
## #   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>
glimpse(planes)
## Rows: 3,322
## Columns: 9
## $ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW~
## $ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199~
## $ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi~
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU~
## $ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145~
## $ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5~
## $ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
## $ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb~
flights2 %>% left_join(planes, by = "tailnum")
## # A tibble: 336,776 x 16
##    year.x month   day  hour origin dest  tailnum carrier year.y type
##     <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>
##  1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult~
##  2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult~
##  3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult~
##  4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult~
##  5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult~
##  6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult~
##  7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult~
##  8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult~
##  9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult~
## 10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>
## # ... with 336,766 more rows, and 6 more variables: manufacturer <chr>,
## #   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
glimpse(airports)
## Rows: 1,458
## Columns: 8
## $ faa   <chr> "04G", "06A", "06C", "06N", "09J", "0A9", "0G6", "0G7", "0P2", "~
## $ name  <chr> "Lansdowne Airport", "Moton Field Municipal Airport", "Schaumbur~
## $ lat   <dbl> 41.13047, 32.46057, 41.98934, 41.43191, 31.07447, 36.37122, 41.4~
## $ lon   <dbl> -80.61958, -85.68003, -88.10124, -74.39156, -81.42778, -82.17342~
## $ alt   <dbl> 1044, 264, 801, 523, 11, 1593, 730, 492, 1000, 108, 409, 875, 10~
## $ tz    <dbl> -5, -6, -6, -5, -5, -5, -5, -5, -5, -8, -5, -6, -5, -5, -5, -5, ~
## $ dst   <chr> "A", "A", "A", "A", "A", "A", "A", "A", "U", "A", "A", "U", "A",~
## $ tzone <chr> "America/New_York", "America/Chicago", "America/Chicago", "Ameri~

如果两个数据集中列名不一样也可以进行合并:

flights2 %>% left_join(airports, c("dest" = "faa"))
## # A tibble: 336,776 x 15
##     year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Georg~  30.0 -95.3    97
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Georg~  30.0 -95.3    97
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami~  25.8 -80.3     8
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts~  33.6 -84.4  1026
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Chica~  42.0 -87.9   668
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort ~  26.1 -80.2     9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi~  38.9 -77.5   313
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan~  28.4 -81.3    96
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica~  42.0 -87.9   668
## # ... with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
## #   tzone <chr>

下面是一个简单的左连接的例子:

df1 <- tibble(x = c(1, 2), y = 2:1)
df2 <- tibble(x = c(3, 1), a = 10, b = "a")df1
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     2
## 2     2     1
df2
## # A tibble: 2 x 3
##       x     a b
##   <dbl> <dbl> <chr>
## 1     3    10 a
## 2     1    10 a
df1 %>% inner_join(df2)
## Joining, by = "x"
## # A tibble: 1 x 4
##       x     y     a b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a
df1 %>% left_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a
## 2     2     1    NA <NA>

右连接:

df1 %>% right_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     y     a b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a
## 2     3    NA    10 a
df2 %>% left_join(df1)
## Joining, by = "x"
## # A tibble: 2 x 4
##       x     a b         y
##   <dbl> <dbl> <chr> <int>
## 1     3    10 a        NA
## 2     1    10 a         2

全连接:

df1 %>% full_join(df2)
## Joining, by = "x"
## # A tibble: 3 x 4
##       x     y     a b
##   <dbl> <int> <dbl> <chr>
## 1     1     2    10 a
## 2     2     1    NA <NA>
## 3     3    NA    10 a

筛选连接

df1 <- tibble(x = c(1, 1, 3, 4), y = 1:4)
df2 <- tibble(x = c(1, 1, 2), z = c("a", "b", "a"))df1
## # A tibble: 4 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2
## 3     3     3
## 4     4     4
df2
## # A tibble: 3 x 2
##       x z
##   <dbl> <chr>
## 1     1 a
## 2     1 b
## 3     2 a
df1 %>% nrow()
## [1] 4df1 %>% inner_join(df2, by = "x")
## # A tibble: 4 x 3
##       x     y z
##   <dbl> <int> <chr>
## 1     1     1 a
## 2     1     1 b
## 3     1     2 a
## 4     1     2 bdf1 %>% semi_join(df2, by = "x")
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     1     1
## 2     1     2df1 %>% anti_join(df2)
## Joining, by = "x"
## # A tibble: 2 x 2
##       x     y
##   <dbl> <int>
## 1     3     3
## 2     4     4

集合操作

(df1 <- tibble(x = 1:2, y = c(1L, 1L)))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
(df2 <- tibble(x = 1:2, y = 1:2))
## # A tibble: 2 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2
intersect(df1, df2) # 取交集
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     1     1union(df1, df2) # 并集
## # A tibble: 3 x 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
## 3     2     2setdiff(df1, df2)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     1
setdiff(df2, df1)
## # A tibble: 1 x 2
##       x     y
##   <int> <int>
## 1     2     2

下面是一些集合操作的示意图:

column-wise operations

主要是介绍across函数的用法,这是dplyr1.0才出来的一个函数,大大简化了代码

可用于对多列做同一个操作。

library(dplyr, warn.conflicts = FALSE)

across()有两个基本参数:

  • .cols:选择你想操作的列
  • .fn:你想进行的操作,可以使一个函数或者多个函数组成的列表

可以替代_if(),at_(),all_()

starwars %>% summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 8
##    name hair_color skin_color eye_color   sex gender homeworld species
##   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
## 1    87         13         31        15     5      3        49      38

可以直接写列名:

starwars %>% group_by(species) %>% filter(n() > 1) %>% summarise(across(c(sex, gender, homeworld), n_distinct))
## # A tibble: 9 x 4
##   species    sex gender homeworld
##   <chr>    <int>  <int>     <int>
## 1 Droid        1      2         3
## 2 Gungan       1      1         1
## 3 Human        2      2        16
## 4 Kaminoan     2      2         1
## 5 Mirialan     1      1         1
## 6 Twi'lek      2      2         1
## 7 Wookiee      1      1         1
## 8 Zabrak       1      1         2
## 9 <NA>         1      1         3

也可以和where函数连用,省时省力:

starwars %>% group_by(homeworld) %>% filter(n() > 1) %>% summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 10 x 4
##    homeworld height  mass birth_year
##    <chr>      <dbl> <dbl>      <dbl>
##  1 Alderaan    176.  64         43
##  2 Corellia    175   78.5       25
##  3 Coruscant   174.  50         91
##  4 Kamino      208.  83.1       31.5
##  5 Kashyyyk    231  124        200
##  6 Mirial      168   53.1       49
##  7 Naboo       175.  64.2       55
##  8 Ryloth      179   55         48
##  9 Tatooine    170.  85.4       54.6
## 10 <NA>        139.  82        334.

如果没有缺失值,可以直接写mean,

library(tidyr)
starwars %>% drop_na() %>% group_by(homeworld) %>% filter(n() > 1) %>% summarise(across(where(is.numeric), mean))
## # A tibble: 4 x 4
##   homeworld height  mass birth_year
##   <chr>      <dbl> <dbl>      <dbl>
## 1 Corellia    175   78.5       25
## 2 Mirial      168   53.1       49
## 3 Naboo       177   62         60
## 4 Tatooine    181.  96         37.6

acorss支持多个函数同时使用,只要放入列表中即可:

min_max <- list(min = ~min(.x, na.rm = TRUE), max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896
starwars %>% summarise(across(c(height, mass, birth_year), min_max))
## # A tibble: 1 x 6
##   height_min height_max mass_min mass_max birth_year_min birth_year_max
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

当然也是支持glue的:

starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896
starwars %>% summarise(across(c(height, mass, birth_year), min_max, .names = "{.fn}.{.col}"))
## # A tibble: 1 x 6
##   min.height max.height min.mass max.mass min.birth_year max.birth_year
##        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
## 1         66        264       15     1358              8            896

分开写也是可以的:

starwars %>% summarise(across(c(height, mass, birth_year), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),across(c(height, mass, birth_year), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

这种情况不能使用where(is.numeric),因为第2个across会使用新创建的列(“min_height”, “min_mass” and “min_birth_year”)。

可以放在tibble里解决:

starwars %>% summarise(tibble(across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")  )
)
## # A tibble: 1 x 6
##   min_height min_mass min_birth_year max_height max_mass max_birth_year
##        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
## 1         66       15              8        264     1358            896

陷阱

在使用where(is.numeric),要注意下面这种情况:

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))df %>% summarise(n = n(), across(where(is.numeric), sd))
##    n x        y
## 1 NA 1 4.041452

n这里是3,是一个常数,所以它的sd变成了NA,可以通过换一下顺序解决:

df %>% summarise(across(where(is.numeric), sd),n = n())
##   x        y n
## 1 1 4.041452 3

或者通过下面两种方法解决:

df %>% summarise(n = n(), across(where(is.numeric) & !n, sd))
##   n x        y
## 1 3 1 4.041452df %>% summarise(tibble(n = n(), across(where(is.numeric), sd)))
##   n x        y
## 1 3 1 4.041452

across其他连用

还可以和group_by()/count()/distinct()连用。

和filter()连用

across()不能直接和filter()连用,和filter()连用的是if_any()if_all()

  • if_any():任何一列满足条件即可
  • if_all():所有列都要满足条件
starwars %>% filter(if_any(everything(), ~ !is.na(.x)))
## # A tibble: 87 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 C-3PO      167    75 <NA>        gold       yellow         112   none  mascu~
##  3 R2-D2       96    32 <NA>        white, bl~ red             33   none  mascu~
##  4 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  5 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  6 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  7 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  8 R5-D4       97    32 <NA>        white, red red             NA   none  mascu~
##  9 Biggs ~    183    84 black       light      brown           24   male  mascu~
## 10 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
## # ... with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
starwars %>% filter(if_all(everything(), ~ !is.na(.x)))
## # A tibble: 29 x 14
##    name    height  mass hair_color  skin_color eye_color birth_year sex   gender
##    <chr>    <int> <dbl> <chr>       <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke S~    172    77 blond       fair       blue            19   male  mascu~
##  2 Darth ~    202   136 none        white      yellow          41.9 male  mascu~
##  3 Leia O~    150    49 brown       light      brown           19   fema~ femin~
##  4 Owen L~    178   120 brown, grey light      blue            52   male  mascu~
##  5 Beru W~    165    75 brown       light      blue            47   fema~ femin~
##  6 Biggs ~    183    84 black       light      brown           24   male  mascu~
##  7 Obi-Wa~    182    77 auburn, wh~ fair       blue-gray       57   male  mascu~
##  8 Anakin~    188    84 blond       fair       blue            41.9 male  mascu~
##  9 Chewba~    228   112 brown       unknown    blue           200   male  mascu~
## 10 Han So~    180    80 brown       fair       brown           29   male  mascu~
## # ... with 19 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

row-wide operations

在tidyverse中,整洁数据一般都是每一行是一个观测,每一列是一个变量,基本上所有操作都是基于整洁的数据进行的,都是对某列做什么操作。但有时候我们也需要对某行做一些操作,dplyr中现在提供了rowwise()函数快速执行对行的操作。

简介

library(dplyr, warn.conflicts = FALSE)

rowwise()group_by()很像,本身不做任何操作,但是使用了rowwise之后,再和mutate()等函数连用时,就会变成按照行进行操作!

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
## # A tibble: 2 x 3
## # Rowwise:
##       x     y     z
##   <int> <int> <int>
## 1     1     3     5
## 2     2     4     6

假如你想分别计算每行的均值(只是一个例子),不使用rowwise()函数,得到的结果是所有数据的均值,很明显不是想要的:

df %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5   3.5
## 2     2     4     6   3.5

使用rowwise()之后,神奇的事情发生了,变成了按行操作!

df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
## # A tibble: 2 x 4
## # Rowwise:
##       x     y     z     m
##   <int> <int> <int> <dbl>
## 1     1     3     5     3
## 2     2     4     6     4
df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)
df
## # A tibble: 2 x 4
##   name       x     y     z
##   <chr>  <int> <int> <int>
## 1 Mara       1     3     5
## 2 Hadley     2     4     6

按照行计算均值:

df %>% rowwise() %>% summarise(m = mean(c(x, y, z)))
## # A tibble: 2 x 1
##       m
##   <dbl>
## 1     3
## 2     4

根据name这一列按照行计算均值:

df %>% rowwise(name) %>% summarise(m = mean(c(x, y, z)))
## `summarise()` has grouped output by 'name'. You can override using
## the `.groups` argument.
## # A tibble: 2 x 2
## # Groups:   name [2]
##   name       m
##   <chr>  <dbl>
## 1 Mara       3
## 2 Hadley     4

rowwise()可以看做是group_by()的特殊形式,本身也是对数据先进行聚合操作,所以如果要解除聚合,也要使用ungroup()函数。

对行进行汇总统计

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
## # A tibble: 6 x 5
##      id     w     x     y     z
##   <int> <int> <int> <int> <int>
## 1     1    10    20    30    40
## 2     2    11    21    31    41
## 3     3    12    22    32    42
## 4     4    13    23    33    43
## 5     5    14    24    34    44
## 6     6    15    25    35    45

接下来要进行按行操作了!

rf <- df %>% rowwise(id)

计算加和:

rf %>% mutate(total = sum(c(w, x, y, z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120
rf %>% summarise(total = sum(c(w, x, y, z)))
## `summarise()` has grouped output by 'id'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 2
## # Groups:   id [6]
##      id total
##   <int> <int>
## 1     1   100
## 2     2   104
## 3     3   108
## 4     4   112
## 5     5   116
## 6     6   120
rf %>% mutate(total = sum(c_across(w:z)))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120
rf %>% mutate(total = sum(c_across(where(is.numeric))))
## # A tibble: 6 x 6
## # Rowwise:  id
##      id     w     x     y     z total
##   <int> <int> <int> <int> <int> <int>
## 1     1    10    20    30    40   100
## 2     2    11    21    31    41   104
## 3     3    12    22    32    42   108
## 4     4    13    23    33    43   112
## 5     5    14    24    34    44   116
## 6     6    15    25    35    45   120

可以和列操作联合使用:

rf %>% mutate(total = sum(c_across(w:z))) %>% ungroup() %>% mutate(across(w:z, ~ . / total))
## # A tibble: 6 x 6
##      id     w     x     y     z total
##   <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1     1 0.1   0.2   0.3   0.4     100
## 2     2 0.106 0.202 0.298 0.394   104
## 3     3 0.111 0.204 0.296 0.389   108
## 4     4 0.116 0.205 0.295 0.384   112
## 5     5 0.121 0.207 0.293 0.379   116
## 6     6 0.125 0.208 0.292 0.375   120

可以和``rowSums()函数和rowMeans()`等函数联合使用。

list columns

motivation

df <- tibble(x = list(1, 2:3, 4:6)
)df
## # A tibble: 3 x 1
##   x
##   <list>
## 1 <dbl [1]>
## 2 <int [2]>
## 3 <int [3]>
df %>% mutate(l = length(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     3
## 2 <int [2]>     3
## 3 <int [3]>     3
df %>% mutate(l = lengths(x))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
df %>% mutate(l = sapply(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
df %>% mutate(l = purrr::map_int(x, length))
## # A tibble: 3 x 2
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
df %>% rowwise() %>% mutate(l = length(x))
## # A tibble: 3 x 2
## # Rowwise:
##   x             l
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3

subsetting

df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)
gf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Groups:   g [2]
##       g y         type  length
##   <int> <list>    <chr>  <int>
## 1     1 <int [3]> list       1
## 2     2 <chr [1]> list       1
rf %>% mutate(type = typeof(y), length = length(y))
## # A tibble: 2 x 4
## # Rowwise:  g
##       g y         type      length
##   <int> <list>    <chr>      <int>
## 1     1 <int [3]> integer        3
## 2     2 <chr [1]> character      1
# grouped
out1 <- integer(2)
for (i in 1:2) {out1[[i]] <- length(df$y[i])
}
out1
## [1] 1 1# rowwise
out2 <- integer(2)
for (i in 1:2) {out2[[i]] <- length(df$y[[i]])
}
out2
## [1] 3 1
gf %>% mutate(y2 = y)
## # A tibble: 2 x 3
## # Groups:   g [2]
##       g y         y2
##   <int> <list>    <list>
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>
rf %>% mutate(y2 = y)
## Error in `mutate()`:
## ! Problem while computing `y2 = y`.
## x `y2` must be size 1, not 3.
## i Did you mean: `y2 = list(y)` ?
## i The error occurred in row 1.
rf %>% mutate(y2 = list(y))
## # A tibble: 2 x 3
## # Rowwise:  g
##       g y         y2
##   <int> <list>    <list>
## 1     1 <int [3]> <int [3]>
## 2     2 <chr [1]> <chr [1]>

modeling

by_cyl <- mtcars %>% nest_by(cyl)
by_cyl
## # A tibble: 3 x 2
## # Rowwise:  cyl
##     cyl                data
##   <dbl> <list<tibble[,10]>>
## 1     4           [11 x 10]
## 2     6            [7 x 10]
## 3     8           [14 x 10]
mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
## # A tibble: 3 x 3
## # Rowwise:  cyl
##     cyl                data mod
##   <dbl> <list<tibble[,10]>> <list>
## 1     4           [11 x 10] <lm>
## 2     6            [7 x 10] <lm>
## 3     8           [14 x 10] <lm>
mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
## # A tibble: 3 x 4
## # Rowwise:  cyl
##     cyl                data mod    pred
##   <dbl> <list<tibble[,10]>> <list> <list>
## 1     4           [11 x 10] <lm>   <dbl [11]>
## 2     6            [7 x 10] <lm>   <dbl [7]>
## 3     8           [14 x 10] <lm>   <dbl [14]>
mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl  rmse
##   <dbl> <dbl>
## 1     4 3.01
## 2     6 0.985
## 3     8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 2
## # Groups:   cyl [3]
##     cyl   rsq
##   <dbl> <dbl>
## 1     4 0.509
## 2     6 0.465
## 3     8 0.423
mods %>% summarise(broom::glance(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 3 x 13
## # Groups:   cyl [3]
##     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     4     0.509         0.454  3.33      9.32  0.0137     1 -27.7   61.5  62.7
## 2     6     0.465         0.357  1.17      4.34  0.0918     1  -9.83  25.7  25.5
## 3     8     0.423         0.375  2.02      8.80  0.0118     1 -28.7   63.3  65.2
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mods %>% summarise(broom::tidy(mod))
## `summarise()` has grouped output by 'cyl'. You can override using the
## `.groups` argument.
## # A tibble: 6 x 6
## # Groups:   cyl [3]
##     cyl term        estimate std.error statistic    p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1     4 (Intercept)    39.6      4.35       9.10 0.00000777
## 2     4 wt             -5.65     1.85      -3.05 0.0137
## 3     6 (Intercept)    28.4      4.18       6.79 0.00105
## 4     6 wt             -2.78     1.33      -2.08 0.0918
## 5     8 (Intercept)    23.9      3.01       7.94 0.00000405
## 6     8 wt             -2.19     0.739     -2.97 0.0118

repeated function calls

simulations

df <- tribble(~ n, ~ min, ~ max,1,     0,     1,2,    10,   100,3,   100,  1000,
)
df %>% rowwise() %>% mutate(data = list(runif(n, min, max)))
## # A tibble: 3 x 4
## # Rowwise:
##       n   min   max data
##   <dbl> <dbl> <dbl> <list>
## 1     1     0     1 <dbl [1]>
## 2     2    10   100 <dbl [2]>
## 3     3   100  1000 <dbl [3]>
df %>% rowwise() %>% mutate(data = runif(n, min, max))
## Error in `mutate()`:
## ! Problem while computing `data = runif(n, min, max)`.
## x `data` must be size 1, not 2.
## i Did you mean: `data = list(runif(n, min, max))` ?
## i The error occurred in row 2.

multiple combinations

df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))df %>% rowwise() %>% mutate(data = list(rnorm(10, mean, sd)))
## # A tibble: 9 x 3
## # Rowwise:
##    mean    sd data
##   <dbl> <dbl> <list>
## 1    -1     1 <dbl [10]>
## 2     0     1 <dbl [10]>
## 3     1     1 <dbl [10]>
## 4    -1    10 <dbl [10]>
## 5     0    10 <dbl [10]>
## 6     1    10 <dbl [10]>
## 7    -1   100 <dbl [10]>
## 8     0   100 <dbl [10]>
## 9     1   100 <dbl [10]>

varying functions

df <- tribble(~rng,     ~params,"runif",  list(n = 10), "rnorm",  list(n = 20),"rpois",  list(n = 10, lambda = 5),
) %>%rowwise()df %>% mutate(data = list(do.call(rng, params)))
## # A tibble: 3 x 3
## # Rowwise:
##   rng   params           data
##   <chr> <list>           <list>
## 1 runif <named list [1]> <dbl [10]>
## 2 rnorm <named list [1]> <dbl [20]>
## 3 rpois <named list [2]> <int [10]>

R语言dplyr入门到进阶相关推荐

  1. R语言dplyr包数据列重排(reorder)实战:把特定数据列移动到第一列、把特定数据列移动到最后一列、数据列多列重排、按照字母顺序重排数据列、把数据列反序

    R语言dplyr包数据列重排(reorder)实战:把特定数据列移动到第一列.把特定数据列移动到最后一列.数据列多列重排.按照字母顺序重排数据列.把数据列反序 目录

  2. R语言dplyr包通过数据列的索引重命名数据列实战(Rename Column by Index Position)

    R语言dplyr包通过数据列的索引重命名数据列实战(Rename Column by Index Position) 目录 R语言dplyr包通过数据列的索引重命名数据列实战(Rename Colum ...

  3. R语言dplyr包使用bind_rows函数纵向合并两个dataframe(行生长)、使用bind_cols函数横向合并两个dataframe(列生长)

    R语言dplyr包使用bind_rows函数纵向合并两个dataframe(行生长).使用bind_cols函数横向合并两个dataframe(列生长) 目录

  4. R语言dplyr包的top_n函数返回dataframe或tibble的前N行数据、dplyr包的top_frac函数返回dataframe或tibble的前百分之N(N%)的数据

    R语言dplyr包的top_n函数返回dataframe或tibble的前N行数据.dplyr包的top_frac函数返回dataframe或tibble的前百分之N(N%)的数据 目录

  5. R语言dplyr包数据过滤(filter)基于not in规则实战(not in Filter):基于单数据列not in规则过滤数据行、基于多数据列not in规则过滤数据行

    R语言dplyr包数据过滤(filter)基于not in规则实战(not in Filter):基于单数据列not in规则过滤数据行.基于多数据列not in规则过滤数据行 目录

  6. R语言dplyr包的mutate函数将列添加到dataframe中或者修改现有的数据列:基于条件判断创建布尔型指示变量、将异常离散编码转化为NA值

    R语言dplyr包的mutate函数将列添加到dataframe中或者修改现有的数据列:基于条件判断创建布尔型指示变量.将异常离散编码转化为NA值 目录

  7. R语言dplyr包的mutate函数将列添加到dataframe中或者修改现有的数据列:使用na_if()函数将0值替换为NA值、负收入替换为NA值

    R语言dplyr包的mutate函数将列添加到dataframe中或者修改现有的数据列:使用na_if()函数将0值替换为NA值.负收入替换为NA值 目录

  8. R语言dplyr包为dataframe添加数据列实战( Add Columns):基于mutate()函数添加一个或者多个数据列(尾部添加、头部添加、条件生成、某个具体数据列的前后)

    R语言dplyr包为dataframe添加数据列实战( Add Columns):基于mutate()函数添加一个或者多个数据列(尾部添加.头部添加.条件生成.某个具体数据列的前后) 目录

  9. R语言dplyr包和tidyr包创建交叉表(列联表、crosstab)实战

    R语言dplyr包和tidyr包创建交叉表(列联表.crosstab)实战 目录 R语言dplyr包和tidyr包创建交叉表(列联表.crosstab)实战

  10. R语言dplyr包将dataframe中的NA值替换(replace)为0实战:所有NA值替换(replace)为0、具体列的NA值替换(replace)为0、若干列的NA值替换(replace)为0

    R语言dplyr包将dataframe中的NA值替换(replace)为0实战:所有NA值替换(replace)为0.具体列的NA值替换(replace)为0.若干列的NA值替换(replace)为0 ...

最新文章

  1. P2949 [USACO09OPEN]Work Scheduling (后悔法,贪心)难度⭐⭐⭐
  2. django-pagination 样式修改
  3. GPU Gems翻译
  4. CSS网页制作布局实例教程
  5. 【NLP傻瓜式教程】手把手带你RCNN文本分类(附代码)
  6. java zmq zmsg,zmq 识别多个客户端
  7. java中多个输入框搜索_如何在一个搜索框中输入多个字段的值进行查询?
  8. java 格式化小数_java-如何格式化小数位数精度
  9. 【转】将QT开发的界面程序封装成DLL,在VC中成功调用
  10. Python实现恋爱AA公式
  11. 初次使用uwsgi:no python application found, check your startup logs for errors
  12. 美团 | 搜索多业务商品排序探索与实践
  13. go对get、post请求封装
  14. 【SQL Server】入门教程-基础篇(二)
  15. FreeSWITCH channel variables(信道变量)
  16. 计算机网络管理员绩效考核,绩效考核管理系统是如何发挥作用的?
  17. balabala视频格式转换器
  18. 微信公众号如何变得更好_五个使Windows变得更好的绝对必要的实用程序
  19. 测绘资质申请条件及办理资质要求
  20. [Java]窗口的跳转

热门文章

  1. 如何用python实现爬虫自动爬取百度图片原图
  2. lol1.7更新服务器维护,lol12月20日维护公告 v3.0.7.1版本更新内容一览
  3. 企业如何推动组织变革?
  4. java程序员创业需要_java程序员出路有哪些
  5. CyanogenMod 7
  6. 无人驾驶失败案例汇总
  7. 《孙子兵法》十三篇注译(1--导读)
  8. QQ三国七旗阵等级怎么计算?(附Excel计算器,输入队员等级即可计算)
  9. 理解openssl协议:x509、crt、cer、key、csr、ssl、tls 这些都是什么鬼? 如何给自己网站颁发证书?
  10. Python零基础学习笔记(三十三)—— 窗体的控制...