tidyr

Tidyr(https://tidyr.tidyverse.org/) 是数据处理经常使用的package,主要用数据形状的改变,处理缺失值等,一般情况作为dplyr的上游处理步骤。

主要功能

  • 长宽数据转换 (pivot_longer pivot_wider)
  • 数据分割与合并 (separate unite)
  • 缺失数据处理 (drop_na fill)
  • 嵌套数据处理 (nest unnest unnest_wider hoist)

长宽数据转换

library(dplyr)
library(tibble)
library(tidyr)
library(knitr)

mtcars = mtcars %>% rownames_to_column('name')

print_df = function(df){
  kable(df)
}

print_df(head(mtcars))
name mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

宽变长 pivot_longer

这里相当于是把数据变窄变长了

long_data = mtcars %>% 
  pivot_longer(
    -name,
    names_to = 'feature',
    values_to = 'value'
  )

print_df(head(long_data))
name feature value
Mazda RX4 mpg 21.00
Mazda RX4 cyl 6.00
Mazda RX4 disp 160.00
Mazda RX4 hp 110.00
Mazda RX4 drat 3.90
Mazda RX4 wt 2.62

可以看到原来的属性名mpg、cyl等都被合并到了feature列,对应的值都被合并到了value列。而且信息是一样的,没有损失。
所以pivot_longer可以理解为把数据变长,那变长会吧原来的列整合成新的列,所以需要指定新的列名和值的列名。

长变宽 pivot_wider

进行与上相反的操作,把刚才变长的数据恢复原来的样子。

origin_data = long_data %>% 
  pivot_wider(
    names_from = feature,
    values_from = value
  )

print_df(head(origin_data))
name mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

这里变宽可以理解为把一列拆分成多列,因此需要传入参数新列名称的来源,新列值的来源。
细心的同学应该还会看到,pivot_wider里传递变量的名字和pivot_longer不同,后者变量名加了引号,可以简单记忆成如果该列在原有数据中已经存在就不需引号,否则加引号。

更多长宽转换例子

scores = data.frame(
  school = c("A", "B", "A" ,"B" ,"A", "B"),
  grade = c("grade_1","grade_2","grade_2","grade_2","grade_1","grade_1"),
  gender = c("F", "F", "M", "M", "M", "F"),
  eng = c(90, 98, 78, 73, 82, 94),
  math = c(99, 100, 87, 89, 93, 99),
  physical = c(76, 78, 82, 83, 68, 74)
)

print_df(scores)
school grade gender eng math physical
A grade_1 F 90 99 76
B grade_2 F 98 100 78
A grade_2 M 78 87 82
B grade_2 M 73 89 83
A grade_1 M 82 93 68
B grade_1 F 94 99 74

把学生成绩合并成一列

long_data =  scores %>% 
    pivot_longer(
      -c(school, grade, gender),
      names_to = "subject",
      values_to = "score"
    )

print_df(long_data)
school grade gender subject score
A grade_1 F eng 90
A grade_1 F math 99
A grade_1 F physical 76
B grade_2 F eng 98
B grade_2 F math 100
B grade_2 F physical 78
A grade_2 M eng 78
A grade_2 M math 87
A grade_2 M physical 82
B grade_2 M eng 73
B grade_2 M math 89
B grade_2 M physical 83
A grade_1 M eng 82
A grade_1 M math 93
A grade_1 M physical 68
B grade_1 F eng 94
B grade_1 F math 99
B grade_1 F physical 74
# 或者通过使用下面的方式实现同样的效果,和上面的区别在于前者是指定保留哪些列,这里是指定保留哪些列
# 指定的保持不变的列不会改变,只会在原有基础上进行(xing)行(hang)的复制扩充
# scores %>% 
#     pivot_longer(
#       c(eng, math, physical),
#       names_to = "subject",
#       values_to = "score"
#     )

把学校信息拆分成两列

print_df(
  
  long_data %>% 
  pivot_wider(
    names_from = school,
    values_from = score
  )
  
)
grade gender subject A B
grade_1 F eng 90 94
grade_1 F math 99 99
grade_1 F physical 76 74
grade_2 F eng NA 98
grade_2 F math NA 100
grade_2 F physical NA 78
grade_2 M eng 78 73
grade_2 M math 87 89
grade_2 M physical 82 83
grade_1 M eng 82 NA
grade_1 M math 93 NA
grade_1 M physical 68 NA

把年级信息拆分成两列

print_df(
  
  long_data %>% 
  pivot_wider(
    names_from = grade,
    values_from = score
  )
  
)
school gender subject grade_1 grade_2
A F eng 90 NA
A F math 99 NA
A F physical 76 NA
B F eng 94 98
B F math 99 100
B F physical 74 78
A M eng 82 78
A M math 93 87
A M physical 68 82
B M eng NA 73
B M math NA 89
B M physical NA 83

把学校和年级信息拆分成两列

print_df(
  
  long_data %>% 
  pivot_wider(
    names_from = c(school, grade),
    values_from = score
  )
  
)
gender subject A_grade_1 B_grade_2 A_grade_2 B_grade_1
F eng 90 98 NA 94
F math 99 100 NA 99
F physical 76 78 NA 74
M eng 82 73 78 NA
M math 93 89 87 NA
M physical 68 83 82 NA

把学校和年级信息拆分成两列,使用names_glue参数自定义列名规则

print_df(
  
  long_data %>% 
  pivot_wider(
    names_from = c(school, grade),
    values_from = score,
    names_glue = "{grade}@{school}"
  )
  
)
gender subject grade_1@A grade_2@B grade_2@A grade_1@B
F eng 90 98 NA 94
F math 99 100 NA 99
F physical 76 78 NA 74
M eng 82 73 78 NA
M math 93 89 87 NA
M physical 68 83 82 NA

数据分割与合并

数据分割

把原来的grade分割成两列, 新的grade只保留数字

print_df(
  
  scores %>% 
  separate(grade, into = c("trash", "grade"))
  
)
school trash grade gender eng math physical
A grade 1 F 90 99 76
B grade 2 F 98 100 78
A grade 2 M 78 87 82
B grade 2 M 73 89 83
A grade 1 M 82 93 68
B grade 1 F 94 99 74

数据合并

new_data = scores %>% 
    unite("score", eng, math, physical) %>% 
    mutate(subject = "eng_math_physical")

print_df(
  new_data
)
school grade gender score subject
A grade_1 F 90_99_76 eng_math_physical
B grade_2 F 98_100_78 eng_math_physical
A grade_2 M 78_87_82 eng_math_physical
B grade_2 M 73_89_83 eng_math_physical
A grade_1 M 82_93_68 eng_math_physical
B grade_1 F 94_99_74 eng_math_physical

数据分割

针对上面的数据,如果不知道score是几个科目的拼接时,由于separate需要指定分割之后的名称,这里就会很麻烦。尤其是当有的学生有4科成绩,有的是3科时更麻烦,不过separate_rows()能够解决这个问题。
这个函数能够把两列按照同样规则拼接起来的数据分割成多行。 仔细观察这个数据和上面数据的对应关系。

print_df(
  
  new_data %>% 
  separate_rows(subject, score)
  
)
school grade gender score subject
A grade_1 F 90 eng
A grade_1 F 99 math
A grade_1 F 76 physical
B grade_2 F 98 eng
B grade_2 F 100 math
B grade_2 F 78 physical
A grade_2 M 78 eng
A grade_2 M 87 math
A grade_2 M 82 physical
B grade_2 M 73 eng
B grade_2 M 89 math
B grade_2 M 83 physical
A grade_1 M 82 eng
A grade_1 M 93 math
A grade_1 M 68 physical
B grade_1 F 94 eng
B grade_1 F 99 math
B grade_1 F 74 physical

缺失数据处理

new_data =   long_data %>% 
  pivot_wider(
    names_from = grade,
    values_from = score
  )

print_df(
  new_data
)
school gender subject grade_1 grade_2
A F eng 90 NA
A F math 99 NA
A F physical 76 NA
B F eng 94 98
B F math 99 100
B F physical 74 78
A M eng 82 78
A M math 93 87
A M physical 68 82
B M eng NA 73
B M math NA 89
B M physical NA 83

可以看到这里有些数据是缺失的

删除缺失的行

默认只要一行出现缺失值就删除

print_df(
  new_data %>% 
  drop_na()
)
school gender subject grade_1 grade_2
B F eng 94 98
B F math 99 100
B F physical 74 78
A M eng 82 78
A M math 93 87
A M physical 68 82

指定按照某些列是否缺失进行删除

print_df(
  new_data %>% 
  drop_na(grade_1)
)
school gender subject grade_1 grade_2
A F eng 90 NA
A F math 99 NA
A F physical 76 NA
B F eng 94 98
B F math 99 100
B F physical 74 78
A M eng 82 78
A M math 93 87
A M physical 68 82

填充缺失值

使用fill填充缺失值,可以规定填充的方向。 “down”代表使用上面的填充,“up”代表使用下面的填充

print_df(
new_data %>% 
  fill(grade_1, .direction = "down")
)
school gender subject grade_1 grade_2
A F eng 90 NA
A F math 99 NA
A F physical 76 NA
B F eng 94 98
B F math 99 100
B F physical 74 78
A M eng 82 78
A M math 93 87
A M physical 68 82
B M eng 68 73
B M math 68 89
B M physical 68 83

替换缺失值

分别使用各自平均分替换缺失值

mean_grade_1 = mean(new_data$grade_1, na.rm = T)
mean_grade_2 = mean(new_data$grade_2, na.rm = T)


print_df(
  
  new_data %>% 
  replace_na(list(
    grade_1 = mean_grade_1, 
    grade_2 = mean_grade_2
  ))
  
)
school gender subject grade_1 grade_2
A F eng 90.00000 85.33333
A F math 99.00000 85.33333
A F physical 76.00000 85.33333
B F eng 94.00000 98.00000
B F math 99.00000 100.00000
B F physical 74.00000 78.00000
A M eng 82.00000 78.00000
A M math 93.00000 87.00000
A M physical 68.00000 82.00000
B M eng 86.11111 73.00000
B M math 86.11111 89.00000
B M physical 86.11111 83.00000