Tidying datasets with multiple sections/headers at variable positions

前端未结

关注

 4  1672

轻奢々 2021-01-22 16:33

Context

I am trying to read in and tidy an excel file with multiple headers/sections placed at variable positions. The content of these headers need to

4条回答

忘掉有多难 (楼主)

2021-01-22 17:14

Here is an option based on creating a group based on the us.cities dataset from maps by matching the elements in 'city' with the 'name' column from 'us.cities' to create a group, and then create the first element of 'col1' as 'city', delete the first row (slice(-1))

library(maps)
library(dplyr)
library(stringr)
df %>% 
   group_by(grp = cumsum(str_detect(col1,str_c("\\b(", 
        str_c(word(us.cities$name, 1), collapse="|"), ")\\b")))) %>% 
   mutate(city = first(col1)) %>% 
   slice(-1) %>% 
   ungroup %>% 
   select(city, type = col1, value = col2)
# A tibble: 7 x 3
#  city    type     value
#         
#1 Seattle Diesel      80
#2 Seattle Gasoline    NA
#3 Seattle LPG         10
#4 Seattle Electric    10
#5 Boston  Diesel      65
#6 Boston  Gasoline    25
#7 Boston  Electric    10

Or another option is using str_extract instead of grouping and then fill as in the other post

df %>% 
   mutate(city = str_extract(col1, str_c("\\b(", 
     str_c(word(us.cities$name, 1), collapse="|"), ")\\b"))) %>% 
   fill(city) %>% 
   filter(col1 != city) %>% 
   select(city, type = col1, value = col2)

NOTE: This would also work if there are 100s of other elements in 'col1' besides the 'city'. Here, we considered only the US cities, if it also includes cities from other countries, use world.cities data from the same package

0 讨论(0)

查看其它4个回答