Tidying datasets with multiple sections/headers at variable positions

轻奢々 2021-01-22 16:33

Context

I am trying to read in and tidy an Excel file with multiple headers/sections placed at variable positions. The content of these headers needs to

4 answers
  •  忘掉有多难
    2021-01-22 17:14

    Here is an option that uses the us.cities dataset from the maps package. Match the elements of 'col1' against the first word of the 'name' column in 'us.cities'; each match marks a city header row and starts a new group. Within each group, take the first element of 'col1' as 'city', then delete the first row, which is the header itself (slice(-1)).

    library(maps)
    library(dplyr)
    library(stringr)
    df %>% 
       group_by(grp = cumsum(str_detect(col1,str_c("\\b(", 
            str_c(word(us.cities$name, 1), collapse="|"), ")\\b")))) %>% 
       mutate(city = first(col1)) %>% 
       slice(-1) %>% 
       ungroup %>% 
       select(city, type = col1, value = col2)
    # A tibble: 7 x 3
    #  city    type     value
    #         
    #1 Seattle Diesel      80
    #2 Seattle Gasoline    NA
    #3 Seattle LPG         10
    #4 Seattle Electric    10
    #5 Boston  Diesel      65
    #6 Boston  Gasoline    25
    #7 Boston  Electric    10
    
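The grouping step can be looked at on its own. Below is a minimal sketch, assuming sample data reconstructed from the printed result above (the NA values on the city rows are a guess) and a hypothetical `cities` vector standing in for `word(us.cities$name, 1)` so that it runs without the maps package.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data, reconstructed from the printed result above;
# the NA values on the city (header) rows are an assumption.
df <- tibble(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
)

# Hypothetical stand-in for word(us.cities$name, 1).
cities  <- c("Seattle", "Boston")
pattern <- str_c("\\b(", str_c(cities, collapse = "|"), ")\\b")

# str_detect() is TRUE only on header rows; cumsum() turns those flags
# into a running section number that every following row inherits.
steps <- df %>%
  mutate(is_city = str_detect(col1, pattern),
         grp     = cumsum(is_city))

print(steps)
```

Every row between two headers gets the same `grp` value, which is why `first(col1)` inside each group returns the city name and `slice(-1)` drops only the header row.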

    Another option is to use str_extract instead of grouping, and then fill the city name downward, as in the other post:

    library(tidyr)   # fill() comes from tidyr
    df %>% 
       mutate(city = str_extract(col1, str_c("\\b(", 
         str_c(word(us.cities$name, 1), collapse="|"), ")\\b"))) %>% 
       fill(city) %>% 
       filter(col1 != city) %>% 
       select(city, type = col1, value = col2)
    
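The same approach can be tried end to end without the maps package. This is only a sketch: the sample `df` is reconstructed from the result printed earlier (the NA values on the city rows are assumed), and the `cities` vector is a hypothetical replacement for `word(us.cities$name, 1)`.

```r
library(dplyr)
library(stringr)
library(tidyr)   # fill()

# Hypothetical sample data; NA on the header rows is an assumption.
df <- tibble(
  col1 = c("Seattle", "Diesel", "Gasoline", "LPG", "Electric",
           "Boston", "Diesel", "Gasoline", "Electric"),
  col2 = c(NA, 80, NA, 10, 10, NA, 65, 25, 10)
)

# Hypothetical stand-in for word(us.cities$name, 1).
cities  <- c("Seattle", "Boston")
pattern <- str_c("\\b(", str_c(cities, collapse = "|"), ")\\b")

# str_extract() yields the city name on header rows and NA elsewhere;
# fill() then carries the last seen city down to the rows below it.
labelled <- df %>%
  mutate(city = str_extract(col1, pattern)) %>%
  fill(city)

# Header rows are the ones where col1 equals the filled city; drop them.
result <- labelled %>%
  filter(col1 != city) %>%
  select(city, type = col1, value = col2)

print(result)
```

Compared with the grouped version, this variant needs no `group_by()`/`ungroup()` pair, at the cost of loading tidyr for `fill()`.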

    NOTE: This would also work if 'col1' contains hundreds of other elements besides the city names. Here, only US cities are considered; if the data also includes cities from other countries, use the world.cities dataset from the same package.
