R XML - combining parent and child nodes into data frame

孤独总比滥情好 2020-12-06 15:47

I have xml like this:




      

        
3 Answers
  •  甜味超标
    2020-12-06 16:24

    Here are some options using xml2 for XML handling and the tidyverse for munging. The attributes (xml_attrs returns a named character vector), node names, and node values can be read into a three-element list per node, which can then be coerced to a data frame:

    library(tidyverse)
    library(xml2)
    
    x <- read_xml('races.xml')
    
    races <- x %>% 
        xml_find_all('//race') %>% 
        map_dfr(~list(attrs = list(xml_attrs(.x)), 
                      variable = list(map(xml_children(.x), xml_name)), 
                      value = list(map(xml_children(.x), xml_text))))
    
    races
    #> # A tibble: 29 x 3
    #>    attrs     variable    value      
    #>    <list>    <list>      <list>     
    #>  1   
    #>  2   
    #>  3   
    #>  4   
    #>  5   
    #>  6   
    #>  7   
    #>  8   
    #>  9   
    #> 10   
    #> # ... with 19 more rows
    

    which can in turn be cleaned up with a lot of tidyr:

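    # Written against the pre-1.0 tidyr interface (several bare columns, .drop)
    races_tidy <- races %>% 
        mutate(attr_names = map(attrs, names)) %>% 
        unnest(attr_names, attrs, .drop = FALSE) %>%    # one row per attribute
        spread(attr_names, attrs) %>%                   # attributes back to columns
        unnest(variable, value) %>%                     # one row per child node (still list columns)
        unnest(variable, value) %>%                     # second pass flattens them to character
        spread(variable, value) %>%                     # child nodes to columns
        type_convert()    # fix variable types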
    

    This works, but the repeated unnesting and spreading is fragile. A more robust method is not much more work, though: turn each list column into a one-row tibble before unnesting:

    races_tidy2 <- races %>% 
        mutate(attrs = map(attrs, ~as_tibble(as.list(.x))), 
               data = map2(variable, value, ~as_tibble(set_names(.y, .x)))) %>% 
        unnest(attrs, data, .drop = TRUE) %>% 
        type_convert()
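
To see what the two map() calls build for a single race, here is a minimal sketch on hand-made inputs. The values below (id, status, time, title) are made up for illustration rather than taken from the question's file, and `variable` is simplified to a plain character vector.

```r
library(tibble)
library(purrr)
library(dplyr)

# Made-up stand-ins for the contents of one row of `races`
attrs    <- c(id = "1", status = "R")   # shape of xml_attrs() output for one node
variable <- c("time", "title")          # child node names (simplified to character)
value    <- list("12:25", "First")      # child node text

# Each piece becomes a one-row tibble with one column per name
attrs_tbl <- as_tibble(as.list(attrs))
data_tbl  <- as_tibble(set_names(value, variable))

# Side by side they form a single wide row for this race
wide_row <- bind_cols(attrs_tbl, data_tbl)
```

Because every race is already one wide row at this point, no spread() step is needed afterwards.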
    

    The most direct approach is to do the rearranging while iterating over the nodes. This is the most concise and likely the most efficient approach, but getting it right relies on careful manipulation of the data structures, so writing working code may take longer.

    races_tidy3 <- x %>% 
        xml_find_all('//race') %>% 
        map_dfr(~flatten(c(xml_attrs(.x), 
                           map(xml_children(.x), 
                               ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
        type_convert()
    
    races_tidy3
    #> # A tibble: 29 x 21
    #>        id perf… perf… deta… race… time  date       ampm  title type  dist…
    #>                   
    #>  1 692415         1 R     12:25 2018-01-13 pm    Adar… C     2m4f 
    #>  2 692416         1 R     01:00 2018-01-13 pm    Tota… C     2m4f 
    #>  3 692417         1 R     01:35 2018-01-13 pm    Conn… C     3m1f 
    #>  4 692418         1 R     02:10 2018-01-13 pm    Sky … H     2m   
    #>  5 692419         1 R     02:45 2018-01-13 pm    Spor… H     2m   
    #>  6 692420         1 R     03:20 2018-01-13 pm    Lein… H     2m4f…
    #>  7 692421         1 R     03:50 2018-01-13 pm    Davi… B     2m   
    #>  8 691061         1 R     12:40 2018-01-13 pm    Betf… H     2m   
    #>  9 691060         1 R     01:15 2018-01-13 pm    Betf… C     2m54y
    #> 10 691058         1 R     01:50 2018-01-13 pm    Betf… C     3m   
    #> # ... with 19 more rows, and 10 more variables: group , tipsAllowed
    #> #   , predictorAllowed , bettingLink , declaredRunners
    #> #   , liveCommentary , liveTab , raceDescription ,
    #> #   tvText , betOffers 
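
Since the question's XML is not reproduced here, the same idea can be tried on a small inline document. This is only a sketch under assumptions: the element and attribute names are hypothetical, `race_to_row()` is a helper introduced for illustration, and base `lapply()` with `dplyr::bind_rows()` stands in for `map_dfr()` and `flatten()`.

```r
library(xml2)
library(dplyr)

# Hypothetical stand-in for races.xml; the second race has no <title> child
doc <- read_xml('<meeting>
  <race id="1" status="R"><time>12:25</time><title>First</title></race>
  <race id="2" status="R"><time>01:00</time></race>
</meeting>')

# Illustrative helper: one race node -> one named list (attributes, then children)
race_to_row <- function(race) {
  kids <- xml_children(race)
  c(as.list(xml_attrs(race)),
    setNames(as.list(xml_text(kids)), xml_name(kids)))
}

# One named list per race, stacked into a tibble;
# a child that is missing from a race is filled with NA
toy_races <- bind_rows(lapply(xml_find_all(doc, "//race"), race_to_row))
```

All columns come back as character here; a final `type_convert()` would guess better types, as in the code above.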
    

    All return the same data, though column order is different for races_tidy.

    all_equal(races_tidy, races_tidy2)
    #> [1] TRUE
    
    identical(races_tidy2, races_tidy3)
    #> [1] TRUE
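
The two checks differ because `identical()` is sensitive to column order, while `all_equal()` ignores it by default. A small sketch with made-up tibbles shows the effect; an explicit column reorder plus `identical()` is used here as a stand-in for `all_equal()`, which recent dplyr releases mark as deprecated.

```r
library(tibble)

# Made-up data: same values, different column order
a <- tibble(id = c(1L, 2L), title = c("First", "Second"))
b <- tibble(title = c("First", "Second"), id = c(1L, 2L))

same_as_is     <- identical(a, b)             # compares columns in their current order
same_reordered <- identical(a, b[names(a)])   # compares after aligning b's columns to a's
```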
    
