R XML - combining parent and child nodes into data frame

孤独总比滥情好 2020-12-06 15:47
I have xml like this:

      
      
        
3 answers

甜味超标 (original poster) 2020-12-06 16:24

Here are some options, using xml2 for XML handling and the tidyverse for munging. For each race node, the attributes (xml_attrs returns a named character vector), child node names, and child node values can be read into a three-element list, which map_dfr row-binds into a data frame of list columns:

library(tidyverse)
library(xml2)

x <- read_xml('races.xml')

races <- x %>% 
    xml_find_all('//race') %>% 
    map_dfr(~list(attrs = list(xml_attrs(.x)), 
                  variable = list(map(xml_children(.x), xml_name)), 
                  value = list(map(xml_children(.x), xml_text))))

races
#> # A tibble: 29 x 3
#>    attrs     variable    value      
#>    <list>    <list>      <list>
#> # ... (29 rows, each cell a list)

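That tibble can in turn be cleaned up with a lot of tidyr:

races_tidy <- races %>% 
    mutate(attr_names = map(attrs, names)) %>%    # keep attribute names next to their values
    # tidyr < 1.0 syntax; tidyr >= 1.0 expects unnest(c(attr_names, attrs)) and deprecates .drop
    unnest(attr_names, attrs, .drop = FALSE) %>%  # one row per attribute, other list columns kept
    spread(attr_names, attrs) %>%                 # attributes become columns
    unnest(variable, value) %>%                   # one row per child node (cells are still length-1 lists)
    unnest(variable, value) %>%                   # flatten those length-1 lists to character
    spread(variable, value) %>%                   # child nodes become columns
    type_convert()    # fix variable types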

This works, but the unnesting and spreading is fragile. Writing a more robust method is actually not too much more work, though, as you can just arrange the list columns before unnesting:

races_tidy2 <- races %>% 
    mutate(attrs = map(attrs, ~as_tibble(as.list(.x))), 
           data = map2(variable, value, ~as_tibble(set_names(.y, .x)))) %>% 
    unnest(attrs, data, .drop = TRUE) %>% 
    type_convert()
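
To make the rearranging step concrete, here is a small sketch of what those two map calls build for a single race. The values are made up for illustration (they are not from races.xml), and the child node names are given as a plain character vector for simplicity:

```r
library(tibble)
library(purrr)

# Made-up stand-ins for one <race> node
attrs    <- c(id = "1", type = "C")          # shape of xml_attrs(): a named character vector
variable <- c("title", "distance")           # child node names
value    <- list("Example Chase", "2m4f")    # child node text

attr_row <- as_tibble(as.list(attrs))               # 1-row tibble with columns id, type
data_row <- as_tibble(set_names(value, variable))   # 1-row tibble with columns title, distance

# Side by side, the two pieces form one complete row for the race
one_race <- dplyr::bind_cols(attr_row, data_row)
```

Because every list cell is already a one-row tibble, unnesting only has to bind rows, with no spreading needed afterwards.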

The most direct approach is to do the rearranging while iterating over the nodes. This is the most concise and likely the most efficient approach, but getting it right relies on careful manipulation of the data structures, so writing working code may take longer.

races_tidy3 <- x %>% 
    xml_find_all('//race') %>% 
    map_dfr(~flatten(c(xml_attrs(.x), 
                       map(xml_children(.x), 
                           ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
    type_convert()

races_tidy3
#> # A tibble: 29 x 21
#>        id perf… perf… deta… race… time  date       ampm  title type  dist…
#>                   
#>  1 692415         1 R     12:25 2018-01-13 pm    Adar… C     2m4f 
#>  2 692416         1 R     01:00 2018-01-13 pm    Tota… C     2m4f 
#>  3 692417         1 R     01:35 2018-01-13 pm    Conn… C     3m1f 
#>  4 692418         1 R     02:10 2018-01-13 pm    Sky … H     2m   
#>  5 692419         1 R     02:45 2018-01-13 pm    Spor… H     2m   
#>  6 692420         1 R     03:20 2018-01-13 pm    Lein… H     2m4f…
#>  7 692421         1 R     03:50 2018-01-13 pm    Davi… B     2m   
#>  8 691061         1 R     12:40 2018-01-13 pm    Betf… H     2m   
#>  9 691060         1 R     01:15 2018-01-13 pm    Betf… C     2m54y
#> 10 691058         1 R     01:50 2018-01-13 pm    Betf… C     3m   
#> # ... with 19 more rows, and 10 more variables: group, tipsAllowed,
#> #   predictorAllowed, bettingLink, declaredRunners, liveCommentary,
#> #   liveTab, raceDescription, tvText, betOffers

All three approaches return the same data, though the column order is different for races_tidy.

all_equal(races_tidy, races_tidy2)
#> [1] TRUE

identical(races_tidy2, races_tidy3)
#> [1] TRUE
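
Since races.xml itself is not reproduced here, the following is a minimal self-contained sketch of the node-by-node idea on a made-up document. The XML content and the helper name race_to_row are assumptions for illustration only, and the sketch assumes every child of a race is a leaf node whose name does not clash with an attribute name:

```r
library(xml2)
library(purrr)
library(tibble)

# Made-up document with the same general shape: attributes plus leaf children
doc <- read_xml('<meeting>
  <race id="101" type="C">
    <title>Example Chase</title>
    <distance>2m4f</distance>
  </race>
  <race id="102" type="H">
    <title>Example Hurdle</title>
    <distance>2m</distance>
  </race>
</meeting>')

# Hypothetical helper: turn one <race> node into a one-row tibble
race_to_row <- function(race) {
  kids <- xml_children(race)
  # Attributes first, then one named entry per child node
  fields <- c(xml_attrs(race), set_names(xml_text(kids), xml_name(kids)))
  as_tibble(as.list(fields))
}

races_small <- doc %>%
  xml_find_all("//race") %>%
  map(race_to_row) %>%
  dplyr::bind_rows() %>%      # rows are matched by column name
  readr::type_convert()       # fix variable types, as in the answer above
```

Binding one-row tibbles by name means a race that lacks some child node simply gets NA in that column, rather than shifting values into the wrong place.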

    
             
                                                        
            
            
              
                