I have xml like this:
Here are some options using xml2 for XML handling and the tidyverse for munging. For each race node, the attributes (xml_attrs returns a named character vector), the child node names, and the child node values can be read into a three-element list, and those lists can be coerced to a data frame:
library(tidyverse)
library(xml2)
x <- read_xml('races.xml')
races <- x %>%
    xml_find_all('//race') %>%
    map_dfr(~list(attrs = list(xml_attrs(.x)),
                  variable = list(map(xml_children(.x), xml_name)),
                  value = list(map(xml_children(.x), xml_text))))
races
#> # A tibble: 29 x 3
#>    attrs  variable value
#>    <list> <list>   <list>
#> # (29 rows; each cell holds one race's attributes, child node names, or child node values)
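To make the three pieces concrete, here is a minimal sketch of what the xml2 accessors return for a single node. The inline document is a hypothetical miniature (the real races.xml is not reproduced here), so the element and attribute names are assumptions for illustration only.

```r
library(xml2)

# Hypothetical miniature of the input; not the real races.xml.
doc <- read_xml('<meeting>
  <race id="1" type="C"><time>12:25</time><title>First</title></race>
  <race id="2" type="H"><time>01:00</time><title>Second</title></race>
</meeting>')

race <- xml_find_first(doc, '//race')
kids <- xml_children(race)

attrs    <- xml_attrs(race)   # named character vector: id = "1", type = "C"
variable <- xml_name(kids)    # child node names: "time", "title"
value    <- xml_text(kids)    # child node text: "12:25", "First"
```

Because xml_name and xml_text are vectorised over a node set, the names and values line up position by position, which is what the later reshaping relies on.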
The races tibble can in turn be cleaned up with a lot of tidyr:
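races_tidy <- races %>%
    mutate(attr_names = map(attrs, names)) %>%
    unnest(attr_names, attrs, .drop = FALSE) %>%    # one row per attribute
    spread(attr_names, attrs) %>%    # one column per attribute
    unnest(variable, value) %>%    # variable and value are lists of lists,
    unnest(variable, value) %>%    # so they are unnested twice
    spread(variable, value) %>%    # one column per child node
    type_convert()    # fix variable types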
This works, but the unnesting and spreading are fragile. A more robust method is not much more work, though: arrange each list column into one-row tibbles first, so that a single unnest does the rest:
races_tidy2 <- races %>%
    mutate(attrs = map(attrs, ~as_tibble(as.list(.x))),
           data = map2(variable, value, ~as_tibble(set_names(.y, .x)))) %>%
    unnest(attrs, data, .drop = TRUE) %>%
    type_convert()
The most direct approach is to do the rearranging while iterating over the nodes. This is the most concise and likely the most efficient approach, but getting it right relies on careful manipulation of the data structures, so writing working code may take longer.
races_tidy3 <- x %>%
    xml_find_all('//race') %>%
    map_dfr(~flatten(c(xml_attrs(.x),
                       map(xml_children(.x),
                           ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
    type_convert()
races_tidy3
#> # A tibble: 29 x 21
#> id perf… perf… deta… race… time date ampm title type dist…
#>
#> 1 692415 1 R 12:25 2018-01-13 pm Adar… C 2m4f
#> 2 692416 1 R 01:00 2018-01-13 pm Tota… C 2m4f
#> 3 692417 1 R 01:35 2018-01-13 pm Conn… C 3m1f
#> 4 692418 1 R 02:10 2018-01-13 pm Sky … H 2m
#> 5 692419 1 R 02:45 2018-01-13 pm Spor… H 2m
#> 6 692420 1 R 03:20 2018-01-13 pm Lein… H 2m4f…
#> 7 692421 1 R 03:50 2018-01-13 pm Davi… B 2m
#> 8 691061 1 R 12:40 2018-01-13 pm Betf… H 2m
#> 9 691060 1 R 01:15 2018-01-13 pm Betf… C 2m54y
#> 10 691058 1 R 01:50 2018-01-13 pm Betf… C 3m
#> # ... with 19 more rows, and 10 more variables: group, tipsAllowed,
#> #   predictorAllowed, bettingLink, declaredRunners, liveCommentary,
#> #   liveTab, raceDescription, tvText, betOffers
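All three approaches return the same data, though the column order is different for races_tidy, which is why that comparison uses all_equal (it ignores column order by default) rather than identical.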
all_equal(races_tidy, races_tidy2)
#> [1] TRUE
identical(races_tidy2, races_tidy3)
#> [1] TRUE
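For readers who find the nested lambdas in the direct approach hard to follow, the same idea can be sketched with a named helper. This is a variant, not the code above: it builds each row with c() and setNames() instead of flatten(), and it runs on a hypothetical inline document, since the real races.xml is not reproduced here. The helper name race_to_row is likewise made up for illustration.

```r
library(xml2)
library(purrr)

# Hypothetical miniature of the input; not the real races.xml.
doc <- read_xml('<meeting>
  <race id="1" type="C"><time>12:25</time><title>First</title></race>
  <race id="2" type="H"><time>01:00</time><title>Second</title></race>
</meeting>')

# Hypothetical helper: turn one <race> node into a flat named list,
# attributes first, then one element per child node.
race_to_row <- function(race) {
  kids <- xml_children(race)
  c(as.list(xml_attrs(race)),
    setNames(as.list(xml_text(kids)), xml_name(kids)))
}

# Each named list becomes one row; map_dfr row-binds them (needs dplyr installed).
races_small <- map_dfr(xml_find_all(doc, '//race'), race_to_row)
```

Here races_small has one row per race and the character columns id, type, time and title; piping it through readr's type_convert() would then fix the column types as in the versions above. The sketch assumes no attribute shares a name with a child node, since that would produce duplicate column names.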