Question
I am trying to parse the SAMPLE_ATTRIBUTE entries (preferably all of them) from the following XML file. I tried a couple of things, but the XML gets clumped into one node:
library(XML)
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
xmlfile <- xmlTreeParse(xml.url)
xmltop <- xmlRoot(xmlfile)
IBDcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
I also tried the solutions mentioned in "How to parse XML to R data frame" and "how to create an R data frame from a xml file", but when I try something like:
data <- xmlParse("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml")
xml_data <- xmlToList(data)
xmlToDataFrame(nodes=getNodeSet(data,"/SAMPLE_ATTRIBUTE"))[c("age","sex","body site","body-mass index")]
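I get an error saying "undefined columns selected".
Any help would be appreciated, thanks!
Answer 1:
For your second attempt, at least, you only needed to select the SAMPLE_ATTRIBUTE nodes at any depth using // (the path /SAMPLE_ATTRIBUTE matches only a root element with that name). Then subset the rows by TAG.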
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
## OR
xmlToDataFrame(doc["//SAMPLE_ATTRIBUTE"])
                  TAG      VALUE UNITS
1  investigation type metagenome  <NA>
2        project name       BMRP  <NA>
3 experimental factor microbiome  <NA>
4         target gene   16S rRNA  <NA>
5  target subfragment       V1V2  <NA>
...
subset(x, TAG %in% c("age","sex","body site","body-mass index") )
               TAG         VALUE UNITS
15             age            28 years
16             sex          male  <NA>
17       body site Sigmoid colon  <NA>
19 body-mass index    16.9550173  <NA>
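The difference between / and // can be checked offline on a small inline document. This is only a sketch: the XML below is a hypothetical miniature with a made-up ROOT wrapper, not the actual ENA record, and it assumes the XML package is installed.

```r
library(XML)

# Hypothetical miniature of the record; the real file has many more fields.
sample_xml <- '<ROOT>
  <SAMPLE>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>28</VALUE><UNITS>years</UNITS></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>male</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</ROOT>'

doc <- xmlParse(sample_xml, asText = TRUE)

# Absolute path: matches only a root element named SAMPLE_ATTRIBUTE,
# so no nodes are found here.
root_hits <- getNodeSet(doc, "/SAMPLE_ATTRIBUTE")

# Descendant search: matches SAMPLE_ATTRIBUTE elements at any depth.
nodes <- getNodeSet(doc, "//SAMPLE_ATTRIBUTE")

# One row per SAMPLE_ATTRIBUTE, one column per child element.
attrs <- xmlToDataFrame(nodes)
tags <- xpathSApply(doc, "//SAMPLE_ATTRIBUTE/TAG", xmlValue)
```

With an empty node set there are no TAG rows to select from, which is consistent with the "undefined columns selected" error in the question.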
Answer 2:
Here's a tidyverse option: xml2 has a simple read_xml function with an associated as_list function that converts the document to a nested list. purrr is a very handy package for manipulating lists, though you could, of course, do the same things in base R if you prefer.
library(xml2)
library(purrr)
x <- read_xml("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml")
x_list <- as_list(x)
x_df <- x_list %>% map('SAMPLE_ATTRIBUTES') %>% flatten() %>% map_df(flatten)
x_df
#> # A tibble: 35 × 3
#>    TAG                    VALUE                               UNITS
#>    <chr>                  <chr>                               <chr>
#>  1 investigation type     metagenome                          <NA>
#>  2 project name           BMRP                                <NA>
#>  3 experimental factor    microbiome                          <NA>
#>  4 target gene            16S rRNA                            <NA>
#>  5 target subfragment     V1V2                                <NA>
#>  6 pcr primers            27F-338R                            <NA>
#>  7 multiplex identifiers  TGATACGTCT                          <NA>
#>  8 sequencing method      pyrosequencing                      <NA>
#>  9 sequence quality check software                            <NA>
#> 10 chimera check          ChimeraSlayer; Usearch 4.1 database <NA>
#> # ... with 25 more rows
Or do the subsetting in XPath instead:
x %>% xml_find_all('//SAMPLE_ATTRIBUTE') %>% map(as_list) %>% map_df(flatten)
which returns the same thing.
Answer 3:
A slightly different approach from the very creative one by @allistaire:
library(xml2)
library(purrr)  # provides map(), map_df() and the %>% pipe
doc <- read_xml("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml")
xml_find_all(doc, ".//SAMPLE_ATTRIBUTE") %>%
  map(xml_children) %>%
  map_df(~as.list(setNames(xml_text(.), xml_name(.))))
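To get from the TAG/VALUE rows to the specific fields the question asked for, one option is to build the table with xml_find_first and then index the values by tag. This is a minimal sketch rather than the method of any answer above; it assumes only xml2 is installed and uses a hypothetical inline document (with a made-up ROOT wrapper) in place of the ENA URL.

```r
library(xml2)

# Hypothetical miniature of the record; the real file has many more fields.
sample_xml <- '<ROOT>
  <SAMPLE_ATTRIBUTES>
    <SAMPLE_ATTRIBUTE><TAG>age</TAG><VALUE>28</VALUE><UNITS>years</UNITS></SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE><TAG>sex</TAG><VALUE>male</VALUE></SAMPLE_ATTRIBUTE>
  </SAMPLE_ATTRIBUTES>
</ROOT>'

doc <- read_xml(sample_xml)
nodes <- xml_find_all(doc, ".//SAMPLE_ATTRIBUTE")

# xml_find_first() returns one result per input node, so the columns stay
# aligned; a missing child (UNITS for "sex") becomes NA.
attrs <- data.frame(
  TAG   = xml_text(xml_find_first(nodes, "./TAG")),
  VALUE = xml_text(xml_find_first(nodes, "./VALUE")),
  UNITS = xml_text(xml_find_first(nodes, "./UNITS")),
  stringsAsFactors = FALSE
)

# Named character vector of values, indexed by the tags of interest.
wanted <- setNames(attrs$VALUE, attrs$TAG)[c("age", "sex")]
```

Against the real record, the same indexing should work with c("age", "sex", "body site", "body-mass index"), provided each tag occurs exactly once.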
Source: https://stackoverflow.com/questions/41007689/xml-parser-in-r-with-hierarchical-nodes-tags-and-values