Question
Following up on this question, my reading of the R (and other) documentation suggests that a SAX approach would be a faster way to parse XML data. Sadly, I couldn't find many working examples to help me understand how to get there.
Here's a dummy file with the information that I want parsed. The real file would have substantially more <ITEM> nodes, plus other nodes all around the tree that I would like to exclude. Another peculiarity is that the <META> section has two <DESC> elements, and I need either one of them (not both).
<FILE>
  <HEADER>
    <FILEID>12347</FILEID>
  </HEADER>
  <META>
    <DESC>
      <TYPE>A</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
    <DESC>
      <TYPE>B</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
  </META>
  <BODY>
    <ITEM>
      <IVALUE>1000</IVALUE>
      <ICODE>CDF</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>1500</IVALUE>
      <ICODE>EGK</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>300</IVALUE>
      <ICODE>TSR</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
  </BODY>
</FILE>
For the example XML above I'm looking to get:
> data.table(fileid=12347, code="ABC", value=10000, ivalue=c(1000,1500,300), icode=c("CDF","EGK","TSR"), itype="R")
#    fileid code value ivalue icode itype
# 1:  12347  ABC 10000   1000   CDF     R
# 2:  12347  ABC 10000   1500   EGK     R
# 3:  12347  ABC 10000    300   TSR     R
Could anyone with SAX experience guide me in building a parser to suit my needs with xmlEventParse()?
Answer 1:
The Simple API for XML (SAX) may parse the XML data faster than another approach, but in general SAX will not give you better results than, for example, XPath. On the other hand, for bigger files it avoids loading the complete tree into R, and so reduces the risk of running out of memory.
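For comparison, the tree-based XPath route might look like the sketch below. It assumes the XML package is installed, inlines a compact copy of the sample document as xml_text (a name made up for this illustration), and arbitrarily takes the first <DESC>. Note that xmlParse builds the whole tree in memory, which is the cost a SAX parser avoids on large files.

```r
library(XML)

# Compact inline copy of the sample document (assumption: same layout as the real file)
xml_text <- '<FILE><HEADER><FILEID>12347</FILEID></HEADER>
<META><DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC></META>
<BODY><ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM></BODY></FILE>'

# Parse the full document into an in-memory tree
doc <- xmlParse(xml_text, asText = TRUE)

# Single-valued fields are recycled by data.frame() to match the three items
xpath_result <- data.frame(
  fileid = xpathSApply(doc, "/FILE/HEADER/FILEID", xmlValue),
  code   = xpathSApply(doc, "/FILE/META/DESC[1]/CODE", xmlValue),
  value  = xpathSApply(doc, "/FILE/META/DESC[1]/VALUE", xmlValue),
  ivalue = xpathSApply(doc, "/FILE/BODY/ITEM/IVALUE", xmlValue),
  icode  = xpathSApply(doc, "/FILE/BODY/ITEM/ICODE", xmlValue),
  itype  = xpathSApply(doc, "/FILE/BODY/ITEM/ITYPE", xmlValue),
  stringsAsFactors = FALSE
)
```

All columns come back as character; convert with as.numeric() where needed.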
To use SAX, you can start from the code example below, which relies on the branches argument of xmlEventParse (one branch per kind of data you want to retrieve):
library(XML)

# File to read with xmlEventParse
xmlDoc <- "example.xml"
desc <- NULL
items <- NULL

# Build the branch handlers to use with xmlEventParse
row.sax <- function() {
  # Branch handler for the <DESC> nodes under <META>
  DESC <- function(node) {
    children <- xmlChildren(node)
    # Drop the whitespace-only text nodes between child elements
    children[which(names(children) == "text")] <- NULL
    desc <<- rbind(desc, sapply(children, xmlValue))
  }
  # Branch handler for the <ITEM> nodes under <BODY>
  ITEM <- function(node) {
    children <- xmlChildren(node)
    children[which(names(children) == "text")] <- NULL
    items <<- rbind(items, sapply(children, xmlValue))
  }
  branches <- list(DESC = DESC, ITEM = ITEM)
  return(branches)
}

# Call xmlEventParse; each matching node is passed to its branch handler
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

# Combine the results into a data.frame
desc <- as.data.frame(desc, stringsAsFactors = FALSE)
# Keep only the first <DESC> row and repeat it once per item
desc <- desc[rep(row.names(desc[1, ]), nrow(items)), ]
items <- as.data.frame(items, stringsAsFactors = FALSE)
result <- cbind(desc, items)
row.names(result) <- 1:nrow(result)
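The code above does not capture <FILEID>, which the question also asks for. One possible extension, offered only as a sketch: add a third branch for FILEID and keep the parser state in a closure instead of global variables. The names make_branches, child_values, xml_file and sax_result are made up for this illustration, and the sample document is written to a temporary file so the snippet is self-contained.

```r
library(XML)

# Write a compact copy of the sample document to a temporary file
xml_text <- '<FILE><HEADER><FILEID>12347</FILEID></HEADER>
<META><DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC></META>
<BODY><ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM></BODY></FILE>'
xml_file <- tempfile(fileext = ".xml")
writeLines(xml_text, xml_file)

# Hypothetical helper: branch handlers sharing state through a closure
make_branches <- function() {
  fileid <- NULL
  desc <- NULL
  items <- list()

  # Values of the element children of a node, named by tag
  child_values <- function(node) {
    children <- xmlChildren(node)
    children[which(names(children) == "text")] <- NULL
    sapply(children, xmlValue)
  }

  list(
    FILEID = function(node) fileid <<- xmlValue(node),
    # Keep only the first <DESC>; later ones are ignored
    DESC = function(node) if (is.null(desc)) desc <<- child_values(node),
    ITEM = function(node) items[[length(items) + 1]] <<- child_values(node),
    # Accessor that assembles the collected pieces after parsing
    result = function() {
      item_df <- as.data.frame(do.call(rbind, items), stringsAsFactors = FALSE)
      data.frame(fileid = fileid,
                 code = desc[["CODE"]], value = desc[["VALUE"]],
                 ivalue = item_df$IVALUE, icode = item_df$ICODE,
                 itype = item_df$ITYPE, stringsAsFactors = FALSE)
    }
  )
}

b <- make_branches()
# Pass only the three handlers; the accessor is not a branch
invisible(xmlEventParse(xml_file, handlers = list(),
                        branches = b[c("FILEID", "DESC", "ITEM")],
                        saxVersion = 2, trim = FALSE))
sax_result <- b$result()
```

Collecting items in a list and binding once at the end avoids the repeated rbind inside the handler, which may matter when the real file has many <ITEM> nodes.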
Let me know if it works for you.
Answer 2:
Maybe something like this?
library(rvest)
library(data.table)
test <- read_html("test.html")
data.table(do.call(cbind, lapply(c("fileid", "code", "value", "ivalue", "icode", "itype"), function(i) {
  test %>%
    html_nodes(i) %>%
    html_text()
})))
#       V1  V2     V3     V4  V5 V6
# 1: 12347 ABC 100000   1000 CDF  R
# 2: 12347 ABC 100000   1500 EGK  R
# 3: 12347 ABC 100000    300 TSR  R
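Two things to keep in mind with the snippet above: read_html parses the file as HTML, so tag names are lowercased, and html_nodes("code") matches the <CODE> element in both <DESC> blocks, so the columns only line up through recycling in cbind. A stricter variant, sketched here under the assumption that the xml2 package (which rvest builds on) is available, parses the document as XML and picks the first <DESC> explicitly. The inline xml_text and the name xml2_result are made up for this illustration.

```r
library(xml2)
library(data.table)

# Compact inline copy of the sample document
xml_text <- '<FILE><HEADER><FILEID>12347</FILEID></HEADER>
<META><DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
<DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC></META>
<BODY><ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
<ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM></BODY></FILE>'

# read_xml keeps the original (uppercase) tag names
doc <- read_xml(xml_text)

# Single values come from the header and the first <DESC>; item fields are vectors
xml2_result <- data.table(
  fileid = xml_text(xml_find_first(doc, "/FILE/HEADER/FILEID")),
  code   = xml_text(xml_find_first(doc, "/FILE/META/DESC[1]/CODE")),
  value  = xml_text(xml_find_first(doc, "/FILE/META/DESC[1]/VALUE")),
  ivalue = xml_text(xml_find_all(doc, "/FILE/BODY/ITEM/IVALUE")),
  icode  = xml_text(xml_find_all(doc, "/FILE/BODY/ITEM/ICODE")),
  itype  = xml_text(xml_find_all(doc, "/FILE/BODY/ITEM/ITYPE"))
)
```

Like the rvest version, this still loads the whole document into memory, so it is not a SAX parser.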
Source: https://stackoverflow.com/questions/31004615/parsing-an-xml-sax-way-in-r