Parsing an XML SAX way in R

Submitted by 这一生的挚爱 on 2019-12-06 13:52:13

Question


Following on from this question, my reading of the R (and other) documentation suggests that a SAX approach would be a faster way to parse XML data. Sadly, I couldn't find many working examples to help me understand how to get there.

Here's a dummy file with information that I want parsed. The real thing would have substantially more <ITEM> nodes and other nodes all around the tree that I would like to exclude. Another peculiarity is that the <META> section has two <DESC> elements, and I need any one of them (not both).

<FILE>
  <HEADER>
    <FILEID>12347</FILEID>
  </HEADER>
  <META>
    <DESC>
      <TYPE>A</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
    <DESC>
      <TYPE>B</TYPE>
      <CODE>ABC</CODE>
      <VALUE>100000</VALUE>
    </DESC>
  </META>
  <BODY>
    <ITEM>
      <IVALUE>1000</IVALUE>
      <ICODE>CDF</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>1500</IVALUE>
      <ICODE>EGK</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
    <ITEM>
      <IVALUE>300</IVALUE>
      <ICODE>TSR</ICODE>
      <ITYPE>R</ITYPE>
    </ITEM>
  </BODY>
</FILE>

For the example XML above I'm looking to get

> data.table(fileid=12347, code="ABC", value=10000, ivalue=c(1000,1500,300), icode=c("CDF","EGK","TSR"), itype="R")
#    fileid code value ivalue icode itype
# 1:  12347  ABC 10000   1000   CDF     R
# 2:  12347  ABC 10000   1500   EGK     R
# 3:  12347  ABC 10000    300   TSR     R    

Could anyone with SAX experience guide me to building a parser to suit my needs with xmlEventParse()?


Answer 1:


The Simple API for XML (SAX) may parse the XML data faster than another approach, but in general it will not give you better results than, for example, XPath. Its real advantage shows on bigger files: SAX does not load the complete tree into R, and so avoids potential memory problems.
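To make the event-driven idea concrete, here is a minimal sketch of xmlEventParse() driven by plain startElement/text/endElement handlers, so that no tree is built at all. The helper make_collector() and the inlined, trimmed-down XML string are assumptions made for illustration; they are not part of the question. Text can arrive in several chunks, so the sketch buffers it until the closing tag.

```r
library(XML)

# Trimmed-down copy of the example document, inlined so the sketch is self-contained.
xml_text <- paste0(
  "<FILE><HEADER><FILEID>12347</FILEID></HEADER><BODY>",
  "<ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE></ITEM>",
  "<ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE></ITEM>",
  "</BODY></FILE>"
)

# make_collector() is a hypothetical helper: it returns SAX handlers that
# share state through a closure, plus an accessor for the collected values.
make_collector <- function(wanted) {
  current <- NULL   # name of the wanted element currently open, if any
  buffer <- ""      # text accumulated for that element
  found <- list()   # collected values, one character vector per tag name

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name %in% wanted) {
        current <<- name
        buffer <<- ""
      }
    },
    text = function(content, ...) {
      # Only keep text that belongs to an element we care about.
      if (!is.null(current)) buffer <<- paste0(buffer, content)
    },
    endElement = function(name, ...) {
      if (!is.null(current) && name == current) {
        found[[name]] <<- c(found[[name]], buffer)
        current <<- NULL
      }
    }
  )
  list(handlers = handlers, result = function() found)
}

collector <- make_collector(c("FILEID", "IVALUE", "ICODE"))
invisible(xmlEventParse(xml_text, handlers = collector$handlers, asText = TRUE))
found <- collector$result()
str(found)
```

Everything outside the wanted tags is skipped as it streams past, which is what keeps memory use flat on large files. The price is that you have to track the parser state yourself.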

To use SAX, you can start from the code example below, which relies on the branches argument of xmlEventParse (one branch per kind of data you want to retrieve):

library(XML)

#a file to read with xmlEventParse
xmlDoc <- "example.xml"

desc <- NULL
items <- NULL

#function to use with xmlEventParse
row.sax = function() {

    #SAX function for Meta 'DESC'
    DESC = function(node){
        children <- xmlChildren(node)
        children[which(names(children) == "text")] <- NULL
        desc <<- rbind(desc, sapply(children,xmlValue))
    }

    #SAX function for Body 'ITEM'
    ITEM = function(node){
        children <- xmlChildren(node)
        children[which(names(children) == "text")] <- NULL
        items <<- rbind(items, sapply(children,xmlValue))
    }

    branches <- list(DESC = DESC, ITEM = ITEM)
    return(branches)
}

#call the xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)

#processing the result as data.frame
desc <- as.data.frame(desc, stringsAsFactors = FALSE)
#keep only the first DESC and repeat it once per ITEM
desc <- desc[rep(1, nrow(items)), ]

items <- as.data.frame(items, stringsAsFactors = FALSE)

result <- cbind(desc, items)
row.names(result) <- seq_len(nrow(result))
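The code above does not pick up FILEID, which the question also asks for. A possible extension, sketched under a few assumptions, adds a third branch for FILEID and keeps only the first DESC. The helpers child_values() and make_branches() are hypothetical names, and the XML is written to a temporary file only to keep the sketch self-contained; the xmlEventParse() call itself mirrors the one above.

```r
library(XML)

# The example document from the question, written to a temporary file.
xml_text <- '<FILE>
  <HEADER><FILEID>12347</FILEID></HEADER>
  <META>
    <DESC><TYPE>A</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
    <DESC><TYPE>B</TYPE><CODE>ABC</CODE><VALUE>100000</VALUE></DESC>
  </META>
  <BODY>
    <ITEM><IVALUE>1000</IVALUE><ICODE>CDF</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>1500</IVALUE><ICODE>EGK</ICODE><ITYPE>R</ITYPE></ITEM>
    <ITEM><IVALUE>300</IVALUE><ICODE>TSR</ICODE><ITYPE>R</ITYPE></ITEM>
  </BODY>
</FILE>'
xml_path <- tempfile(fileext = ".xml")
writeLines(xml_text, xml_path)

# Hypothetical helper: values of the element children, whitespace text dropped.
child_values <- function(node) {
  vals <- sapply(xmlChildren(node), xmlValue)
  vals[names(vals) != "text"]
}

# Hypothetical helper: branch functions that share state through a closure.
make_branches <- function() {
  fileid <- NA_character_
  desc <- NULL      # first DESC only
  items <- list()   # one named character vector per ITEM

  branches <- list(
    FILEID = function(node, ...) fileid <<- xmlValue(node),
    DESC = function(node, ...) {
      if (is.null(desc)) desc <<- child_values(node)
    },
    ITEM = function(node, ...) {
      items[[length(items) + 1L]] <<- child_values(node)
    }
  )

  list(
    branches = branches,
    result = function() {
      items_df <- as.data.frame(do.call(rbind, items), stringsAsFactors = FALSE)
      out <- data.frame(
        fileid = fileid,
        code = desc[["CODE"]],
        value = as.numeric(desc[["VALUE"]]),
        items_df,
        stringsAsFactors = FALSE
      )
      names(out) <- tolower(names(out))
      out$ivalue <- as.numeric(out$ivalue)
      out
    }
  )
}

collector <- make_branches()
invisible(xmlEventParse(xml_path, handlers = list(), branches = collector$branches,
                        saxVersion = 2, trim = FALSE))
result <- collector$result()
print(result)
```

Collecting the ITEM rows in a list and binding them once at the end avoids the repeated rbind() inside the handler, which gets slow when there are many nodes. If a data.table is needed, data.table::as.data.table(result) should convert it.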

Let me know if it works for you.




Answer 2:


Maybe something like this?

library(rvest)
library(data.table)

# read_html() uses an HTML parser, which lower-cases the tag names
test <- read_html("test.html")
data.table(do.call(cbind, lapply(c("fileid", "code", "value", "ivalue", "icode", "itype"), function(i) {
    test %>%
        html_nodes(i) %>%
        html_text()
})))

         V1  V2     V3   V4  V5 V6
    1: 12347 ABC 100000 1000 CDF  R
    2: 12347 ABC 100000 1500 EGK  R
    3: 12347 ABC 100000  300 TSR  R


Source: https://stackoverflow.com/questions/31004615/parsing-an-xml-sax-way-in-r
