Storing XML node values with R's xmlEventParse for filtered output

Submitted by 青春壹個敷衍的年華 on 2019-12-10 17:11:09

Question


I have a huge XML file (260 MB) with a lot of information that looks like this:

Example:

<mydocument>
<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">
<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">
<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />
<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />
<Frame N="2" T="2012-09-29T18:31:21.243" X="-0.1134" Y="0.2356" S="1.33" />
</FrameSet>
</POSITIONS>
</mydocument>

There are around 40 different FrameSet nodes, each with a different GameSection="..." and Object="...".

I would like to extract the information in the <Frame> nodes into a list object, but I cannot load the whole XML file because it is too large. Is there any way I can use the xmlEventParse function to filter for a specific GameSection and a specific Object and get all the information from the corresponding <Frame> elements?


Answer 1:


It might be that the 'internal' representation is small enough to fit in memory, so try that first:

library(XML)
xml = xmlTreeParse("file.xml", useInternalNodes=TRUE)

and then XPath will definitely be your best bet. If that doesn't work, you'll need to get your head around closures. I'm going to aim for the branches argument of xmlEventParse, which allows hybrid parsing: event parsing to iterate through the file, coupled with DOM parsing of each node of interest. Here's a function that returns a list of functions.

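branchFactory <-
    function()
{
    # storage for the results; an empty parent is a safety measure
    env <- new.env(parent=emptyenv())

    # branch handler: called with each complete <FrameSet> node
    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

    # accessor for the accumulated results
    get <- function() env

    list(get=get, FrameSet=FrameSet)
}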

Inside this function we're going to create a place to store our results as we iterate through the file. This could be a list, but it'll be better to use an environment. This will allow us to insert new results without copying all the results that we've already inserted. So here's our environment:

    env <- new.env(parent=emptyenv())

We use the parent argument as a measure of safety, even if it's not relevant in our present case. Now we define a function that will be invoked whenever a "FrameSet" node is encountered:

    FrameSet <- function(elt) {
        id <- paste(xmlAttrs(elt), collapse=":")
        env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
    }

It turns out that, when we use the branches argument, xmlEventParse arranges to parse the entire node into an object that we can manipulate via the DOM, e.g., using xmlAttrs and xpathSApply. The first line of this function creates an identifier for this frame set by pasting its attributes together. You'll need the identifier to be unique; that holds for the sample, but it may not hold for the full data set. We then query the element for its Frame nodes with the "//Frame" XPath expression, and store the result in our environment. Storing the result is trickier than it looks -- we're assigning to a variable called env. env doesn't exist in the body of the FrameSet function, so R uses its lexical scoping rules to search for a variable named env in the environment in which the FrameSet function was defined. And lo, it finds the env that we have already created. This is where we add the result of xpathSApply. That's it for our FrameSet node parser.

We'd also like a convenience function that we can use to retrieve env, like this:

    get <- function() env

Again, this is going to use lexical scoping to find the env variable created at the top of branchFactory. We end branchFactory by returning a list of the functions that we've defined:

    list(get=get, FrameSet=FrameSet)

This too is surprisingly tricky -- we're returning a list of functions. The functions are defined in the environment created when we invoke branchFactory and, for lexical scope to work, the environment has to persist. So actually we're returning not only the list of functions, but also, implicitly, the variable env.
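
The same pattern can be seen in isolation with a small base R sketch that has nothing to do with XML. The names accumulatorFactory and add are made up for illustration; the point is only that each call to the factory gets its own private env, and that assigning with env[[key]] updates that environment in place.

```r
# Hypothetical factory, for illustration only: same shape as branchFactory,
# but the handler stores plain values instead of parsed XML.
accumulatorFactory <- function() {
    env <- new.env(parent=emptyenv())   # private storage for this instance

    add <- function(key, value) {
        env[[key]] <- value             # updates the shared environment in place
    }

    get <- function() env

    list(add=add, get=get)
}

a <- accumulatorFactory()
b <- accumulatorFactory()
a$add("x", 1)
a$add("y", 2)
b$add("z", 3)

ls(a$get())   # "x" "y"
ls(b$get())   # "z"
```

The two instances do not see each other's values, which is why a fresh call to branchFactory is needed for each parse.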

We're now ready to parse our file. Do this by creating an instance of the branch parser, with its own unique versions of the get and FrameSet functions and of the env variable created to store results. Then parse the file:

b <- branchFactory()
xx <- xmlEventParse("file.xml", handlers=list(), branches=b)

We can retrieve the results using b$get(), and can cast this to a list if that's convenient.

> as.list(b$get())
$`1sthalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                  [,3]                 
N "0"                   "1"                   "2"                  
T "2012-09-29T18:31:21" "2012-09-29T18:31:21" "2012-09-29T18:31:21"
X "-0.1158"             "-0.1146"             "-0.1134"            
Y "0.2347"              "0.2351"              "0.2356"             
S "1.27"                "1.3"                 "1.33"               

$`2ndhalf:0000T0:REFEREE:00011D`
  [,1]                  [,2]                      [,3]                     
N "0"                   "1"                       "2"                      
T "2012-09-29T18:31:21" "2012-09-29T18:31:21.196" "2012-09-29T18:31:21.243"
X "-0.1158"             "-0.1146"                 "-0.1134"                
Y "0.2347"              "0.2351"                  "0.2356"                 
S "1.27"                "1.3"                     "1.33"                   
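
The handler above stores every FrameSet, whereas the question asks for one specific GameSection and Object. A possible way to add that filter is sketched below. filteredBranchFactory and its arguments gameSection and object are hypothetical names, not part of the XML package, and the sample file is a cut-down copy of the example from the question, written to a temporary file so the sketch can run on its own.

```r
library(XML)

# Cut-down copy of the question's example, written to a temporary file
sample <- c(
    '<mydocument>',
    '<POSITIONS EventTime="2012-09-29T20:31:21" InternalMatchId="0000T0">',
    '<FrameSet GameSection="1sthalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21" X="-0.1146" Y="0.2351" S="1.3" />',
    '</FrameSet>',
    '<FrameSet GameSection="2ndhalf" Match="0000T0" Club="REFEREE" Object="00011D">',
    '<Frame N="0" T="2012-09-29T18:31:21" X="-0.1158" Y="0.2347" S="1.27" />',
    '<Frame N="1" T="2012-09-29T18:31:21.196" X="-0.1146" Y="0.2351" S="1.3" />',
    '</FrameSet>',
    '</POSITIONS>',
    '</mydocument>'
)
path <- tempfile(fileext=".xml")
writeLines(sample, path)

# Hypothetical variant of branchFactory that keeps only matching FrameSets
filteredBranchFactory <- function(gameSection, object) {
    env <- new.env(parent=emptyenv())

    FrameSet <- function(elt) {
        # isTRUE() also covers a missing attribute, where xmlGetAttr gives NULL
        keep <- isTRUE(xmlGetAttr(elt, "GameSection") == gameSection) &&
            isTRUE(xmlGetAttr(elt, "Object") == object)
        if (keep) {
            id <- paste(xmlAttrs(elt), collapse=":")
            env[[id]] <- xpathSApply(elt, "//Frame", xmlAttrs)
        }
    }

    get <- function() env

    list(get=get, FrameSet=FrameSet)
}

b2 <- filteredBranchFactory("2ndhalf", "00011D")
xx <- xmlEventParse(path, handlers=list(), branches=b2)
res <- as.list(b2$get())
names(res)
```

Non-matching FrameSet nodes are still parsed as branches, but their Frame attributes are never stored, so only the requested frame set ends up in the result.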

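Each stored result is a character matrix with one row per attribute and one column per Frame. If a data frame with numeric columns is more convenient, a conversion along the following lines may help. This is a base R sketch on a hand-built matrix shaped like the output above; the variable names m, frames and numericCols are illustrative.

```r
# Character matrix shaped like one element of as.list(b$get()):
# one row per attribute, one column per Frame
m <- matrix(
    c("0", "2012-09-29T18:31:21", "-0.1158", "0.2347", "1.27",
      "1", "2012-09-29T18:31:21", "-0.1146", "0.2351", "1.3"),
    nrow=5,
    dimnames=list(c("N", "T", "X", "Y", "S"), NULL)
)

# Transpose so that each Frame becomes a row, then convert the numeric columns
frames <- as.data.frame(t(m), stringsAsFactors=FALSE)
numericCols <- c("N", "X", "Y", "S")
frames[numericCols] <- lapply(frames[numericCols], as.numeric)
str(frames)
```

The T column is left as character here; how to parse the timestamps depends on whether the fractional seconds matter for the analysis.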

Source: https://stackoverflow.com/questions/16676798/storing-xml-node-values-with-rs-xmleventparse-for-filtered-output
