Storing specific XML node values with R's xmlEventParse

浪子不回头ぞ submitted on 2019-11-28 08:45:28

I'll use fileName from example(xmlEventParse) as a reproducible example. It contains record tags, each with an id attribute and text that we'd like to extract. Rather than use handlers, I'll go after the branches argument. A branch function is like a handler, but it has access to the full node rather than just the element name and attributes. The idea is to write a closure that has a place to keep the data we accumulate, and a function to process each branch of the XML document we are interested in. So let's start by defining the closure -- for our purposes, a function that returns a list of functions

ourBranches <- function() {

We need a place to store the results we accumulate, choosing an environment so that the insertion times are constant (not a list, which we would have to append to and would be memory inefficient)

    store <- new.env() 

The event parser expects a list of functions to be invoked when a matching tag is discovered. We're interested in the record tag. The function we write will receive a node of the XML document. We want to extract the id attribute, which we'll use as the key under which to store the node's (text) value. We add these to our store.

    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }

Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own purposes, independent of nodes in the document

    getStore <- function() as.list(store)

and then finish the closure by returning a list of functions

    list(record=record, getStore=getStore)
}

A tricky concept here is that the environment in which a function is defined is part of the function, so each time we say ourBranches() we get a list of functions and a new environment store to keep our results. To use, invoke xmlEventParse on our file, with an empty set of event handlers, and access our accumulated store.

> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7   8 360.0 175 3.15 3.440 17.02  0  0    3 "

$`Toyota Corolla`
[1] "33.9   4  71.1  65 4.22 1.835 19.90  1  1    4 "
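
The closure behaviour described above can be seen in isolation, without any XML involved. The following is a minimal sketch; makeStore, put and contents are hypothetical names used only for illustration. Each call to makeStore() gets its own environment, so the two stores do not interfere with each other.

```r
# A stripped-down version of the closure pattern (no XML package needed).
# makeStore, put and contents are illustrative names, not part of any library.
makeStore <- function() {
  store <- new.env()                 # created afresh on every call to makeStore()
  put <- function(key, value) {
    store[[key]] <- value            # environments are modified in place
    invisible(NULL)
  }
  contents <- function() as.list(store)
  list(put = put, contents = contents)
}

a <- makeStore()
b <- makeStore()
a$put("x", 1)
a$put("y", 2)
b$put("z", 3)

length(a$contents())   # 2: only the entries added through a
names(b$contents())    # "z": b has its own, separate store
```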

For others who may try to learn from M. Morgan - here is the complete code

library(XML)
fileName <- system.file("exampleData", "mtcars.xml", package = "XML")

ourBranches <- function() {
  store <- new.env() 
  record <- function(x, ...) {
    key <- xmlAttrs(x)[["id"]]
    value <- xmlValue(x)
    store[[key]] <- value
  }
  getStore <- function() as.list(store)
  list(record=record, getStore=getStore)
}

branches <- ourBranches()
xmlEventParse(fileName, list(), branches=branches)
head(branches$getStore(), 2)

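The result of the branches approach does not preserve document order: the records come back from branches$getStore() in a different order than in the original XML file. This is because the results are kept in an environment, and as.list() on an environment returns its entries in arbitrary order. The handler approach, on the other hand, can accumulate results in a list, which preserves the order. Here is the code: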

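library(XML)
fileName <- system.file("exampleData", "mtcars.xml", package="XML")
records <- list()          # record text, keyed by id, in document order
variable <- character()    # names found in <variable> tags
tag.open <- character()    # stack of currently open tags
nvar <- 0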
xmlEventParse(fileName, list(startElement = function (name, attrs) {
  tagName <<- name
  tag.open <<- c(name, tag.open)
  if (length(attrs)) {
    attributes(tagName) <<- as.list(attrs)
  }
}, text = function (x) {
  if (nchar(x) > 0) {
    if (tagName == "record") {
      record <- list()
      record[[attributes(tagName)$id]] <- x
      records <<- c(records, record)
    } else {
      if( tagName == 'variable') {
        v <- x
        variable <<- c( variable, v)
        nvar <<- nvar + 1
      }
    }
  }
}, endElement = function (name) {
  if( name == 'record') {
    print(paste(tag.open, collapse='>'))
  }
  tag.open <<- tag.open[-1]
}))

head(records,2)
$`Mazda RX4`
[1] "21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4"

$`Mazda RX4 Wag`
[1] "21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4"

variable
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

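Another benefit of using handlers is that one can capture hierarchical structure. In other words, it is possible to save the ancestors as well (the tag.open stack above does this). One of the key points of this process is the use of variables defined outside the handler functions, which must be assigned with "<<-" instead of "<-" so that the assignment updates the outer (here, global) variable rather than creating a local copy.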
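
The "<<-" operator also works inside a closure, where it updates a variable in the enclosing function rather than a global one. The following is a minimal sketch of a store that keeps insertion order while avoiding globals; makeOrderedStore, add and contents are hypothetical names, and the values passed in are placeholders rather than real data from mtcars.xml.

```r
# Order-preserving store built as a closure (no XML package needed to run this).
# All names here are illustrative; the values are placeholders.
makeOrderedStore <- function() {
  keys <- character()
  values <- list()
  add <- function(key, value) {
    keys <<- c(keys, key)                      # updates 'keys' in the enclosing function
    values[[length(values) + 1L]] <<- value    # appends in arrival order
    invisible(NULL)
  }
  contents <- function() {
    out <- values
    names(out) <- keys
    out
  }
  list(add = add, contents = contents)
}

s <- makeOrderedStore()
s$add("Mazda RX4", "first")
s$add("Mazda RX4 Wag", "second")
s$add("Datsun 710", "third")

names(s$contents())   # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
```

Inside a branch function such as record above, one could presumably call add(xmlAttrs(x)[["id"]], xmlValue(x)) in place of store[[key]] <- value. Growing a vector this way copies it on each append, so for very large files a different accumulation strategy may be preferable.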
