How to read large (~20 GB) xml file in R?

臣服心动 2020-12-06 19:01

I want to read data from a large XML file (20 GB) and manipulate it. I tried to use "xmlParse()" but it ran out of memory before the file finished loading. Is there an efficient way to do this?

1 Answer
  • 2020-12-06 20:02

    In the XML package, the xmlEventParse function implements a SAX parser: it reads the XML as a stream and calls your handler functions, so the whole document is never held in memory. If your XML is simple enough (repeating elements inside one root element), you can use the branches parameter to define a function for each element of interest.

    Example:

    library(XML)  # provides xmlEventParse, getNodeSet, xmlValue

    MedlineCitation <- function(x, ...) {
      # This is a "branch" function.
      # x is an XML node: everything inside one <MedlineCitation> element.
      # Find the <ArticleTitle> element inside it and print its value:
      ns <- getNodeSet(x, path = "//ArticleTitle")
      value <- xmlValue(ns[[1]])
      print(value)
    }
    

    Call XML parsing:

    xmlEventParse(
      file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml", 
      handlers = NULL, 
      branches = list(MedlineCitation = MedlineCitation)
    )
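
    If you only need counts or progress information rather than node contents, the handlers argument alone is enough and avoids building any subtree at all. A minimal, self-contained sketch (the tiny temp file below stands in for the real 20 GB file; the element name MedlineCitation is taken from the sample above):

    ```r
    library(XML)

    # Tiny stand-in for the large file, with the same repeating structure:
    demo <- tempfile(fileext = ".xml")
    writeLines(c(
      "<MedlineCitationSet>",
      "  <MedlineCitation><ArticleTitle>First title</ArticleTitle></MedlineCitation>",
      "  <MedlineCitation><ArticleTitle>Second title</ArticleTitle></MedlineCitation>",
      "</MedlineCitationSet>"
    ), demo)

    # Count <MedlineCitation> elements while streaming; the counter n
    # lives in the closure's enclosing environment.
    counter <- local({
      n <- 0L
      list(
        startElement = function(name, attrs, ...) {
          if (name == "MedlineCitation") n <<- n + 1L
        },
        count = function() n
      )
    })

    xmlEventParse(file = demo, handlers = counter)
    n_citations <- counter$count()
    ```

    Because startElement fires for every element as it streams past, memory use stays flat no matter how large the file is.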
    

    Solution with closure:

    As in Martin Morgan's answer, Storing-specific-xml-node-values-with-rs-xmleventparse:

    branchFunction <- function() {
      store <- new.env() 
      func <- function(x, ...) {
        ns <- getNodeSet(x, path = "//ArticleTitle")
        value <- xmlValue(ns[[1]])
        print(value)
        # if storing something ... 
        # store[[some_key]] <- some_value
      }
      getStore <- function() { as.list(store) }
      list(MedlineCitation = func, getStore=getStore)
    }
    
    myfunctions <- branchFunction()
    
    xmlEventParse(
      file = "medsamp2015.xml", 
      handlers = NULL, 
      branches = myfunctions
    )
    
    #to see what is inside
    myfunctions$getStore()
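
    To actually accumulate values rather than just print them (the commented-out store[[some_key]] line above), the branch function can append to a vector in the enclosing environment. A minimal sketch, again using a small temp file as a stand-in for the 20 GB one:

    ```r
    library(XML)

    # Stand-in file with the same repeating structure:
    demo <- tempfile(fileext = ".xml")
    writeLines(c(
      "<MedlineCitationSet>",
      "  <MedlineCitation><ArticleTitle>First title</ArticleTitle></MedlineCitation>",
      "  <MedlineCitation><ArticleTitle>Second title</ArticleTitle></MedlineCitation>",
      "</MedlineCitationSet>"
    ), demo)

    branchFactory <- function() {
      titles <- character(0)
      MedlineCitation <- function(x, ...) {
        ns <- getNodeSet(x, path = "//ArticleTitle")
        titles <<- c(titles, xmlValue(ns[[1]]))  # append instead of print
      }
      list(MedlineCitation = MedlineCitation,
           getTitles = function() titles)
    }

    fns <- branchFactory()
    xmlEventParse(file = demo, handlers = NULL,
                  branches = fns["MedlineCitation"])
    collected <- fns$getTitles()
    ```

    Each branch subtree is discarded after its function returns, so only the extracted titles accumulate in memory, not the XML itself.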
    