R dataframe from XML when values are multiple or missing

盖世英雄少女心 2020-12-16 08:40

This question is similar to a previous question, Import all fields (and subfields) of XML as dataframe, but I want to pull out only a subset of the XML data and want to incl

4 answers
  • 2020-12-16 08:50

    You can convert the XML to a nested list with xmlToList (XML package) and then flatten it with ldply from plyr to get a data frame:

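    library(XML)
    library(plyr)
    xD <- xmlParse(xData)   # xData: the XML from the question
    xL <- xmlToList(xD)     # nested list, one element per <city>
    ldply(xL, data.frame)   # one row per city, repeated nodes get suffixed columns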
       .id     name buildings.building.type buildings.building.bname
    1 city   London                landmark             Tower Bridge
    2 city New York                 station            Grand Central
    3 city    Paris                landmark             Eiffel Tower
      buildings.building.type.1 buildings.building.bname.1
    1                   station                   Waterloo
    2                      <NA>                       <NA>
    3                  landmark                     Louvre
    

    You can then pick the columns you need from this data frame.
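As a rough sketch of that last step, the wide table can be stacked into one row per landmark with base R alone. The data frame below is typed in by hand from the printed output above, and the names `wide`, `long`, and `result` are illustrative, not part of the original answer.

```r
# Hand-copied from the ldply() output above (assumption: same column names).
wide <- data.frame(
  name = c("London", "New York", "Paris"),
  buildings.building.type = c("landmark", "station", "landmark"),
  buildings.building.bname = c("Tower Bridge", "Grand Central", "Eiffel Tower"),
  buildings.building.type.1 = c("station", NA, "landmark"),
  buildings.building.bname.1 = c("Waterloo", NA, "Louvre"),
  stringsAsFactors = FALSE
)

# Stack the two (type, bname) column pairs on top of each other.
long <- rbind(
  data.frame(city = wide$name,
             type = wide$buildings.building.type,
             landmark = wide$buildings.building.bname,
             stringsAsFactors = FALSE),
  data.frame(city = wide$name,
             type = wide$buildings.building.type.1,
             landmark = wide$buildings.building.bname.1,
             stringsAsFactors = FALSE)
)

# Keep only the landmark rows.
result <- long[!is.na(long$type) & long$type == "landmark",
               c("city", "landmark")]

# Cities without any landmark still get one row, with NA as the landmark.
no_landmark <- setdiff(wide$name, result$city)
if (length(no_landmark) > 0) {
  result <- rbind(result,
                  data.frame(city = no_landmark,
                             landmark = NA_character_,
                             stringsAsFactors = FALSE))
}

# Restore the original city order and reset the row names.
result <- result[order(match(result$city, wide$name)), ]
rownames(result) <- NULL
result
```

This only handles the two column pairs shown; a city with more buildings would add further suffixed columns that need the same treatment.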

  • 2020-12-16 09:01

    There is a solution using xpathSApply, but writing the XPath expression for it is a little complicated. Instead, here is a solution that uses xmlToDataFrame and some regular expressions to extract the landmark buildings.

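    # doc: the XML from the question
    dd <- xmlToDataFrame(doc)
    # turn every "landmark" marker into a separator,
    # then drop everything from the first "station" onwards
    rr <- gsub('landmark', ',', dd$buildings)
    rr <- gsub('station.*', '', rr)
    # split on the separator and discard the empty pieces
    builds <- lapply(strsplit(rr, ','),
                     function(x) x[nchar(x) > 0])
    dd$buildings <- builds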
    
        name            buildings
    1   London         Tower Bridge
    2 New York                     
    3    Paris Eiffel Tower, Louvre
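The substitutions above rely on any station coming after the landmarks within a city. A variant that tolerates either order can be sketched with a single look-around pattern. The input strings below are an assumption about what the buildings column contains (type and bname text run together), and `pattern` and `landmarks` are illustrative names.

```r
# Assumed cell contents: the text of each <type> and <bname>, concatenated.
buildings <- c(
  "landmarkTower BridgestationWaterloo",
  "stationGrand Central",
  "landmarkEiffel TowerlandmarkLouvre"
)

# Capture the text that follows "landmark", stopping at the next
# "landmark", the next "station", or the end of the string.
pattern <- "(?<=landmark).*?(?=landmark|station|$)"

# regmatches() returns one character vector per input string;
# strings without a match give character(0).
landmarks <- regmatches(buildings, gregexpr(pattern, buildings, perl = TRUE))
landmarks
```

Like the original, this breaks if a building name itself contains the word "landmark" or "station", so it is best treated as a quick fix for this particular data.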
    
  • 2020-12-16 09:06

    If you're looking to exactly reproduce the desired output you showed in your question, you can convert your XML to a list and then extract the information you want:

    xml_list <- xmlToList(xmlParse(xml_data))
    

    First loop through each "building" node and remove those that contain "station":

    xml_list <- lapply(xml_list, lapply, function(x) {
      x[!sapply(x, function(y) any(y == "station"))]
    })
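The `lapply(xml_list, lapply, f)` idiom passes `lapply` itself as the function, so `f` is applied one level down, to each element of each city. A minimal sketch on a hand-built list follows; the list is an assumed imitation of the xmlToList output with only two cities, and `drop_stations` is an illustrative name. `unlist()` is added here only to make the comparison explicit.

```r
# Hand-built stand-in for the xmlToList() result (assumption).
cities <- list(
  city = list(
    name = "London",
    buildings = list(
      building = list(type = "landmark", bname = "Tower Bridge"),
      building = list(type = "station", bname = "Waterloo")
    )
  ),
  city = list(
    name = "New York",
    buildings = list(
      building = list(type = "station", bname = "Grand Central")
    )
  )
)

# Drop every element of x that contains the value "station".
drop_stations <- function(x) {
  x[!sapply(x, function(y) any(unlist(y) == "station"))]
}

# Outer lapply walks the cities; the inner lapply (passed as FUN) walks
# each city's elements (name, buildings) and applies drop_stations to each.
filtered <- lapply(cities, lapply, drop_stations)

length(filtered[[1]]$buildings)  # London keeps its landmark
length(filtered[[2]]$buildings)  # New York has no buildings left
```

The `name` element passes through unchanged because a city name never equals "station".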
    

    Then collect the data for each city into a data frame, using NA when a city has no landmark left:

    xml_list <- lapply(xml_list, function(x) {
      bldgs <- unlist(x$buildings)
      bldgs <- bldgs[bldgs != "landmark"]
      if(is.null(bldgs)) bldgs <- NA
      data.frame(
        "city" = x$name,
        "landmark" = bldgs,
        stringsAsFactors = FALSE)
    })
    

    Then combine information from all cities together:

    xml_output <- do.call("rbind", xml_list)
    xml_output
               city     landmark
    city     London Tower Bridge
    city1  New York         <NA>
    city.1    Paris Eiffel Tower
    city.2    Paris       Louvre
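The NA placeholder matters because `data.frame()` recycles the single city name across however many landmarks there are, and a city with none would otherwise contribute no landmark column at all. A small base R sketch of just that step, with the hypothetical helper `make_rows` and hand-typed inputs:

```r
# Build the rows for one city; landmarks may hold zero, one, or many names.
make_rows <- function(city, landmarks) {
  # Without this, a city with no landmarks would have nothing to bind.
  if (length(landmarks) == 0) landmarks <- NA_character_
  # city has length 1 and is recycled to match the number of landmarks.
  data.frame(city = city, landmark = landmarks, stringsAsFactors = FALSE)
}

rows <- list(
  make_rows("London", "Tower Bridge"),
  make_rows("New York", character(0)),
  make_rows("Paris", c("Eiffel Tower", "Louvre"))
)

# All pieces share the same two columns, so rbind() can stack them.
out <- do.call(rbind, rows)
out
```

Testing `length(...) == 0` instead of `is.null(...)` covers both `NULL` and empty vectors or lists.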
    
  • 2020-12-16 09:15

    Assuming the XML data is in a file called world.xml, read it in and iterate over the cities, extracting the city name and the bname of any associated landmarks:

    library(XML)
    doc <- xmlParse("world.xml", useInternalNodes = TRUE)
    
    do.call(rbind, xpathApply(doc, "/world/city", function(node) {
    
       city <- xmlValue(node[["name"]])
    
       xp <- "./buildings/building[./type/text()='landmark']/bname"
       landmark <- xpathSApply(node, xp, xmlValue)
       if (length(landmark) == 0) landmark <- NA  # no landmark matched for this city
    
       data.frame(city, landmark, stringsAsFactors = FALSE)
    
    }))
    

    The result is:

          city     landmark
    1   London Tower Bridge
    2 New York         <NA>
    3    Paris Eiffel Tower
    4    Paris       Louvre
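To try this without a world.xml file, the input can be inlined as text. The XML below is a reconstruction inferred from the XPath expressions and the printed results, not the question's original data, and `xml_text`, `rows`, and `result` are illustrative names. The emptiness check uses `length()` so it works whether the query returns `NULL` or an empty list.

```r
library(XML)

# Reconstructed sample input (assumption: world > city > name, buildings).
xml_text <- '<world>
  <city>
    <name>London</name>
    <buildings>
      <building><type>landmark</type><bname>Tower Bridge</bname></building>
      <building><type>station</type><bname>Waterloo</bname></building>
    </buildings>
  </city>
  <city>
    <name>New York</name>
    <buildings>
      <building><type>station</type><bname>Grand Central</bname></building>
    </buildings>
  </city>
  <city>
    <name>Paris</name>
    <buildings>
      <building><type>landmark</type><bname>Eiffel Tower</bname></building>
      <building><type>landmark</type><bname>Louvre</bname></building>
    </buildings>
  </city>
</world>'

# asText = TRUE tells xmlParse the argument is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

rows <- xpathApply(doc, "/world/city", function(node) {
  city <- xmlValue(node[["name"]])
  # Relative path: only buildings of this city whose type is "landmark".
  xp <- "./buildings/building[./type/text()='landmark']/bname"
  landmark <- xpathSApply(node, xp, xmlValue)
  if (length(landmark) == 0) landmark <- NA_character_
  data.frame(city = city, landmark = landmark, stringsAsFactors = FALSE)
})

result <- do.call(rbind, rows)
result
```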
    