R: convert XML data to data frame

醉话见心 2020-12-08 04:52

For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet bu

4 answers
  •  离开以前
    2020-12-08 05:08

    It may not be as verbose as the XML package, but xml2 doesn't have the memory leaks and is laser-focused on data extraction. I use trimws(), which was added to base R in version 3.2.0.

    library(xml2)
    
    pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")
    
    # get all the <record>s
    recs <- xml_find_all(pg, "//record")
    
    # extract and clean all the columns
    vals <- trimws(xml_text(recs))
    
    # extract and clean (if needed) the area names
    labs <- trimws(xml_attr(recs, "label"))
    
    # mine the column names from the two variable descriptions
    # this XPath construct lets us grab either the <categoricalvariable> or <realvariable> tags
    # and then grabs the 'name' attribute of them
    cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or
                                                          self::realvariable]"), "name")
    
    # this converts each set of <record> columns to a data frame
    # after first converting each row to numeric and assigning
    # names to each column (making it easier to do the matrix to data frame conv)
    dat <- do.call(rbind, lapply(strsplit(vals, "[[:space:]]+"),
                                     function(x) {
                                       data.frame(rbind(setNames(as.numeric(x),cols)))
                                     }))
    
    # then assign the area name column to the data frame
    dat$area_name <- labs
    
    head(dat)
    ##   region area palmitic palmitoleic stearic oleic linoleic linolenic
    ## 1      1    1     1075          75     226  7823      672        NA
    ## 2      1    1     1088          73     224  7709      781        31
    ## 3      1    1      911          54     246  8113      549        31
    ## 4      1    1      966          57     240  7952      619        50
    ## 5      1    1     1051          67     259  7771      672        50
    ## 6      1    1      911          49     268  7924      678        51
    ##   arachidic eicosenoic    area_name
    ## 1        60         29 North-Apulia
    ## 2        61         29 North-Apulia
    ## 3        63         29 North-Apulia
    ## 4        78         35 North-Apulia
    ## 5        80         46 North-Apulia
    ## 6        70         44 North-Apulia
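
The same pipeline can be tried without network access. The following is a minimal sketch, assuming a made-up miniature document whose element and attribute names (`ggobidata`, `variables`, `record`, `label`, `name`) are modelled on the XPath expressions above rather than copied from the real olive.xml. It also swaps in a slightly different final step: each record becomes a numeric vector, the vectors are bound into a matrix once, and the matrix is converted to a data frame, instead of building one data frame per row.

```r
library(xml2)

# Hypothetical miniature document; the structure is an assumption
# inferred from the XPath expressions in the answer.
doc <- read_xml('<ggobidata>
  <data name="olive">
    <variables count="3">
      <categoricalvariable name="region"/>
      <realvariable name="palmitic"/>
      <realvariable name="oleic"/>
    </variables>
    <records count="2">
      <record label="North-Apulia"> 1 1075 7823 </record>
      <record label="Calabria"> 1 1088 7709 </record>
    </records>
  </data>
</ggobidata>')

# Record nodes, their whitespace-separated values, and their labels
recs <- xml_find_all(doc, "//record")
vals <- trimws(xml_text(recs))
labs <- trimws(xml_attr(recs, "label"))

# Column names come from the 'name' attribute of the variable descriptions
cols <- xml_attr(
  xml_find_all(doc, "//data/variables/*[self::categoricalvariable or self::realvariable]"),
  "name"
)

# Split each record on whitespace, convert to numeric, bind rows into a matrix
mat <- do.call(rbind, lapply(strsplit(vals, "[[:space:]]+"), as.numeric))

dat <- as.data.frame(mat)
names(dat) <- cols
dat$area_name <- labs

print(dat)
```

Binding plain numeric vectors avoids creating a data frame for every record, which may matter for larger files; the result should have the same shape as the per-row approach.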
    

    UPDATE

    I'd probably do the last bit this way now:

    library(tidyverse)
    
    strsplit(vals, "[[:space:]]+") %>% 
      map_df(~as_tibble(as.list(setNames(as.numeric(.), cols)))) %>% 
      mutate(area_name=labs)
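
To see what the tidyverse-style step produces, here is a small sketch on made-up inputs. The `vals`, `cols` and `labs` vectors below are hypothetical stand-ins for the ones extracted from the XML, and `map()` followed by `bind_rows()` is used in place of `map_df()` as one possible variation, with an explicit `as.numeric()` so the value columns come out numeric rather than character.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical stand-ins for the vectors extracted from the XML
vals <- c("1 1075 7823", "1 1088 7709")
cols <- c("region", "palmitic", "oleic")
labs <- c("North-Apulia", "Calabria")

dat <- strsplit(vals, "[[:space:]]+") %>%
  # one single-row tibble per record, with numeric columns named by 'cols'
  map(~ as_tibble(as.list(setNames(as.numeric(.x), cols)))) %>%
  # stack the single-row tibbles
  bind_rows() %>%
  # attach the record labels as an extra column
  mutate(area_name = labs)

print(dat)
```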
    
