R: Generic flattening of JSON to data.frame

Question

This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries.

There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with collections of complex JSON data structures.

In Python and Ruby I've been well-served by implementing a generic data structure flattening utility that uses the path to a leaf node in a data structure as the name of the value at that node in the flattened data structure. For example, the value my_data[['x']][[2]][['y']] would appear as result[['x.2.y']].

If the data structures in a collection are not entirely homogeneous, the key to flattening them into a dataframe is to discover the names of all possible dataframe columns, e.g., by taking the union of all keys/names of the values in the individually flattened data structures.

This seems like a common pattern and so I'm wondering whether someone has already built this for R. If not, I'll build it but, given R's unique promise-based data structures, I'd appreciate advice on an implementation approach that minimizes heap thrashing.

Answer 1:

Hi @Sim, I had cause to reflect on your problem yesterday. Define:

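flatten <- function(x) {
    # Build path-style names for every leaf, e.g. "x.2.y"
    dumnames <- unlist(getnames(x, TRUE))
    # Scalar leaves pick up a spurious ".1" suffix; strip it only at the end of the name
    dumnames <- sub("\\.1$", "", dumnames)
    # Splice out one level of nesting per pass until no list elements remain.
    # Checking before calling c() means an already-flat list is not coerced to a vector.
    while (any(vapply(x, is.list, logical(1)))) {
        x <- do.call(c, x)
    }
    names(x) <- dumnames
    x
}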
getnames <- function(x, recursive) {
    # Returns a nested list of path names mirroring the structure of x
    nametree <- function(x, parent_name, depth) {
        if (length(x) == 0)
            return(character(0))
        x_names <- names(x)
        if (is.null(x_names)) {
            # Unnamed container: use positions as names
            x_names <- seq_along(x)
            x_names <- paste(parent_name, x_names, sep = "")
        } else {
            # Partially named container: fill the blanks with positions
            x_names[x_names == ""] <- seq_along(x)[x_names == ""]
            x_names <- paste(parent_name, x_names, sep = "")
        }
        # Stop at leaves, or after the first level when not recursing
        if (!is.list(x) || (!recursive && depth >= 1L))
            return(x_names)
        x_names <- paste(x_names, ".", sep = "")
        lapply(seq_along(x), function(i) nametree(x[[i]],
            x_names[i], depth + 1L))
    }
    nametree(x, "", 0L)
}

(getnames is adapted from AnnotationDbi:::make.name.tree)

(flatten is adapted from the discussion in "How to flatten a list to a list without coercion?")

As a simple example:

my_data<-list(x=list(1,list(1,2,y='e'),3))

> my_data[['x']][[2]][['y']]
[1] "e"

> out<-flatten(my_data)
> out
$x.1
[1] 1

$x.2.1
[1] 1

$x.2.2
[1] 2

$x.2.y
[1] "e"

$x.3
[1] 3

> out[['x.2.y']]
[1] "e"

So the result is a flattened list with roughly the naming structure you suggest. Coercion is also avoided, which is a plus.

A more complicated example:

library(RJSONIO)
library(RCurl)
json.data<-getURL("http://www.reddit.com/r/leagueoflegends/.json")
dumdata<-fromJSON(json.data)
out<-flatten(dumdata)

UPDATE

A naive way to remove the trailing .1 (the pattern is anchored with $ so that only the final .1 of each name is stripped):

my_data <- list(x = list(1, list(1, 2, y = 'e'), 3))

> sub("\\.1$", "", unlist(getnames(my_data, TRUE)))
[1] "x.1"   "x.2.1" "x.2.2" "x.2.y" "x.3"

Answer 2:

R has two packages for dealing with JSON input: rjson and RJSONIO. If I understand correctly what you mean by "collection of non-cyclical homogeneous or heterogeneous data structures", I think either of these packages will import that sort of structure as a list.

You can then flatten that list (into a vector) using the unlist function.

If the list is suitably structured (a non-nested list where each element has the same length), then as.data.frame provides an alternative way to convert the list to a data frame.

An example:

(my_data <- list(x = list('1' = 1, '2' = list(y = 2))))
unlist(my_data)
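
To make the two options above concrete, here is a small base-R sketch. The variable names (nested, mixed, columns) are made up for illustration. It shows the path-style names that unlist produces, the coercion that happens when leaf types differ, and the as.data.frame route for a non-nested list.

```r
# Hypothetical example data, not taken from the question.
nested <- list(x = list("1" = 1, "2" = list(y = 2)))
flat <- unlist(nested)
names(flat)             # "x.1" "x.2.y"

# unlist returns an atomic vector, so mixed leaf types are coerced
# to a common type (here: character).
mixed <- list(id = 1, label = "e")
class(unlist(mixed))    # "character"

# A non-nested list with equal-length elements converts directly.
columns <- list(a = 1:2, b = c("u", "v"))
df <- as.data.frame(columns, stringsAsFactors = FALSE)
dim(df)                 # 2 2
```

The coercion in the second case is the main reason to prefer a list-preserving flatten when leaf values have different types.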

Answer 3:

The jsonlite package is a fork of RJSONIO specifically designed to make conversion between JSON and data frames easier. You don't provide any example JSON data, but I think this might be what you are looking for. Have a look at this blog post or the vignette.
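
As a rough illustration of what this could look like, the sketch below parses a made-up JSON array of heterogeneous objects (json_txt is an assumption, not data from the question) with jsonlite's fromJSON and its flatten argument, which collapses nested objects into dot-separated columns.

```r
library(jsonlite)

# Hypothetical input: an array of objects with partly different keys.
json_txt <- '[{"id": 1, "x": {"y": "e"}}, {"id": 2, "x": {"z": true}}]'

# flatten = TRUE turns the nested "x" object into top-level columns;
# keys missing from a record are filled with NA.
df <- fromJSON(json_txt, flatten = TRUE)
names(df)
nrow(df)
```

This covers nested objects; how well it suits deeply nested arrays depends on the data, so treat it as a starting point rather than a general solution.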

Answer 4:

Great answer with the flatten and getnames functions. It took me a few minutes to figure out all the options needed to get from a vector of JSON strings to a data.frame, so I thought I'd record that here. Suppose jsonvec is a vector of JSON strings. The following builds a data.frame (data.table) with one row per string, where each column corresponds to a different possible leaf node of the JSON tree. For any string missing a particular leaf node, that column is filled with NA.

library(data.table)
library(jsonlite)
parsed = lapply(jsonvec, fromJSON, simplifyVector=FALSE)
flattened = lapply(parsed, flatten) #using flatten from accepted answer
d = rbindlist(flattened, fill=TRUE)
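
For readers without data.table, the fill step can be approximated in base R. The sketch below is only an illustration under stated assumptions: bind_flat is a hypothetical helper (not part of any package), and it assumes every input is a flat named list of length-1 values, such as the output of flatten above. It takes the union of all names as the column set and fills absent leaves with NA.

```r
# Hypothetical helper: bind flat named lists into one data.frame,
# using the union of all names as the columns.
bind_flat <- function(flat_rows) {
    all_cols <- unique(unlist(lapply(flat_rows, names)))
    rows <- lapply(flat_rows, function(row) {
        # Look up every column by exact name; absent leaves become NA.
        vals <- lapply(all_cols, function(col) {
            v <- row[[col]]
            if (is.null(v)) NA else v
        })
        names(vals) <- all_cols
        as.data.frame(vals, stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

# Made-up flattened records with partly different leaf names.
flat_rows <- list(
    list(id = 1, x.y = "e"),
    list(id = 2, x.z = TRUE)
)
d_base <- bind_flat(flat_rows)
names(d_base)    # "id" "x.y" "x.z"
```

Building one small data.frame per record is slow for large inputs, so rbindlist remains the better choice when data.table is available.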

Source: https://stackoverflow.com/questions/11553592/r-generic-flattening-of-json-to-data-frame

Tags

json

dataframe

plyr

data.table