Read Json file into a data.frame without nested lists

前端未结

关注

 4  1031

清歌不尽 2020-11-30 06:50

I am trying to load a json file into a data.frame in r. I have had some luck with the fromJSON function in the jsonlite package - But am getting nested lists and am not sur

4条回答

不知归路 (楼主)

2020-11-30 07:15

So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.

First, I had to understand your data set. It appears to be information about health providers.

 providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE ) 
 types = sapply(providers,"[[","type")
 table(types)

 # FACILITY INDIVIDUAL 
 #    279       2977

FACILITY entries have the "ID" fields facility_name and facility_type.
INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
All entries have "ID" fields npi and last_updated_on.
All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.

Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tac on a similarly denormalized data frame with "ID" information using another merge.

library(dplyr)
unfurl_npi_data = function (x) {
  repeat_cols = c("plans","addresses")
  id_cols = setdiff(names(x),repeat_cols)
  repeat_data = x[repeat_cols]
  id_data  = x[id_cols]

  # Denormalized ID data
  id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
  atomic_colnames = names(which(!sapply(id_data, is.list)))
  df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
  colnames(id_data_df)[df_atomic_cols] = atomic_colnames

  # Join the plans and addresses (denormalized)
  repeated_data = lapply(repeat_data, rbind_all)
  repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])

  merge(id_data_df, repeated_data_crossed)
}

providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))

Then do some cleanup.

unique_df = function(x) {
  chr_col_names = names(which(sapply(x, class) == "character"))
  for( col in chr_col_names )
    x[[col]] = toupper(x[[col]])
  unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)

And now you can ask some interesting questions. For example, how many addresses does each health care provider have?

 unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
 num_addresses = unique_providers %>% count(first, middle, last, gender)
 table(num_addresses$n)

 #    1    2    3    4    5    6    7    8    9   12   13 
 # 2258  492  119   33   43   21    6    1    2    1    1

At addresses with more than five people, what is the percent of male healthcare providers?

address_pcts = unique_providers %>% 
  group_by(address, city, state) %>%
  filter(n()>5) %>%
  arrange(address) %>%
  summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)

And on and on...

0 讨论(0)

查看其它4个回答