I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function in the jsonlite package, but am getting nested lists and am not sur
So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
library(jsonlite)
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE )
types = sapply(providers,"[[","type")
table(types)
#   FACILITY INDIVIDUAL
#        279       2977
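In case the sapply(providers, "[[", "type") idiom is unfamiliar, here is a minimal sketch on made-up data. The toy_providers list and its values are invented for illustration and are not taken from the real file.

```r
# Toy stand-in for the parsed JSON; these records are invented.
toy_providers = list(
  list(npi = "1", type = "FACILITY"),
  list(npi = "2", type = "INDIVIDUAL"),
  list(npi = "3", type = "INDIVIDUAL")
)

# "[[" is an ordinary function, so sapply calls `[[`(record, "type")
# once per record and collects the results into a character vector.
toy_types = sapply(toy_providers, "[[", "type")

# Tabulate how many records there are of each type.
toy_counts = table(toy_types)
print(toy_counts)
```
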
FACILITY entries have the "ID" fields facility_name and facility_type. INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender. Both types appear to share the "ID" fields npi and last_updated_on, and both carry two nested fields, addresses and plans. For example, addresses is a list that contains city, state, etc.

Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll cross join the addresses and plans into a single data frame, so if there are 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tack on a similarly denormalized data frame with the "ID" information using another merge.
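To make the row arithmetic concrete, here is a small base R sketch of the same idea on an invented record with 2 addresses and 3 plans. The record, its field values, and the to_df helper are all hypothetical; only merge(..., by = NULL), which returns the Cartesian product of two data frames, is the technique used in the real code.

```r
# Toy nested record; every value here is invented for illustration.
record = list(
  npi = "0000000000",
  addresses = list(
    list(city = "Springfield", state = "IL"),
    list(city = "Shelbyville", state = "IL")
  ),
  plans = list(
    list(plan_id = "P1", network_tier = "PREFERRED"),
    list(plan_id = "P2", network_tier = "PREFERRED"),
    list(plan_id = "P3", network_tier = "STANDARD")
  )
)

# Hypothetical helper: turn a list of named lists into a data frame,
# one row per list element.
to_df = function(items) {
  do.call(rbind, lapply(items, as.data.frame, stringsAsFactors = FALSE))
}
addresses_df = to_df(record$addresses)
plans_df = to_df(record$plans)

# by = NULL asks merge() for the Cartesian product of the two data frames.
crossed = merge(addresses_df, plans_df, by = NULL)

# Attach the scalar ID field to every row of the crossed data.
crossed$npi = record$npi

print(nrow(crossed))  # 2 addresses * 3 plans
```

The price of this layout is redundancy: every address is repeated once per plan, which is why duplicates have to be removed again before counting anything.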
library(dplyr)
# Flatten one provider record (a nested list) into a denormalized data frame
unfurl_npi_data = function (x) {
  repeat_cols = c("plans","addresses")
  id_cols = setdiff(names(x),repeat_cols)
  repeat_data = x[repeat_cols]
  id_data = x[id_cols]
  # Denormalized ID data (by=NULL makes merge return the Cartesian product)
  id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
  # Restore the column names of the atomic (non-list) ID fields
  atomic_colnames = names(which(!sapply(id_data, is.list)))
  df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
  colnames(id_data_df)[df_atomic_cols] = atomic_colnames
  # Join the plans and addresses (denormalized)
  # bind_rows() is dplyr's replacement for the deprecated rbind_all()
  repeated_data = lapply(repeat_data, bind_rows)
  repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
  merge(id_data_df, repeated_data_crossed)
}
# Process FACILITY and INDIVIDUAL records separately, since their ID fields differ
providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) bind_rows(lapply(x, unfurl_npi_data)))
Then do some cleanup.
unique_df = function(x) {
  chr_col_names = names(which(sapply(x, class) == "character"))
  for( col in chr_col_names )
    x[[col]] = toupper(x[[col]])
  unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)
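The reason for uppercasing before calling unique() is that rows differing only by letter case would otherwise survive as separate entries. A minimal sketch on invented data, with a hypothetical helper named dedupe_ignoring_case that mirrors the cleanup step:

```r
# Hypothetical helper mirroring the cleanup step above:
# uppercase every character column, then drop duplicate rows.
dedupe_ignoring_case = function(df) {
  is_chr = vapply(df, is.character, logical(1))
  df[is_chr] = lapply(df[is_chr], toupper)
  unique(df)
}

# Invented rows: the first two differ only by letter case.
toy = data.frame(
  city = c("Boston", "BOSTON", "Salem"),
  state = c("MA", "ma", "MA"),
  stringsAsFactors = FALSE
)

rows_before = nrow(unique(toy))  # case differences keep all rows distinct
cleaned = dedupe_ignoring_case(toy)
rows_after = nrow(cleaned)       # the two Boston rows collapse into one
```
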
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
num_addresses = unique_providers %>% count(first, middle, last, gender)
table(num_addresses$n)
#    1    2    3    4    5    6    7    8    9   12   13
# 2258  492  119   33   43   21    6    1    2    1    1
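Note why the select() and unique() steps come before count(): the denormalized table has one row per address/plan combination, so counting rows directly would count plans as well. A small sketch with invented rows (the column names are simplified and the values are made up):

```r
library(dplyr)

# Invented denormalized rows: ANN LEE appears at "1 MAIN ST" twice
# only because she has two plans there.
toy_individuals = data.frame(
  first = c("ANN", "ANN", "ANN", "BOB"),
  last = c("LEE", "LEE", "LEE", "RAY"),
  address = c("1 MAIN ST", "1 MAIN ST", "9 OAK AVE", "1 MAIN ST"),
  plan = c("P1", "P2", "P1", "P1"),
  stringsAsFactors = FALSE
)

# Drop the plan column and de-duplicate, leaving one row per person/address.
toy_unique = toy_individuals %>% select(first, last, address) %>% unique()

# Now each remaining row is one address, so count() gives addresses per person.
toy_counts = toy_unique %>% count(first, last)
print(toy_counts)
```
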
At addresses with more than five people, what is the percent of male healthcare providers?
address_pcts = unique_providers %>%
  group_by(address, city, state) %>%
  filter(n()>5) %>%
  arrange(address) %>%
  summarise(pct_male = sum(gender=="MALE")/n())
library(ggplot2)
qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
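For anyone unsure what the grouped filter does, here is a toy version of the same pipeline with a lower threshold. The data are invented, and mean(gender == "MALE") is used as an equivalent shorthand for sum(gender == "MALE")/n(). Note that the result is a proportion between 0 and 1 rather than a percentage.

```r
library(dplyr)

# Invented rows: three providers at one address, one at another.
toy = data.frame(
  address = c("1 MAIN ST", "1 MAIN ST", "1 MAIN ST", "9 OAK AVE"),
  gender = c("MALE", "FEMALE", "MALE", "FEMALE"),
  stringsAsFactors = FALSE
)

toy_pcts = toy %>%
  group_by(address) %>%
  # Inside a grouped filter, n() is the group size, so this keeps
  # whole groups with more than two rows and drops the rest.
  filter(n() > 2) %>%
  # One output row per remaining group: the share of rows that are MALE.
  summarise(pct_male = mean(gender == "MALE"))
print(toy_pcts)
```
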
And on and on...