Read JSON file into a data.frame without nested lists

清歌不尽 2020-11-30 06:50

I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function in the jsonlite package, but I am getting nested lists and am not sur

4 answers
  •  不知归路
    2020-11-30 07:15

    This isn't strictly a solution, since it doesn't directly answer the question, but here is how I would analyze this data.

    First, I had to understand your data set. It appears to be information about health providers.

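     library(jsonlite)
     # Keep the raw nested list structure instead of letting jsonlite build a data frame
     providers <- fromJSON("http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyDataFrame = FALSE)
     # Extract the "type" field of every record and tabulate it
     types = sapply(providers, "[[", "type")
     table(types)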
    
     # FACILITY INDIVIDUAL 
     #    279       2977 
    
    • FACILITY entries have the "ID" fields facility_name and facility_type.
    • INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
    • All entries have "ID" fields npi and last_updated_on.
    • All entries have two nested fields: addresses and plans. For example addresses is a list that contains city, state, etc.
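
    To make that structure concrete, here is a minimal sketch with two invented records shaped like the description above. Every value, and the field name plan_id, is hypothetical and not taken from the real file; the point is only to show how the nested fields can be told apart from the "ID" fields.

    ```r
    # Two hypothetical records mimicking the shape described above (all values invented)
    toy_providers = list(
      list(npi = "1", type = "FACILITY", facility_name = "Example Clinic",
           addresses = list(list(city = "Austin", state = "TX")),
           plans = list(list(plan_id = "P1"))),
      list(npi = "2", type = "INDIVIDUAL", gender = "Female",
           addresses = list(list(city = "Austin", state = "TX"),
                            list(city = "Dallas", state = "TX")),
           plans = list(list(plan_id = "P1"), list(plan_id = "P2")))
    )

    # Pull the "type" element out of every record, as in the sapply() call above
    toy_types = sapply(toy_providers, "[[", "type")
    table(toy_types)

    # In this toy record, the fields that hold lists are the repeated (nested) ones
    nested_fields = names(which(sapply(toy_providers[[2]], is.list)))
    nested_fields
    ```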

    Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll cross join the addresses and plans into a single data frame, so if a provider has 4 addresses and 8 plans, there will be 4*8=32 rows in the joined data frame. Finally, I'll tack on a similarly denormalized data frame with the "ID" information using another merge.
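
    The row multiplication can be checked on toy data. This is a small sketch with made-up addresses and plans (plan_id is a hypothetical column name); it relies on base R's merge() returning the Cartesian product when by = NULL.

    ```r
    # Toy data standing in for one provider's nested fields (values are made up)
    addresses = data.frame(city = c("AUSTIN", "DALLAS"), state = c("TX", "TX"))
    plans     = data.frame(plan_id = c("P1", "P2", "P3"))

    # by = NULL asks merge() for the Cartesian product of the two data frames
    crossed = merge(addresses, plans, by = NULL)
    nrow(crossed)  # 2 addresses * 3 plans = 6 rows
    ```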

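    library(dplyr)
    unfurl_npi_data = function (x) {
      # Split one provider record into its repeated (nested) fields and its ID fields
      repeat_cols = c("plans","addresses")
      id_cols = setdiff(names(x),repeat_cols)
      repeat_data = x[repeat_cols]
      id_data  = x[id_cols]
    
      # Denormalized ID data: cross join every ID field into one data frame
      id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
      # Restore the names of the atomic (non-list) fields, which merge() does not carry over
      atomic_colnames = names(which(!sapply(id_data, is.list)))
      df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
      colnames(id_data_df)[df_atomic_cols] = atomic_colnames
    
      # Join the plans and addresses (denormalized)
      # bind_rows() is used here because dplyr deprecated rbind_all() in its favour
      repeated_data = lapply(repeat_data, bind_rows)
      # Assuming plans and addresses share no column names, merge() yields their cross product
      repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])
    
      merge(id_data_df, repeated_data_crossed)
    }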
    
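    # Process FACILITY and INDIVIDUAL records separately, since their ID fields differ
    providers2 = split(providers, types)
    # bind_rows() is used here because dplyr deprecated rbind_all() in its favour
    providers3 = lapply(providers2, function(x) bind_rows(lapply(x, unfurl_npi_data)))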
    

    Then do some cleanup: upper-case every character column and drop duplicate rows.

    unique_df = function(x) {
      chr_col_names = names(which(sapply(x, class) == "character"))
      for( col in chr_col_names )
        x[[col]] = toupper(x[[col]])
      unique(x)
    }
    providers3 = lapply(providers3, unique_df)
    facilities = providers3[["FACILITY"]]
    individuals = providers3[["INDIVIDUAL"]]
    rm(providers, providers2, providers3)
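
    As a quick sanity check of the cleanup step, here is a sketch that applies the same unique_df helper to a tiny invented data frame. The helper is restated so the snippet runs on its own, and the city and state values are made up.

    ```r
    # Same helper as above, restated so this snippet is self-contained
    unique_df = function(x) {
      chr_col_names = names(which(sapply(x, class) == "character"))
      for( col in chr_col_names )
        x[[col]] = toupper(x[[col]])
      unique(x)
    }

    # Rows 1 and 2 differ only in letter case (values are invented)
    toy = data.frame(
      city  = c("Austin", "AUSTIN", "Dallas"),
      state = c("tx", "TX", "TX"),
      stringsAsFactors = FALSE
    )

    cleaned = unique_df(toy)
    nrow(cleaned)  # 2: the two Austin rows collapse into one after upper-casing
    ```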
    

    And now you can ask some interesting questions. For example, how many addresses does each health care provider have?

     unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
     num_addresses = unique_providers %>% count(first, middle, last, gender)
     table(num_addresses$n)
    
     #    1    2    3    4    5    6    7    8    9   12   13 
     # 2258  492  119   33   43   21    6    1    2    1    1 
    

    At addresses with more than five people, what percentage of the healthcare providers are male?

    address_pcts = unique_providers %>% 
      group_by(address, city, state) %>%
      filter(n()>5) %>%
      arrange(address) %>%
      summarise(pct_male = sum(gender=="MALE")/n())
    library(ggplot2)
    qplot(address_pcts$pct_male, binwidth=1/7) + xlim(0,1)
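
    The group, filter, summarise pattern used above can be tried on a few invented rows. This sketch assumes dplyr is installed; the addresses and genders are made up, and only one of the two toy addresses has more than five providers.

    ```r
    library(dplyr)

    # Made-up rows: one address with six providers, one with two
    toy = data.frame(
      address = c(rep("1 MAIN ST", 6), rep("2 OAK AVE", 2)),
      gender  = c("MALE", "MALE", "FEMALE", "MALE", "FEMALE", "FEMALE", "MALE", "MALE"),
      stringsAsFactors = FALSE
    )

    toy_pcts = toy %>%
      group_by(address) %>%
      filter(n() > 5) %>%    # keep only addresses with more than five rows
      summarise(pct_male = sum(gender == "MALE") / n())

    toy_pcts  # one row: "1 MAIN ST" with pct_male = 3/6 = 0.5
    ```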
    

    And on and on...
