Manipulating variables to produce a new dataset in R

Submitted by ╄→尐↘猪︶ㄣ on 2020-01-15 03:44:24

Question


I'm a relatively new R user. I would really appreciate any help with my dataset please.

I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.

Some patients appear in the dataset more than once (i.e., they have picked up medications from different pharmacies at different time points).

The data frame looks like this:

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2))

From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.

For example, Tom's top pharmacy is Pharmacy B, because he picked up 13 medications there in total (5 + 8 meds). The dataset I would like to generate:

data.frame(name = c("Tom", "Rob",  "Amy"), 
           pharmacy = c("B", "B", "C"), 
           meds = c(13, 2, 2))

Can someone please help me write code to do this? I have tried various tools in R, such as the dplyr and tidyr packages and aggregate(), with no success. Any help would be genuinely appreciated.

Thank you very much

Alex


Answer 1:


Your question is not reproducible. But here is one solution:

# create reproducible example of data 
dataset1 <- data.frame( 
name = c("Tom", "Rob", "Tom", "Tom", "Amy"), 
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),  
meds_count = c(3, 2, 5, 8, 2))

library(dplyr) #load dplyr

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%                # group by your grouping variables
  summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
  top_n(1, meds_count) %>%                    # filter for only the top 1 count
  ungroup()

Resulting dataframe:

> dataset2
# A tibble: 3 x 3
  name  pharmacy   meds_count
  <fct> <fct>           <dbl>
1 Amy   pharmacy_C       2.00
2 Rob   pharmacy_B       2.00
3 Tom   pharmacy_B      13.0 
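One caveat worth sketching: the question asks for exactly ONE pharmacy per patient, and top_n() keeps every tied row when a patient has the same top total at two pharmacies. The example below is only an illustration under assumptions: the tied data is hypothetical (it is not from the question), and it relies on slice_max() and the .groups argument, which need dplyr 1.0.0 or later. Which of the tied pharmacies survives is not something the question specifies, so treat that choice as arbitrary.

```r
library(dplyr)

# Hypothetical data with a tie (not from the question):
# Tom has 5 meds at both pharmacy_A and pharmacy_B.
tied <- data.frame(
  name       = c("Tom", "Tom", "Rob"),
  pharmacy   = c("pharmacy_A", "pharmacy_B", "pharmacy_B"),
  meds_count = c(5, 5, 2)
)

one_each <- tied %>%
  group_by(name, pharmacy) %>%
  summarise(meds_count = sum(meds_count), .groups = "drop") %>% # total per patient/pharmacy
  group_by(name) %>%
  slice_max(meds_count, n = 1, with_ties = FALSE) %>%           # keep exactly one row per patient
  ungroup()

print(one_each)
```

If ties should instead be reported rather than dropped, leaving with_ties at its default of TRUE keeps all tied rows, which matches what top_n() does.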



Answer 2:


If I understood you correctly, I think you're looking for something like this.

library(tidyverse)
#Sample data. I copied yours. 
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2))

Edit: I changed the group_by() and summarise() calls and added a filter() step.

df %>% 
  group_by(name, pharmacy) %>%
  summarise(SumMeds = sum(meds, na.rm = TRUE)) %>% 
  filter(SumMeds == max(SumMeds))

Results:

  name  pharmacy SumMeds
  <fct> <fct>      <dbl>
1 Amy   C             2.
2 Rob   B             2.
3 Tom   B            13.



Answer 3:


Generating your dataset:

patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)

df is your dataframe

library(dplyr)

df = df %>%
  group_by(patient, pharmacy) %>%
  summarize(meds = sum(meds)) %>%
  group_by(patient) %>%
  filter(meds == max(meds))

  • Take your df, group by patient and pharmacy
  • calculate total medicines bought by each patient from different pharmacies by taking the sum of medicines.
  • Then group_by patient
  • Finally filter for max.

Print the dataframe

print(df)




Answer 4:


You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.

agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum)  # total meds per patient/pharmacy
agg2 <- aggregate(meds ~ name, agg, FUN = max)           # highest total per patient
merge(agg, agg2)[c(1, 3, 2)]
#  name pharmacy meds
#1  Amy        C    2
#2  Rob        B    2
#3  Tom        B   13

Data.
This is the dataset in the question after the edit.

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
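This answer wonders whether other approaches might run more quickly on 24 million rows. As one possibility that the thread itself does not cover, the same two-step logic (sum per patient and pharmacy, then keep the largest total per patient) could be sketched with the data.table package. This is an assumption about what might suit a large table, not a benchmarked recommendation; the names dt, totals and top are made up for the illustration.

```r
library(data.table)

# Same sample data as in the question, as a data.table
dt <- data.table(
  name     = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("A", "B", "B", "B", "C"),
  meds     = c(3, 2, 5, 8, 2)
)

# Step 1: total meds per patient/pharmacy pair
totals <- dt[, .(meds = sum(meds)), by = .(name, pharmacy)]

# Step 2: row index of the largest total within each patient,
# then subset the totals table with those indices
top_rows <- totals[, .I[which.max(meds)], by = name]$V1
top <- totals[top_rows]

print(top)
```

Because which.max() returns a single position, this keeps exactly one pharmacy per patient even when two pharmacies tie.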



Answer 5:


Assuming the following dataset:

library(tidyverse)  # provides tribble(), str_extract() and the dplyr verbs

df <- tribble(
  ~patient, ~pharmacy, ~medication,  
  "Tom", "Pharmacy A", "3 meds",
  "Rob", "Pharmacy B", "2 meds",
  "Tom", "Pharmacy B", "5 meds",
  "Tom", "Pharmacy B", "8 meds",
  "Amy", "Pharmacy C", "2 meds"
)

A tidyverse-friendly option could be:

df %>% 
  mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>%  # 1 ("+" so multi-digit counts such as "13 meds" are read in full)
  group_by(patient, pharmacy) %>%  # 2
  mutate(med_sum = sum(med_n)) %>%  # 3
  group_by(patient) %>%  # 4
  filter(med_sum == max(med_sum)) %>%  # 5
  select(patient, pharmacy, med_sum) %>%  # 6
  distinct() # 7
  1. create a numeric variable as you can't add strings
  2. among all patient / pharmacy couples
  3. find the total number of medications
  4. then among all patients
  5. keep only pharmacies with the highest patient / pharm totals
  6. discard useless variables
  7. discard duplicated lines (several lines per patient / pharmacy couple)
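Steps 3, 6 and 7 above can arguably be collapsed into one, because summarise() already returns a single row per patient / pharmacy couple and drops the unused columns. The sketch below is one way to write that, under two assumptions that are not in the answer: it uses readr::parse_number() in place of the regular expression, and it relies on the .groups argument, which needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(tibble)
library(readr)

df <- tribble(
  ~patient, ~pharmacy,    ~medication,
  "Tom",    "Pharmacy A", "3 meds",
  "Rob",    "Pharmacy B", "2 meds",
  "Tom",    "Pharmacy B", "5 meds",
  "Tom",    "Pharmacy B", "8 meds",
  "Amy",    "Pharmacy C", "2 meds"
)

result <- df %>%
  mutate(med_n = parse_number(medication)) %>%               # "3 meds" -> 3
  group_by(patient, pharmacy) %>%
  summarise(med_sum = sum(med_n), .groups = "drop_last") %>% # one row per couple, still grouped by patient
  filter(med_sum == max(med_sum)) %>%                        # keep the highest total per patient
  ungroup()

print(result)
```

As with the original pipeline, a patient whose top total is tied across two pharmacies would still get two rows here.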


Source: https://stackoverflow.com/questions/50480860/manipulating-variables-to-produce-a-new-dataset-in-r
