Question

I'm a relatively new R user. I would really appreciate any help with my dataset please.

I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.

Some patients appear in the dataset more than once (i.e., they have picked up medications from different pharmacies at different time points).

The data frame looks like this:

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2))

From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.

For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:

data.frame(name = c("Tom", "Rob",  "Amy"), 
           pharmacy = c("B", "B", "C"), 
           meds = c(13, 2, 2))

Can someone please help me write code to do this? I have tried various tools in R, such as the dplyr and tidyr packages and aggregate(), with no success. Any help would be genuinely appreciated.

Thank you very much

Alex

Answer 1:

Your question is not reproducible. But here is one solution:

# create reproducible example of data 
dataset1 <- data.frame( 
name = c("Tom", "Rob", "Tom", "Tom", "Amy"), 
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),  
meds_count = c(3, 2, 5, 8, 2))

library(dplyr) #load dplyr

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%                 # group by patient and pharmacy
  summarise(meds_count = sum(meds_count)) %>%  # total meds per patient/pharmacy pair; the result stays grouped by name
  top_n(1, meds_count) %>%                     # within each patient, keep the pharmacy with the highest total
  ungroup()

Resulting dataframe:

> dataset2
# A tibble: 3 x 3
  name  pharmacy   meds_count
  <fct> <fct>           <dbl>
1 Amy   pharmacy_C       2.00
2 Rob   pharmacy_B       2.00
3 Tom   pharmacy_B      13.0
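
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). The following is a minimal sketch of the same pipeline written that way, assuming dplyr >= 1.0.0 is installed. Setting with_ties = FALSE is my own choice here, not part of the original answer: it guarantees exactly one row per patient even when two pharmacies share the highest total.

```r
library(dplyr)

# Same example data as above
dataset1 <- data.frame(
  name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
  meds_count = c(3, 2, 5, 8, 2)
)

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%
  summarise(meds_count = sum(meds_count), .groups = "drop") %>%  # one row per patient/pharmacy pair
  group_by(name) %>%
  slice_max(meds_count, n = 1, with_ties = FALSE) %>%            # exactly one pharmacy per patient
  ungroup()

print(dataset2)
```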

Answer 2:

If I understood you correctly, I think you're looking for something like this.

library(tidyverse)
# Sample data, copied from the question.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2))

Edit: I changed the group_by() and summarise() calls and added filter().

df %>% 
  group_by(name, pharmacy) %>%
  summarise(SumMeds = sum(meds, na.rm = TRUE)) %>% 
  filter(SumMeds == max(SumMeds))

Results:

  name  pharmacy SumMeds
  <fct> <fct>      <dbl>
1 Amy   C             2.
2 Rob   B             2.
3 Tom   B            13.
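
One thing to watch with filter(SumMeds == max(SumMeds)): if a patient has the same top total at two pharmacies, both rows are kept, so the result is no longer one pharmacy per patient. The sketch below illustrates this with made-up data (the patient "Ann" is hypothetical, not from the question) and shows one possible tie-break, picking the alphabetically first pharmacy. Which tie-break is appropriate depends on your data.

```r
library(dplyr)

# Hypothetical data: Ann's total is 4 at pharmacy A and 4 (1 + 3) at pharmacy B
tied <- data.frame(name = c("Ann", "Ann", "Ann"),
                   pharmacy = c("A", "B", "B"),
                   meds = c(4, 1, 3))

totals <- tied %>%
  group_by(name, pharmacy) %>%
  summarise(SumMeds = sum(meds, na.rm = TRUE), .groups = "drop_last")  # still grouped by name

# filter() keeps every row that equals the maximum, so both pharmacies survive
kept_ties <- totals %>%
  filter(SumMeds == max(SumMeds))

# Sorting within each patient and taking the first row forces a single pharmacy
one_each <- totals %>%
  arrange(desc(SumMeds), pharmacy, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

print(kept_ties)
print(one_each)
```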

Answer 3:

Generating your dataset:

patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)

df is your dataframe

library(dplyr)

df = df %>%
  group_by(patient, pharmacy) %>%
  summarize(meds = sum(meds)) %>%
  group_by(patient) %>%
  filter(meds == max(meds))

Take your df and group it by patient and pharmacy.
Calculate the total medicines each patient bought from each pharmacy by summing meds.
Then group by patient only.
Finally, filter for the rows where the total equals that patient's maximum.

Print the dataframe

print(df)

Answer 4:

You can do it in base R with aggregate twice followed by merge.
It seems to me a bit complicated to have to use aggregate twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.

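# total meds per patient/pharmacy pair
agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum)
# highest total per patient
agg2 <- aggregate(meds ~ name, agg, FUN = max)
# keep the pairs that match each patient's maximum, then reorder the columns
merge(agg, agg2)[c(1, 3, 2)]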
#  name pharmacy meds
#1  Amy        C    2
#2  Rob        B    2
#3  Tom        B   13

Data.
This is the dataset in the question after the edit.

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom",  "Amy"), 
                 pharmacy = c("A", "B", "B", "B", "C"), 
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
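
With 24 million rows, the data.table package may also be worth trying. The following is only a sketch, assuming data.table is installed; it has not been benchmarked against the other answers, so treat any speed benefit as something to verify on your own data. Note that which.max() returns the first maximum, so ties are resolved in favour of the pharmacy that appears first.

```r
library(data.table)

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
                 pharmacy = c("A", "B", "B", "B", "C"),
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)

dt <- as.data.table(df)

# total meds per patient/pharmacy pair
totals <- dt[, .(meds = sum(meds)), by = .(name, pharmacy)]

# within each patient, keep the single row with the highest total
top <- totals[, .SD[which.max(meds)], by = name]

print(top)
```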

Answer 5:

Assuming the following dataset:

library(tidyverse) # provides tribble(), the dplyr verbs and str_extract()

df <- tribble(
  ~patient, ~pharmacy, ~medication,  
  "Tom", "Pharmacy A", "3 meds",
  "Rob", "Pharmacy B", "2 meds",
  "Tom", "Pharmacy B", "5 meds",
  "Tom", "Pharmacy B", "8 meds",
  "Amy", "Pharmacy C", "2 meds"
)

A tidyverse-friendly option could be:

df %>% 
  mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>%  # 1
  group_by(patient, pharmacy) %>%  # 2
  mutate(med_sum = sum(med_n)) %>%  # 3
  group_by(patient) %>%  # 4
  filter(med_sum == max(med_sum)) %>%  # 5
  select(patient, pharmacy, med_sum) %>%  # 6
  distinct() # 7

1. create a numeric variable, as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharm totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)
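
A note on step 1: the pattern used to pull the number out of the string matters once counts reach two digits. A quick check, assuming stringr is installed and using a made-up value of "13 meds", shows that a single-digit character class captures only the first digit, while adding the + quantifier captures the whole number.

```r
library(stringr)

# "[0-9]" matches exactly one digit, so only the leading "1" is captured
single <- str_extract("13 meds", "[0-9]")

# "[0-9]+" matches a run of digits, so the full count is captured
multi <- str_extract("13 meds", "[0-9]+")

print(single)
print(as.numeric(multi))
```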

Source: https://stackoverflow.com/questions/50480860/manipulating-variables-to-produce-a-new-dataset-in-r

Tags

dplyr

tidyr

Manipulating variables to produce a new dataset in R