Question
I'm a relatively new R user. I would really appreciate any help with my dataset please.
I have a dataset with 24 million rows. There are 3 variables in the dataset: patient name, pharmacy name, and count of medications picked up from the pharmacy at that visit.
Some patients appear in the dataset more than once (i.e., they have picked up medications from different pharmacies at different time points).
The data frame looks like this:
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
From this data I want to generate a new dataset, which has ONE pharmacy for each patient. This pharmacy needs to be the one where the patient has picked up the highest number of medications.
For example: for Tom his most frequent pharmacy is Pharmacy B because he has picked up 13 medications from there (5+8 meds). The dataset I would like to generate:
data.frame(name = c("Tom", "Rob", "Amy"),
pharmacy = c("B", "B", "C"),
meds = c(13, 2, 2))
Can someone please help me with writing code to do this?
I have tried various tools in R, such as dplyr, tidyr, and aggregate(), with no success. Any help would be genuinely appreciated.
Thank you very much
Alex
Answer 1:
Your question is not reproducible. But here is one solution:
# create reproducible example of data
dataset1 <- data.frame(
name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
meds_count = c(3, 2, 5, 8, 2))
library(dplyr) #load dplyr
dataset2 <- dataset1 %>% group_by(name, pharmacy) %>% # group by your grouping variables
summarise(meds_count = sum(meds_count)) %>% # sum no. of meds by your grouping variables
top_n(1, meds_count) %>% # filter for only the top 1 count
ungroup()
Resulting dataframe:
> dataset2
# A tibble: 3 x 3
name pharmacy meds_count
<fct> <fct> <dbl>
1 Amy pharmacy_C 2.00
2 Rob pharmacy_B 2.00
3 Tom pharmacy_B 13.0
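Since dplyr 1.0.0, top_n() has been superseded by slice_max(). A minimal sketch of the same pipeline with slice_max(), assuming dplyr 1.0.0 or later is installed; with_ties = FALSE is my addition so that a patient whose top pharmacies are tied still gets exactly one row:

```r
library(dplyr)

dataset1 <- data.frame(
  name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
  pharmacy = c("pharmacy_A", "pharmacy_B", "pharmacy_B", "pharmacy_B", "pharmacy_C"),
  meds_count = c(3, 2, 5, 8, 2))

dataset2 <- dataset1 %>%
  group_by(name, pharmacy) %>%
  # total meds per patient/pharmacy pair; drop the grouping afterwards
  summarise(meds_count = sum(meds_count), .groups = "drop") %>%
  group_by(name) %>%
  # keep the single largest total per patient (first row wins on ties)
  slice_max(meds_count, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Dropping with_ties = FALSE gives the same tie behaviour as top_n(), which returns every tied row.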
Answer 2:
If I understood you correctly, I think you're looking for something like this.
library(tidyverse) # library() stops with an error if the package is missing; require() only warns
#Sample data. I copied yours.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2))
Edit. I changed the group_by(), summarise() and added filter.
df %>%
group_by(name, pharmacy) %>%
summarise(SumMeds = sum(meds, na.rm = TRUE)) %>%
filter(SumMeds == max(SumMeds))
Results:
name pharmacy SumMeds
<fct> <fct> <dbl>
1 Amy C 2.
2 Rob B 2.
3 Tom B 13.
Answer 3:
Generating your dataset:
patient = c("Tom","Rob","Tom","Tom","Amy")
pharmacy = c("A","B","B","B","C")
meds = c(3,2,5,8,2)
df = data.frame(patient,pharmacy,meds)
df is your dataframe
library(dplyr)
df = df %>% group_by(patient,pharmacy) %>%
summarize(meds =sum(meds)) %>%
group_by(patient) %>%
filter(meds == max(meds))
- Take your df, group by patient and pharmacy
- calculate total medicines bought by each patient from different pharmacies by taking the sum of medicines.
- Then group_by patient
- Finally filter for max.
Print the dataframe
print(df)
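One caveat about the last step: filter(meds == max(meds)) keeps every row that equals the maximum, so a patient with two equally used pharmacies ends up with two rows. A small sketch of the difference, using made-up tied data (the values below are hypothetical, not from the question):

```r
library(dplyr)

# Hypothetical data: Tom has picked up 5 medications from each of two pharmacies
tied <- data.frame(patient = c("Tom", "Tom", "Amy"),
                   pharmacy = c("A", "B", "C"),
                   meds = c(5, 5, 2))

totals <- tied %>%
  group_by(patient, pharmacy) %>%
  summarise(meds = sum(meds), .groups = "drop")

# filter() keeps all rows equal to the maximum, so Tom appears twice
keep_ties <- totals %>%
  group_by(patient) %>%
  filter(meds == max(meds)) %>%
  ungroup()

# slice(which.max()) keeps only the first maximum per patient
one_each <- totals %>%
  group_by(patient) %>%
  slice(which.max(meds)) %>%
  ungroup()
```

Which of the tied pharmacies should win is a decision for the analyst; here it is simply the first one in row order.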
Answer 4:
You can do it in base R with aggregate() twice followed by merge().
It seems to me a bit complicated to have to use aggregate() twice. Maybe dplyr solutions run more quickly, especially with a dataset with 24 million rows.
agg <- aggregate(meds ~ name + pharmacy, df, FUN = sum) # total meds per patient/pharmacy pair
agg2 <- aggregate(meds ~ name, agg, FUN = max) # highest total per patient
merge(agg, agg2)[c(1, 3, 2)] # keep the matching rows, then reorder the columns
# name pharmacy meds
#1 Amy C 2
#2 Rob B 2
#3 Tom B 13
Data.
This is the dataset in the question after the edit.
df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
pharmacy = c("A", "B", "B", "B", "C"),
meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)
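For a table with 24 million rows, the data.table package is another option that is often used for grouped sums of this size. This is a different technique from the aggregate() approach above, and I have not benchmarked it; it is only a sketch assuming data.table is installed:

```r
library(data.table)

df <- data.frame(name = c("Tom", "Rob", "Tom", "Tom", "Amy"),
                 pharmacy = c("A", "B", "B", "B", "C"),
                 meds = c(3, 2, 5, 8, 2), stringsAsFactors = FALSE)

dt <- as.data.table(df)

# Total meds per patient/pharmacy pair
totals <- dt[, .(meds = sum(meds)), by = .(name, pharmacy)]

# Row number of the largest total within each patient (first one on ties)
idx <- totals[, .I[which.max(meds)], by = name]$V1

result <- totals[idx]
```

Because which.max() returns only the first maximum, result has exactly one row per patient.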
Answer 5:
Assuming the following dataset:
library(tidyverse) # provides tribble(), the dplyr verbs and str_extract()
df <- tribble(
~patient, ~pharmacy, ~medication,
"Tom", "Pharmacy A", "3 meds",
"Rob", "Pharmacy B", "2 meds",
"Tom", "Pharmacy B", "5 meds",
"Tom", "Pharmacy B", "8 meds",
"Amy", "Pharmacy C", "2 meds"
)
A tidyverse-friendly option could be:
df %>%
  mutate(med_n = as.numeric(str_extract(medication, "[0-9]+"))) %>% # 1 ("+" captures multi-digit counts)
  group_by(patient, pharmacy) %>% # 2
  mutate(med_sum = sum(med_n)) %>% # 3
  group_by(patient) %>% # 4
  filter(med_sum == max(med_sum)) %>% # 5
  select(patient, pharmacy, med_sum) %>% # 6
  distinct() # 7
1. create a numeric variable as you can't add strings
2. among all patient / pharmacy couples
3. find the total number of medications
4. then among all patients
5. keep only pharmacies with the highest patient / pharm totals
6. discard useless variables
7. discard duplicated lines (several lines per patient / pharmacy couple)
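The first step depends on the pattern capturing every digit of the count. A minimal sketch of how a single-digit pattern and a one-or-more pattern differ, using a made-up two-digit value that does not appear in the sample data:

```r
library(stringr)

# Hypothetical value with a two-digit count
x <- "13 meds"

first_digit  <- str_extract(x, "[0-9]")   # "1": only the first digit
whole_number <- str_extract(x, "[0-9]+")  # "13": the full run of digits

med_n <- as.numeric(whole_number)         # 13
```

With single-digit counts both patterns give the same answer, so the difference only shows up on larger counts.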
Source: https://stackoverflow.com/questions/50480860/manipulating-variables-to-produce-a-new-dataset-in-r