I want to summarise the percentage of people who have been treated, by region.
I have created a dummy dataset for this purpose:
id <- 1:1000
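The rest of the dummy dataset is not shown in the post. As a minimal sketch, a data frame consistent with the outputs in the answers (five regions of 200 rows, half of each on treatment 1) could look like the following. The name d and the region and treatment columns are assumptions inferred from the answers' code, not the original data.

```r
# Hypothetical reconstruction: only `id` appears in the post; `region` and
# `treatment` are assumed from the column names used in the answers.
d <- data.frame(
  id        = 1:1000,
  region    = rep(c("A", "B", "C", "D", "E"), each = 200),
  treatment = rep(c(1, 2), times = 500)   # alternates 1, 2, 1, 2, ...
)

# Quick check of the layout: rows per region, and treatment by region
table(d$region)
table(d$treatment, d$region)
```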
You could also use data.table:
library(data.table)
setDT(d)[, .(.N, prop = sum(treatment == 2) / .N), by = region]
   region   N prop
1:      A 200  0.5
2:      B 200  0.5
3:      C 200  0.5
4:      D 200  0.5
5:      E 200  0.5
For completeness, here's how you can do it using ddply() from plyr:
library(plyr)
ddply(d[!is.na(d$id), ], .(region), summarize,
      N = length(region),
      prop = mean(treatment == 1))
#   region   N prop
# 1      A 200  0.5
# 2      B 200  0.5
# 3      C 200  0.5
# 4      D 200  0.5
# 5      E 200  0.5
This assumes that you want to handle NA values in id by dropping those observations.
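To see what the !is.na(d$id) filter does, here is a small base-R sketch on a toy data frame (the name toy and its values are made up for illustration). Rows with a missing id are dropped before the per-region count and proportion are computed, so they affect neither N nor prop.

```r
# Toy data, invented for illustration: one row in region A has a missing id
toy <- data.frame(
  id        = c(1, 2, NA, 4, 5, 6),
  region    = c("A", "A", "A", "B", "B", "B"),
  treatment = c(1, 2, 1, 1, 1, 2)
)

# Drop observations whose id is NA, as in the ddply() call above
kept <- toy[!is.na(toy$id), ]

# Per-region row count and share of rows on treatment 1
N    <- tapply(kept$treatment, kept$region, length)
prop <- tapply(kept$treatment == 1, kept$region, mean)

res <- data.frame(region = names(N), N = as.vector(N), prop = as.vector(prop))
res
```

Region A ends up with N = 2 rather than 3, because the NA row is removed first.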
A dplyr solution:
library(dplyr)
d %>% group_by(region) %>% summarize(NumPat = n(), prop = sum(treatment == 1) / n())
Here we group by region, then pipe the grouped data to summarize, which counts the patients in each group and computes the proportion of them that received treatment 1.
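The sum(treatment == 1) / n() form used here and the mean(treatment == 1) form used in the plyr answer compute the same quantity. A short base-R sketch on made-up toy vectors shows the equivalence, plus the conversion to percentages that the question asks for:

```r
# Toy vectors, invented for illustration
region    <- c("A", "A", "A", "A", "B", "B")
treatment <- c(1, 1, 1, 2, 2, 2)

# Proportion on treatment 1 per region, written both ways
by_sum  <- tapply(treatment, region, function(x) sum(x == 1) / length(x))
by_mean <- tapply(treatment == 1, region, mean)

by_sum          # A = 0.75, B = 0
100 * by_mean   # the same values as percentages: A = 75, B = 0
```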
If I understand the question correctly, this can be done very easily (and quickly!) with table and prop.table:
prop.table(table(d$treatment, d$region))
This gives each cell as a proportion of the grand total (multiply by 100 for percentages). If you want row- or column-wise proportions instead, use the margin argument of prop.table:
prop.table(table(d$treatment, d$region), margin = 2) # column-wise: proportions within each region
prop.table(table(d$treatment, d$region), margin = 1) # row-wise: proportions within each treatment
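Since regions are the columns of this table, margin = 2 is the variant that answers "what share of each region was treated". A minimal sketch on made-up toy vectors (treatment codes 1 and 2 assumed, as in the other answers):

```r
# Toy vectors, invented for illustration
treatment <- c(1, 1, 2, 1, 2, 2, 2, 2)
region    <- c("A", "A", "A", "A", "B", "B", "B", "B")

tab      <- table(treatment, region)     # rows: treatment, columns: region
col_prop <- prop.table(tab, margin = 2)  # each column is scaled to sum to 1

colSums(col_prop)                  # A = 1, B = 1
round(100 * col_prop["1", ], 1)    # percent on treatment 1: A = 75, B = 0
```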