How do you summarize columns based on unique IDs without knowing IDs in R?

Question

I've been going through the posts about summarizing data, but haven't found what I'm looking for.

I want to create a summary count table that shows how often each medication was given to patients. It doesn't matter that some patients received several medications simultaneously: I simply want a summary of all medication given, and then to calculate what percentage of the total each medication class makes up. The problem is that I don't know the names of the possible medications in advance; they are "hidden" somewhere in the data.frame. So I first have to tell R which columns to look through to build a "list" of names, by which it can then summarize the columns.

I suspect this points towards the plyr package, but my attempts to use its functions correctly haven't worked so far.

My df looks something like this:

x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df<-data.frame(x,y,z)
head(df)
  x y z
1 a a f
2 a c d
3 b b e
4 c d b
5 a a b
6 c d d

As you can see, the data.frame contains three columns that share some letters and differ in others; each letter stands for the name of a medication given.

What I'd now like to do is create a list of unique characters,

unique(x)
unique(y)
unique(z)

which serves as my reference list by which R can then summarize the counts in each column.

summary(df)

returns the counts within each column, but not the count of each ID across all columns, and no percentage of the total.

I also tried the following, which goes roughly in the right direction, but ideally I'd like a list of unique characters that I can feed to the length argument:

ddply(df, .(x), summarize, counts=length(unique(y)))

Any idea how I could do this? Help much appreciated.

Answer 1:

If you just want a count for the whole data frame, you can use table(unlist(df)) (see also @goctlr's answer), and if you also want proportions: prop.table(table(unlist(df))). Getting the counts for the individual columns as well is more involved.

To get the count for each column and the total count, I wrote the following function:

# some reproducible data:
set.seed(1)
x <- sample(letters[1:4], 20, replace = TRUE)
y <- sample(letters[1:5], 20, replace = TRUE)
z <- sample(letters[1:6], 20, replace = TRUE)
df <- data.frame(x,y,z)

# the function
func <- function(x) {
  x2 <- data.frame()
  nms <- names(x)
  id <- sort(unique(unlist(x)))
  for(i in seq_along(id)) {       # one row per unique ID
    for(j in seq_along(nms)) {    # one column per original column
      x2[i,j] <- sum(x[,j] %in% id[i])
    }
  }
  names(x2) <- nms
  x2$total <- rowSums(x2)
  x2 <- cbind(id,x2)
  assign("dat", x2, envir = .GlobalEnv)
}

Executing the function with func(df) will give you a data frame dat in your global environment:

> dat
  id x y z total
1  a 4 4 3    11
2  b 5 5 2    12
3  c 5 4 4    13
4  d 6 4 5    15
5  e 0 3 5     8
6  f 0 0 1     1

After that, you can calculate the percentages with, for example, the dplyr package:

library(dplyr)
dat <- dat %>% mutate(xperc=round(100*x/sum(total),1),
                      yperc=round(100*y/sum(total),1),
                      zperc=round(100*z/sum(total),1),
                      perc=round(100*total/sum(total),1))

which results in:

> dat
  id x y z total xperc yperc zperc perc
1  a 4 4 3    11   6.7   6.7   5.0 18.3
2  b 5 5 2    12   8.3   8.3   3.3 20.0
3  c 5 4 4    13   8.3   6.7   6.7 21.7
4  d 6 4 5    15  10.0   6.7   8.3 25.0
5  e 0 3 5     8   0.0   5.0   8.3 13.3
6  f 0 0 1     1   0.0   0.0   1.7  1.7
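
As a side note, the same kind of count table can probably be built without nested loops and without writing to the global environment. The following is only a minimal base R sketch, not the code from the answer above: count_by_column and the small meds data frame are hypothetical names made up for illustration. It tabulates every column against one shared set of levels, so values missing from a column are counted as 0.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# per-column counts plus a row total, returned instead of assigned globally.
count_by_column <- function(d) {
  # all distinct values that occur anywhere in the data frame
  ids <- sort(unique(unlist(lapply(d, as.character))))
  # tabulate each column against the shared levels; absent values become 0
  per_col <- lapply(d, function(col) as.vector(table(factor(col, levels = ids))))
  counts <- do.call(cbind, per_col)   # matrix: one row per id, one column per input column
  out <- data.frame(id = ids, counts, stringsAsFactors = FALSE)
  out$total <- rowSums(counts)
  out
}

# Made-up example data, small enough to check by hand
meds <- data.frame(x = c("a", "a", "b"),
                   y = c("b", "c", "c"),
                   z = c("a", "d", "d"),
                   stringsAsFactors = FALSE)
res <- count_by_column(meds)
print(res)
```

Because the result is returned, it can be stored under any name, e.g. dat <- count_by_column(df), and then passed to the dplyr step shown above.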

Answer 2:

For a summary of counts for the whole data frame you can unlist the data frame and then call the table function:

table(unlist(df))

To get the percentage of total counts, save the result and use the prop.table function:

tout <- table(unlist(df))
prop.table(tout)
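
If a single table with both counts and percentages is more convenient, the two calls above can be combined. This is just a sketch under assumptions: meds is a made-up data frame standing in for df, and the column names id, n and percent are arbitrary choices.

```r
# Made-up example data standing in for df
meds <- data.frame(x = c("a", "a", "b"),
                   y = c("b", "c", "c"),
                   z = c("a", "d", "d"),
                   stringsAsFactors = FALSE)

counts <- table(unlist(meds))               # overall count per medication
pct <- round(100 * prop.table(counts), 1)   # share of all medication given, in percent

# one data frame with counts and percentages, most frequent medication first
overall <- data.frame(id = names(counts),
                      n = as.vector(counts),
                      percent = as.vector(pct),
                      stringsAsFactors = FALSE)
overall <- overall[order(-overall$n), ]
print(overall)
```

Note that prop.table returns proportions between 0 and 1, which is why the sketch multiplies by 100 before rounding.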

Source: https://stackoverflow.com/questions/26294297/how-do-you-summarize-columns-based-on-unique-ids-without-knowing-ids-in-r

Tags: count, plyr, dplyr, summary