Question
I have the following data frame in R:
id<-c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date<-c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent<-c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df<-data.frame(id,date,spent)
I need to take a random sample of 3 customers (based on id) such that all observations for each sampled customer are extracted.
Answer 1:
You want to use %in% and unique:
df[df$id %in% sample(unique(df$id),3),]
##    id     date spent
## 4   4 19970715   134
## 7   4 19980909   876
## 8   5 19990910   890
## 10  8 19991111   234
## 13  5 19970605    23
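As a minimal sketch of the approach above, the two steps (draw distinct ids, then keep every matching row) can be wrapped in a small function. The name sample_customers is hypothetical and not part of the original answer. It indexes into the id vector with sample.int() as a precaution, because sample(x, n) treats a single number x as 1:x.

```r
# Rebuild the data frame from the question.
id    <- c(1, 2, 3, 4, 10, 2, 4, 5, 6, 8, 2, 1, 5, 7, 7)
date  <- c(19970807, 19970902, 19971010, 19970715, 19991212,
           19961212, 19980909, 19990910, 19980707, 19991111,
           19970203, 19990302, 19970605, 19990808, 19990706)
spent <- c(1997, 19, 199, 134, 654, 37, 876, 890, 873, 234,
           643, 567, 23, 25, 576)
df <- data.frame(id, date, spent)

# Hypothetical helper: pick n distinct customers, return all of their rows.
sample_customers <- function(data, n) {
  ids <- unique(data$id)
  # Sample positions in ids, so a lone numeric id is never expanded to 1:id.
  chosen <- ids[sample.int(length(ids), n)]
  data[data$id %in% chosen, ]
}

picked <- sample_customers(df, 3)
print(picked)
```

The result always covers exactly 3 distinct ids, but the number of rows varies with how many observations those customers have.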
Using data.table to avoid $ referencing:
library(data.table)
DT <- data.table(df)
DT[id %in% sample(unique(id),3)]
##    id     date spent
## 1:  1 19970807  1997
## 2:  4 19970715   134
## 3:  4 19980909   876
## 4:  1 19990302   567
## 5:  7 19990808    25
## 6:  7 19990706   576
Inside DT[...], column names such as id are evaluated within the data.table itself, so no DT$ prefix is needed.
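Answer 2:
To draw 3 random rows, sample row positions rather than id values (sample(df$id, 3) would use the sampled ids as row numbers):
df[sample(nrow(df), 3), ]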
#   id     date spent
# 1  1 19970807  1997
# 5 10 19991212   654
# 8  5 19990910   890
Of course, your samples would be different.
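If a repeatable draw is needed, one option is to fix the random seed first. The sketch below assumes an arbitrary seed value of 123 (not from the original answer) and only shows that the same seed gives the same set of ids.

```r
# Rebuild the data frame from the question.
df <- data.frame(
  id    = c(1, 2, 3, 4, 10, 2, 4, 5, 6, 8, 2, 1, 5, 7, 7),
  date  = c(19970807, 19970902, 19971010, 19970715, 19991212,
            19961212, 19980909, 19990910, 19980707, 19991111,
            19970203, 19990302, 19970605, 19990808, 19990706),
  spent = c(1997, 19, 199, 134, 654, 37, 876, 890, 873, 234,
            643, 567, 23, 25, 576)
)

set.seed(123)                              # arbitrary seed, assumed for illustration
first_draw <- sample(unique(df$id), 3)

set.seed(123)                              # resetting the seed replays the same draw
second_draw <- sample(unique(df$id), 3)

identical(first_draw, second_draw)         # TRUE
```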
Update
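If you want unique customers, you can aggregate first, then sample row positions of the aggregated data frame (its ids are not the same as its row numbers, since id 10 sits in row 9):
df2 = aggregate(list(date = df$date, spent = df$spent), list(id = df$id), c)
df2[sample(nrow(df2), 3), ]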
#   id               date    spent
# 4  4 19970715, 19980909 134, 876
# 5  5 19990910, 19970605  890, 23
# 8  8           19991111      234
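The aggregated result keeps one row per customer, with dates and amounts packed into list-columns. If ordinary rows are preferred, a possible alternative sketch (an assumption, not part of the original answer) is to split the data frame by id, sample the groups, and bind them back together. The variable names by_customer, chosen, and result are illustrative.

```r
# Rebuild the data frame from the question.
df <- data.frame(
  id    = c(1, 2, 3, 4, 10, 2, 4, 5, 6, 8, 2, 1, 5, 7, 7),
  date  = c(19970807, 19970902, 19971010, 19970715, 19991212,
            19961212, 19980909, 19990910, 19980707, 19991111,
            19970203, 19990302, 19970605, 19990808, 19990706),
  spent = c(1997, 19, 199, 134, 654, 37, 876, 890, 873, 234,
            643, 567, 23, 25, 576)
)

# One data frame per customer, stored in a list named by id.
by_customer <- split(df, df$id)

# Sample 3 group names (character ids), then stack the chosen groups.
chosen <- sample(names(by_customer), 3)
result <- do.call(rbind, by_customer[chosen])
print(result)
```

Unlike the %in% filter, the rows come back grouped by customer rather than in their original order.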
Or, an option without aggregate:
df[df$id %in% sample(unique(df$id), 3), ]
#    id     date spent
# 1   1 19970807  1997
# 3   3 19971010   199
# 12  1 19990302   567
# 14  7 19990808    25
# 15  7 19990706   576
Source: https://stackoverflow.com/questions/12032307/getting-a-sample-of-a-data-frame-in-r