Sampling small data frame from a big dataframe

和自甴很熟 posted on 2019-12-20 02:08:50

Question


I am trying to sample a data frame from a given data frame such that there are enough samples from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought ddply (data frame to data frame) would do it for me. Taking a minimal example:

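set.seed(1)
# stringsAsFactors = TRUE makes a a factor (the default changed to FALSE in R 4.0)
data1 <- data.frame(a = sample(c('B0','B1','B2'), 100, replace = TRUE), b = rnorm(100), c = runif(100), stringsAsFactors = TRUE)
> summary(data1$a)
B0 B1 B2 
30 32 38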

The following commands perform the sampling...

When I enter...

data2 <- ddply(data1,c('a'),function(x) sample(x,20,replace=FALSE))

I get the following error

   Error in `[.data.frame`(x, .Internal(sample(length(x), size, replace,  : 
  cannot take a sample larger than the population when 'replace = FALSE'

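This error occurs because x inside the ddply function is not a vector but a data frame. sample() treats a data frame as a list of columns, so length(x) is 3 here and 20 items cannot be drawn without replacement.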
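
A quick check makes the cause visible. The snippet below is only a minimal sketch using a small made-up data frame (df is a hypothetical name, not part of the original example): length() counts columns, and sample() draws columns rather than rows.

```r
# Hypothetical 5-row, 3-column data frame for illustration
df <- data.frame(a = letters[1:5], b = 1:5, c = 6:10)

length(df)   # number of columns, not number of rows
nrow(df)     # number of rows

# sample() on a data frame picks columns, so this keeps 2 of the 3 columns
cols <- sample(df, 2)
ncol(cols)

# Asking for more items than there are columns fails without replacement;
# the error message is captured here instead of stopping the script
res <- tryCatch(sample(df, 20, replace = FALSE),
                error = function(e) conditionMessage(e))
print(res)
```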

Does anyone have any idea how to achieve this sampling? I know one way is to skip ddply and do (1) segregation, (2) sampling, and (3) collation in three steps. But I suspect there must be some way to do this with base or plyr functions...

Thank you for your help...


Answer 1:


I think what you want is to subset the data frame passed in x using sample:

ddply(data1,.(a),function(x) x[sample(nrow(x),20,replace = FALSE),])

But, of course, you still need to take care that the size of the sample for each piece (in this case 20) is no larger than the smallest subset of your data based on the levels of a.
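
One possible guard is to cap the per-group sample size at the group's row count. The following is a sketch in base R rather than plyr, so that it runs without extra packages; sample_per_group and its arguments are hypothetical names introduced here, and the split / sample / recombine steps follow the three-step approach mentioned in the question.

```r
# Hypothetical helper: draw up to n rows from each level of a grouping column
sample_per_group <- function(df, group, n) {
  pieces <- split(df, df[[group]])               # 1. segregate by level
  sampled <- lapply(pieces, function(x) {
    k <- min(n, nrow(x))                         # never ask for more rows than exist
    x[sample.int(nrow(x), k), , drop = FALSE]    # 2. sample row indices
  })
  out <- do.call(rbind, sampled)                 # 3. collate
  rownames(out) <- NULL
  out
}

set.seed(1)
data1 <- data.frame(a = sample(c('B0', 'B1', 'B2'), 100, replace = TRUE),
                    b = rnorm(100), c = runif(100))
data2 <- sample_per_group(data1, "a", 20)
table(data2$a)
```

With the cap in place, a level that has fewer than n rows simply contributes all of its rows instead of raising an error.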




Answer 2:


It would seem that if you want to sample a category that has fewer than 20 rows, you'd need replace=TRUE...

This might do the trick:

ddply(data1,'a',function(x) x[sample.int(NROW(x),20,replace=TRUE),])
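
Note that replace=TRUE can repeat rows even in groups that have enough rows. A possible variation, sketched here under the assumption that repeats are only acceptable when a group is too small, switches replacement on per piece. The helper sample_rows and the toy data are hypothetical and not from the answer; the demonstration uses base R so it runs without plyr.

```r
# Hypothetical helper: sample n rows, with replacement only if the piece is too small
sample_rows <- function(x, n) {
  x[sample.int(nrow(x), n, replace = nrow(x) < n), , drop = FALSE]
}

# Small made-up example: group "u" has 5 rows, group "v" has only 2
toy <- data.frame(g = c(rep("u", 5), rep("v", 2)), val = 1:7)

set.seed(42)
out <- do.call(rbind, lapply(split(toy, toy$g), sample_rows, n = 3))
table(out$g)   # 3 rows per group

# With plyr loaded, the same helper could be passed to ddply:
# ddply(toy, 'g', sample_rows, n = 3)
```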


Source: https://stackoverflow.com/questions/9913066/sampling-small-data-frame-from-a-big-dataframe
