Question
Suppose I have a dataset with 1,000,000 observations. Variables are age, race, gender. This dataset represents the whole US.
How can I draw a sample of 1,000 people from this dataset, given a certain age distribution? E.g. I want the 1,000 sampled people to be distributed like this:
0.3 * Age 0 - 30
0.3 * Age 31 - 50
0.2 * Age 51 - 69
0.2 * Age 70 - 100
Is there a quick way to do it? I already created a sample of 1000 people with the desired age distribution, but how do I combine that now with my original dataset?
As an example, this is how I have created the population distribution of Maine:
set.seed(123)
library(magrittr)
popMaine <- data.frame(min=c(0, 19, 26, 35, 55, 65), max=c(18, 25, 34, 54, 64, 113), prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))
Mainesample <- sample(nrow(popMaine), 1000, replace=TRUE, prob=popMaine$prop)
Maine <- round(popMaine$min[Mainesample] + runif(1000) * (popMaine$max[Mainesample] - popMaine$min[Mainesample])) %>% data.frame()
names(Maine) <- c("Age")
Now I don't know how to bring this together with my other dataset, which has the whole US population. I'd appreciate any help; I have been stuck for quite a while now.
Answer 1:
Below are four different approaches. Two use functions from, respectively, the splitstackshape and sampling packages, one uses base mapply, and one uses map2 from the purrr package (which is part of the tidyverse collection of packages).
First let's set up some fake data and sampling parameters:
# Fake data
set.seed(156)
df = data.frame(age=sample(0:100, 1e6, replace=TRUE))
# Add a grouping variable for age range
df$age.groups = cut(df$age, c(0,30,51,70,Inf), right=FALSE)
# Total number of people sampled
n = 1000
# Named vector of sample proportions by group
probs = setNames(c(0.3, 0.3, 0.2, 0.2), levels(df$age.groups))
Using the above sampling parameters, we want to sample n total values with a proportion probs from each age group.
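In this example probs*n happens to give whole numbers (300, 300, 200, 200). If it did not, the per-stratum counts would need rounding in a way that still adds up to n. Here is a minimal base-R sketch of one way to do that, handing leftover rows to the strata with the largest fractional parts. The helper name stratum_counts is hypothetical and not part of any package or of the approaches below.

```r
# Hypothetical helper (an assumption for illustration, not from any package):
# turn proportions into whole-number stratum sizes that sum to n.
stratum_counts = function(probs, n) {
  # Round away tiny floating-point noise before taking the floor
  raw = round(probs * n, 8)
  counts = floor(raw)
  # Rows still unassigned after flooring
  leftover = n - sum(counts)
  if (leftover > 0) {
    # Give one extra row each to the strata with the largest fractional parts
    idx = order(raw - counts, decreasing=TRUE)[seq_len(leftover)]
    counts[idx] = counts[idx] + 1
  }
  counts
}

probs = c("[0,30)"=0.3, "[30,51)"=0.3, "[51,70)"=0.2, "[70,Inf)"=0.2)
counts_1000 = stratum_counts(probs, 1000)  # whole numbers already, so unchanged
counts_1001 = stratum_counts(probs, 1001)  # one leftover row goes to a single stratum
print(counts_1000)
print(counts_1001)
```

The result keeps the names of probs, so it could be passed wherever probs*n is used as the size of each stratum.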
Option 1: mapply
mapply can apply multiple arguments to a function. Here, the arguments are (1) the data frame df split into the four age groupings, and (2) probs*n, which gives the number of rows we want from each age group:
df.sample = mapply(a=split(df, df$age.groups), b=probs*n,
function(a,b) {
a[sample(1:nrow(a), b), ]
}, SIMPLIFY=FALSE)
mapply returns a list of four data frames, one for each stratum. Combine this list into a single data frame:
df.sample = do.call(rbind, df.sample)
Check the sampling:
table(df.sample$age.groups)
  [0,30)  [30,51)  [51,70) [70,Inf) 
     300      300      200      200 
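The same mapply pattern could be adapted to the Maine age brackets from the question. The sketch below rests on some assumptions: us is a stand-in for the US dataset with an integer age column, the bracket and maine.sample names are made up for illustration, and the listed proportions are rescaled because they sum to 1.02 rather than 1.

```r
set.seed(123)

# Stand-in for the US dataset (assumption: one row per person, integer ages)
us = data.frame(age=sample(0:100, 1e5, replace=TRUE))

# Age brackets and proportions as given in the question
popMaine = data.frame(min=c(0, 19, 26, 35, 55, 65),
                      max=c(18, 25, 34, 54, 64, 113),
                      prop=c(0.2, 0.07, 0.11, 0.29, 0.14, 0.21))

# Assign each person to a bracket, using the bracket minimums as breaks
us$bracket = cut(us$age, breaks=c(popMaine$min, Inf), right=FALSE)

# Rows to draw per bracket; proportions are rescaled to sum to 1
n = 1000
counts = round(n * popMaine$prop / sum(popMaine$prop))

# Draw the requested number of rows from each bracket, without replacement
parts = mapply(function(a, b) a[sample(nrow(a), b), , drop=FALSE],
               split(us, us$bracket), counts, SIMPLIFY=FALSE)
maine.sample = do.call(rbind, parts)

# Rows sampled per bracket
print(table(maine.sample$bracket))
```

Because the rows come straight from the full dataset, every other column (race, gender) is carried along, so there is no separate merge step. Rounding the counts this way is not guaranteed to total exactly n for every set of proportions.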
Option 2: stratified function from the splitstackshape package
The size argument requires a named vector with the number of samples from each stratum.
library(splitstackshape)
df.sample2 = stratified(df, "age.groups", size=probs*n)
Option 3: strata function from the sampling package
This option is by far the slowest.
library(sampling)
# Data frame must be sorted by stratification column(s)
df = df[order(df$age.groups),]
sampled.rows = strata(df, 'age.groups', size=probs*n, method="srswor")
df.sample3 = df[sampled.rows$ID_unit, ]
Option 4: tidyverse packages
map2 is like mapply in that it applies two arguments in parallel to a function, in this case the dplyr package's sample_n function. map2 returns a list of four data frames, one for each stratum, which we combine into a single data frame with bind_rows.
library(dplyr)
library(purrr)
df.sample4 = map2(split(df, df$age.groups), probs*n, sample_n) %>% bind_rows
Timings
library(microbenchmark)
Unit: milliseconds
       expr        min         lq       mean     median         uq       max neval cld
     mapply   86.77215  110.82979  156.66855  123.95275  145.25115  486.2078    10  a 
     strata 5028.42933 5541.40442 5709.16796 5699.50711 5845.69921 6467.7250    10   b
 stratified   38.33495   41.76831   89.93954   45.43525   79.18461  408.2346    10  a 
  tidyverse   71.48638  135.49113  143.12011  142.86866  155.72665  192.4174    10  a 
Source: https://stackoverflow.com/questions/40353057/sampling-a-specific-age-distribution-from-a-dataset