subsetting based on number of observations in a factor variable

前端未结

关注

 2  691

情话喂你 2021-01-25 15:34

how do you subset based on the number of observations of the levels of a factor variable? I have a dataset with 1,000,000 rows and nearly 3000 levels, and I want to subset out

2条回答

谎友^ (楼主)

2021-01-25 16:10

I figured it out using the following, as there is no reason to do things twice:

function (df, column, threshold) { 
    size <- nrow(df) 
    if (threshold < 1) threshold <- threshold * size 
    tab <- table(df[[column]]) 
    keep <- names(tab)[tab >  threshold] 
    drop <- names(tab)[tab <= threshold] 
    cat("Keep(",column,")",length(keep),"\n"); print(tab[keep]) 
    cat("Drop(",column,")",length(drop),"\n"); print(tab[drop]) 
    str(df) 
    df <- df[df[[column]] %in% keep, ] 
    str(df) 
    size1 <- nrow(df) 
    cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n") 
    df[[column]] <- factor(df[[column]], levels=keep) 
    df 
}

0 讨论(0)

查看其它2个回答