Allow a maximum number of entries when certain conditions apply

情到浓时终转凉″ submitted on 2019-12-02 01:13:33

If dat is your data frame:

do.call(rbind, 
        by(dat, INDICES=list(dat$belongID, dat$sourceID), 
           FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
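To try the call above, here is a minimal sketch on a small made-up data frame. The values are hypothetical and only the column names follow the question; the group (1, 1001) has seven rows and should be trimmed to the five with the largest Time, while the two-row group (1, 1002) should be left alone.

```r
# Hypothetical toy data: group (1, 1001) has 7 rows, group (1, 1002) has 2.
dat <- data.frame(
  belongID = rep(1, 9),
  sourceID = c(rep(1001, 7), rep(1002, 2)),
  uniqID   = 101:109,
  Time     = c(5, 4, 3, 3, 2, 2, 1, 5, 1)
)

# Keep at most the 5 rows with the largest Time per (belongID, sourceID) group.
res <- do.call(rbind,
               by(dat, INDICES = list(dat$belongID, dat$sourceID),
                  FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], 5)))

# Rows kept per sourceID: 1001 is capped at 5, 1002 keeps both of its rows.
table(res$sourceID)
```

Groups with five or fewer rows pass through unchanged because head() returns everything when fewer than five rows are available.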

Say your data is in df. After running the following, the last line returns the output ordered by uniqID:

tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
    if(all(is.na(tab[bIDs == i, ])))next
    ids <- na.omit(sIDs[tab[i, sIDs] > 5])
    for(j in ids)
    {
        cond <- df$belongID == i & df$sourceID == j
        old <- df[cond,]
        id5 <- order(old$Time, decreasing = TRUE)[1:5]
        new <- old[id5,]
        df <- df[!cond,]
        df <- rbind(df, new)
    }
}
df[order(df$uniqID), ]
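The loop above depends on how tapply fills the count table: combinations of belongID and sourceID that never occur come back as NA, which is why the code checks is.na and wraps the selection in na.omit. A small sketch with made-up values (only the column names come from the question) shows the shape of that table:

```r
# Hypothetical data: belongID 1 has sources 1001 and 1002, belongID 2 has only 1005.
df <- data.frame(
  belongID = c(1, 1, 1, 2),
  sourceID = c(1001, 1001, 1002, 1005),
  Time     = c(5, 4, 3, 2)
)

# Count rows per (belongID, sourceID); absent combinations are NA, not 0.
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
tab

# A comparison such as tab[i, ] > 5 therefore yields NA for absent pairs,
# so the resulting index has to be cleaned with na.omit before use.
tab["2", ] > 5
```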

A solution in two lines using the plyr package:

library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]

The results:

   belongID sourceID uniqID Time
5         1     1001    101    5
6         1     1002    102    5
4         1     1001    103    4
2         1     1001    104    3
3         1     1001    105    3
7         1     1005    106    2
1         1     1001    108    2
10        2     1005    109    5
16        2     1006    110    5
11        2     1005    111    5
17        2     1006    112    5
12        2     1005    113    5
15        2     1006    114    4
9         2     1005    115    4
13        2     1006    116    3
8         2     1005    117    3
14        2     1006    118    3
18        2     1007    122    1
19        3     1010    123    5
20        3     1480    124    2

The dataset on which this method will be used has 170,000+ entries and almost 30 columns.

I benchmarked the three solutions provided by danas.zuokas, mplourde and Andrie on my dataset, with the following results:

danas.zuokas' solution:

   User      System  Elapsed
   2829.569  0       2827.86

mplourde's solution:

   User     System  Elapsed 
   765.628  0.000   763.908

Andrie's solution:

   User     System  Elapsed 
   984.989  0.000   984.010

Therefore I will use mplourde's solution. Thank you all!
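For anyone who wants to repeat this kind of comparison, here is a rough sketch of timing one approach with base R's system.time. The wrapper name top_n_by and the synthetic data are assumptions made for illustration; they are not the dataset or the exact code that produced the timings above.

```r
# Hypothetical wrapper around the by()-based approach from the first answer.
top_n_by <- function(d, n = 5) {
  do.call(rbind,
          by(d, INDICES = list(d$belongID, d$sourceID),
             FUN = function(x) head(x[order(x$Time, decreasing = TRUE), ], n)))
}

# Synthetic stand-in data (5000 rows), much smaller than the real dataset.
set.seed(1)
big <- data.frame(
  belongID = sample(1:20, 5000, replace = TRUE),
  sourceID = sample(1001:1020, 5000, replace = TRUE),
  uniqID   = seq_len(5000),
  Time     = sample(1:5, 5000, replace = TRUE)
)

# system.time reports user, system and elapsed time for the expression.
timing <- system.time(out <- top_n_by(big))
timing[c("user.self", "sys.self", "elapsed")]
```

Wrapping each candidate solution in a function with the same signature makes it easy to time them all on identical input.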

This should be faster, using data.table:

library(data.table)
DT = as.data.table(dat)

DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]

Aside: I suggest counting how many times the same variable name is repeated in the various answers to this question. Do you often have a lot of long or similar object names?
