问题
I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have a the same belongID. For the purposes of the research I need to do on the dataset I have to get rid of the entries of a single sourceID that occur more than 5 times for 1 belongID. The maximum of 5 entries that need to be kept are the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.
The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet..
回答1:
if dat
is your dataframe:
do.call(rbind,
by(dat, INDICES=list(dat$belongID, dat$sourceID),
FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
回答2:
Say your data is in df
. The ordered (by uniqID) output is obtained after this:
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
if(all(is.na(tab[bIDs == i, ])))next
ids <- na.omit(sIDs[tab[i, sIDs] > 5])
for(j in ids)
{
cond <- df$belongID == i & df$sourceID == j
old <- df[cond,]
id5 <- order(old$Time, decreasing = TRUE)[1:5]
new <- old[id5,]
df <- df[!cond,]
df <- rbind(df, new)
}
}
df[order(df$uniqID), ]
回答3:
A solution in two lines using the plyr
package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
回答4:
The dataset on which this method is going to be used has 170.000+ entries and almost 30 columns
Benchmarking each of the three provided solutions by danas.zuokas, mplourde and Andrie with the use of my dataset, provided the following outcomes:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Aurdie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
回答5:
This should be faster, using data.table :
DT = as.data.table(dat)
DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]
Aside : suggest to count the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?
来源:https://stackoverflow.com/questions/10797655/allow-a-maximum-number-of-entries-when-certain-conditions-apply