I am trying to process a 20 GB data file in R. I have 16 GB of RAM and an i7 processor. I am reading the data using:
y <- read.table(file = "sample.csv", header = TRUE, sep = ",", skip = 0, nrows = 50000000)
The dataset 'y' is as follows:
id feature
21 234
21 290
21 234
21 7802
21 3467
21 234
22 235
22 235
22 1234
22 236
22 134
23 9133
23 223
23 245
23 223
23 122
23 223
The above is a sample dataset showing different features for each id. I want to count how many times a particular feature listed in another dataset x has occurred for each id in y.
The dataset x is as follows:
id feature
21 234
22 235
23 223
And the final output that I want is as follows:
id feature_count
21 3
22 2
23 3
As we see, 234 occurred thrice for 21, 235 occurred twice for 22 and 223 occurred thrice for 23.
For this I have tried getting the positions where a new id starts (e.g. the 1st, 7th and 12th positions for the above sample) and then counting each feature using for loops as follows:
Getting the positions where a new id starts:
positions <- 1      # the first id starts at row 1
j <- 2
for (i in 1:(nrow(y) - 1)) {   # stop one short of the end so y$id[i + 1] is never NA
  if (y$id[i] != y$id[i + 1]) {
    positions[j] <- i + 1
    j <- j + 1
  }
}
Since the data is huge, the looping takes a lot of time (for 50 million rows it takes 321 seconds on the PC described above, and I have 300 million rows).
Counting the features that match the given feature in x (x is the data frame specified above; its features are matched against those of y, and feature_count is incremented on each match):
feature_count <- rep(0, nrow(x))
ends <- c(positions, nrow(y) + 1)   # sentinel end point for the last id group
for (i in 1:length(positions)) {
  for (j in ends[i]:(ends[i + 1] - 1)) {
    if (y$feature[j] == x$feature[i]) {
      feature_count[i] <- feature_count[i] + 1
    }
  }
}
Are there any R functions which can do this whole job for me in less time? Also, my original inner loop over positions[i]:positions[i+1] threw an error about NA arguments, since the last group has no end position (hence the sentinel above); please suggest the right way to handle that too.
I admit that I don't really understand the question the way it is written, but it sounds like "data.table" would be the way to go, and you should look into the .N symbol. As already mentioned, fread is going to be much better than read.csv, so I'll assume that you've read the data into a data.table named "DT".
Here's a small example to work with:
DT <- data.table(id = c(rep(21, 6), rep(22, 5), 23, 23),
feature = c(234, 290, 234, 7802, 3467, 234, 235,
235, 1234, 236, 134, 9133, 223))
DT
# id feature
# 1: 21 234
# 2: 21 290
# 3: 21 234
# 4: 21 7802
# 5: 21 3467
# 6: 21 234
# 7: 22 235
# 8: 22 235
# 9: 22 1234
# 10: 22 236
# 11: 22 134
# 12: 23 9133
# 13: 23 223
If you just wanted to count the number of each unique feature, you could do:
DT[, .N, by = "id,feature"]
# id feature N
# 1: 21 234 3
# 2: 21 290 1
# 3: 21 7802 1
# 4: 21 3467 1
# 5: 22 235 2
# 6: 22 1234 1
# 7: 22 236 1
# 8: 22 134 1
# 9: 23 9133 1
# 10: 23 223 1
If you wanted the count of the first "feature", by "id", you could use:
DT[, .N, by = "id,feature"][, .SD[1], by = "id"]
# id feature N
# 1: 21 234 3
# 2: 22 235 2
# 3: 23 9133 1
If you wanted to get the most frequently occurring "feature" by "id" (which is the same result as above, in this case), you can try the following:
DT[, .N, by = "id,feature"][, lapply(.SD, function(x) x[which.max(N)]), by = "id"]
Update
Based on your new description, this seems much easier. Just merge your datasets and aggregate the counts. Again, this is fast to do in "data.table":
DTY <- data.table(y, key = "id,feature")
DTX <- data.table(x, key = "id,feature")
DTY[DTX][, .N, by = id]
# id N
# 1: 21 3
# 2: 22 2
# 3: 23 3
Or:
DTY[, .N, by = key(DTY)][DTX]
# id feature N
# 1: 21 234 3
# 2: 22 235 2
# 3: 23 223 3
This is assuming that "x" and "y" are defined as the following to begin with:
x <- structure(list(id = 21:23, feature = c(234L, 235L, 223L),
counts = c(3L, 2L, 3L)), .Names = c("id", "feature", "counts"),
row.names = c(NA, -3L), class = "data.frame")
y <- structure(list(id = c(21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L,
22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L), feature = c(234L,
290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L, 236L, 134L,
9133L, 223L, 245L, 223L, 122L, 223L)), .Names = c("id", "feature"),
class = "data.frame", row.names = c(NA, -17L))
I would recommend the data.table package for this (fread is very fast!), then set up a loop that reads the file in chunks and stores the feature count sums. Here are some adapted lines of a function I have for looping over a file. It probably won't work as is, but it should give you an idea of what to do:
require(data.table)
# count the lines in the file with wc -l (the number is the first token of the output)
LineNu <- as.numeric(gsub(" .+", "", system2("wc", paste("-l", your.file, sep = " "), stdout = TRUE, stderr = TRUE)))
DT <- fread(your.file, nrows = 50000000, sep = ",", header = TRUE)
KEEP.DT <- DT[, list(feature = length(feature)), by = id]   # per-id row counts for this chunk
rm(DT); gc()
Starts <- c(seq(50000000, LineNu, by = 50000000), LineNu)
for (i in 2:(length(Starts) - 1)) {
  cat(paste0("Filtering next 50000000 lines ", i, " of ", length(Starts) - 1, " \n"))
  DT <- fread(your.file, skip = Starts[i],
              nrows = ifelse(50000000 * (i - 1) < Starts[length(Starts)], 50000000,
                             (50000000 * (i - 1)) - Starts[length(Starts)]),
              sep = ",", header = FALSE)
  setnames(DT, c("id", "feature"))   # the header line was skipped, so restore the names
  KEEP.DT <- rbind(KEEP.DT, DT[, list(feature = length(feature)), by = id])  # keep the aggregate, not the raw rows
  rm(DT); gc()
}
You may need to re-aggregate the counts in KEEP.DT at the end, since some ids might get read in different chunks.
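For example, a final pass along these lines (a sketch, assuming KEEP.DT as accumulated above) would collapse ids that straddled a chunk boundary:
# sum the per-chunk counts so each id appears once
KEEP.DT <- KEEP.DT[, list(feature = sum(feature)), by = id]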
For your example:
apply(sign(table(y)), 1, sum)
21 22 23
4 4 2
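Note that sign() collapses repeated features, so this counts the number of distinct features per id rather than the matches requested. To get the counts for the specific (id, feature) pairs listed in x in base R, one option is a merge followed by aggregate (a sketch, using the x and y data frames defined in the answer above):
merged <- merge(y, x[, c("id", "feature")])  # inner join on the shared id and feature columns
aggregate(list(feature_count = merged$feature), by = list(id = merged$id), FUN = length)
#   id feature_count
# 1 21             3
# 2 22             2
# 3 23             3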
How about table()?
> set.seed(5)
> ids <- sample(1:3, 12, TRUE)
> features <- sample(1:4, 12, TRUE)
> cbind(ids, features)
ids features
[1,] 1 2
[2,] 3 3
[3,] 3 2
[4,] 1 1
[5,] 1 2
[6,] 3 4
[7,] 2 3
[8,] 3 4
[9,] 3 4
[10,] 1 3
[11,] 1 1
[12,] 2 1
> table(ids, features)
features
ids 1 2 3 4
1 2 2 1 0
2 1 0 1 0
3 0 1 1 3
So, for example, feature 4 appears 3 times for id 3.
EDIT: You can use as.data.frame() to "flatten" the table and get:
> as.data.frame(table(ids, features))
ids features Freq
1 1 1 2
2 2 1 1
3 3 1 0
4 1 2 2
5 2 2 0
6 3 2 1
7 1 3 1
8 2 3 1
9 3 3 1
10 1 4 0
11 2 4 0
12 3 4 3
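To pull out just the (id, feature) pairs listed in x, the table can also be indexed with a two-column character matrix (a sketch, using the question's x and y rather than the random data above):
tab <- table(y$id, y$feature)
# dimnames of a table are character, so convert the lookup columns
tab[cbind(as.character(x$id), as.character(x$feature))]
# [1] 3 2 3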
Source: https://stackoverflow.com/questions/24645195/count-features-for-different-ids-in-columns-in-r-in-faster-way