I would like to increase the speed of my for loop via vectorization, data.table, or something else. I have to run the code on 1,000,000 rows and my code is really slow.
You can use Rcpp when vectorization is difficult.
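Your loop presumably looks roughly like the sketch below (reconstructed from the Rcpp translation that follows; the sample data, the variable names, and the 100-unit bin size are assumptions, not your exact code):

Volume <- sample(1:5, 500, replace = TRUE)   # example data, an assumption
binIdexVector <- integer(length(Volume))
binIdex <- 1
totalVolume <- 0
for (i in seq_along(Volume)) {
  totalVolume <- totalVolume + Volume[i]
  if (totalVolume <= 100) {
    # the current element still fits in the current bin
    binIdexVector[i] <- binIdex
  } else {
    # the bin is full: start a new bin with this element
    binIdex <- binIdex + 1
    binIdexVector[i] <- binIdex
    totalVolume <- Volume[i]
  }
}

The same logic translated to C++ via Rcpp: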
library(Rcpp)
cppFunction('
IntegerVector bin(NumericVector Volume, int n) {
  IntegerVector binIdexVector(Volume.size());
  int binIdex = 1;
  double totalVolume = 0;
  for (int i = 0; i < Volume.size(); i++) {
    totalVolume = totalVolume + Volume[i];
    if (totalVolume <= n) {
      // the current element still fits in the current bin
      binIdexVector[i] = binIdex;
    } else {
      // the bin is full: start a new bin with the current element
      binIdex++;
      binIdexVector[i] = binIdex;
      totalVolume = Volume[i];
    }
  }
  return binIdexVector;
}')
A quick check against binIdexVector, the result of the original R loop:

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE
It's also faster than findInterval(cumsum(Volume), seq(0, sum(Volume), by = 100)), which of course gives an inexact answer, since it never resets the running total at a bin boundary.
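A hedged timing sketch (it assumes the microbenchmark package is available; exact numbers will depend on your machine and data):

library(microbenchmark)
Volume <- sample(1:5, 1e6, replace = TRUE)
microbenchmark(
  rcpp         = bin(Volume, 100),
  findInterval = findInterval(cumsum(Volume), seq(0, sum(Volume), by = 100)),
  times = 10
)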
Alternatively, a vectorized base R approach:

Volume <- sample(1:5, 500, replace = TRUE)
# prepend a 1 so the first element gets a label too (diff() drops one element)
binLabels <- c(1, cumsum(diff(cumsum(Volume) %% 100) < 0) + 1)
This produces the vector binLabels, which indicates which bin each data point belongs to. Each bin holds as many data points as are needed for their sum to reach roughly 100.
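As a quick sanity check (a sketch, reusing the sample Volume above), the per-bin totals should all land within a few units of 100, since individual volumes range from 1 to 5; the last bin may be smaller if the data runs out before the next boundary:

tapply(Volume, binLabels, sum)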