I have about 5 very large vectors (~ 108 MM entries), so any plotting or other work I do with them in R takes quite a long time.
I am trying to visualize their distribution (histogram).
Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:
library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin(), x_end = x.end();
    for(; x_it != x_end; ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= (int) out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')
x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
# user system elapsed
# 1.373 0.000 1.373
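For readers without a C++ toolchain, the same fixed-width binning can be sketched in base R. This is only an illustration and not part of the snippet above: bin_r is a hypothetical helper name, and it should be expected to run much slower than bin3 on 1e8 values. It uses trunc() to mirror the double-to-int conversion in the C++ code and puts the missing-value count in the last position, as bin3 does.

```r
# Hypothetical base-R counterpart of bin3 (illustration only)
bin_r <- function(x, width, origin = 0) {
  missing <- is.na(x)                            # TRUE for NA and NaN, like ISNAN
  idx <- trunc((x[!missing] - origin) / width)   # truncate toward zero, like the C++ int conversion
  idx <- idx[idx >= 0]                           # drop values that fall below the origin
  counts <- tabulate(idx + 1)                    # bins are 0-based, tabulate() counts from 1
  c(counts, sum(missing))                        # missing-value count goes in the last position
}

small <- bin_r(c(0.05, 0.15, 0.15, NA), width = 0.1)
print(small)
# [1] 1 2 1
```

One value lands in the first bin, two in the second, and the trailing 1 is the NA count. One difference from bin3: when no value falls into any bin, tabulate() still returns a single zero count.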
That said, hist is pretty fast here too:
system.time(hist(x8, breaks = 100, plot = FALSE))
# user system elapsed
# 7.281 1.362 8.669
It's straightforward to use bin3 to make a histogram or frequency polygon:
# First we create some sample data, and bin each column
library(reshape2)
library(ggplot2)
df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))
# Next we match up the bins with the breaks
binsdf <- data.frame(
  # Left edge of each bin (0, 0.01, ..., 0.99); NA labels the missing-value count
  breaks = c((0:99) / 100, NA),
  bins)
# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)
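The line plot above is the frequency-polygon form. For a histogram-style display, the same long-format counts can be drawn as bars instead. A minimal sketch, assuming a ggplot2 version that provides geom_col(); demo is a small made-up stand-in for binsm so that the example runs on its own, and its counts are invented.

```r
library(ggplot2)

# Made-up stand-in for binsm: bin left edges, series label, and counts
demo <- data.frame(
  breaks   = rep(c(0, 0.01, 0.02), times = 2),
  variable = rep(c("V1", "V2"), each = 3),
  value    = c(10, 12, 9, 11, 8, 13)
)

# One panel per series; each bar is one bin wide and centred on its breaks value
p <- ggplot(demo, aes(breaks, value)) +
  geom_col(width = 0.01) +
  facet_wrap(~ variable)
print(p)
```

With the real data, replacing demo with binsm should give one panel per column of df.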
FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)