Superimpose histogram fits in one plot with ggplot

小鲜肉 2020-12-29 16:55

I have about 5 very large vectors (~108 MM entries), so any plotting or other processing I do with them in R takes quite a long time.

I am trying to visualize their distributions as histograms.

1 answer
  • 2020-12-29 17:37

    Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:

    library(Rcpp)
    cppFunction('
      std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
        int bin, nmissing = 0;
        std::vector<int> out;
    
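        // Cache the end iterator instead of calling x.end() on every pass
        NumericVector::iterator x_it = x.begin(), x_end = x.end();
        for(; x_it != x_end; ++x_it) {
          double val = *x_it;
          if (ISNAN(val)) {
            ++nmissing;
          } else {
            // Skip values below origin before truncating: the int conversion
            // rounds toward zero, so (origin - width, origin) would land in bin 0
            if (val < origin) continue;
            bin = (val - origin) / width;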
    
            // Make sure there\'s enough space
            if (bin >= out.size()) {
              out.resize(bin + 1);
            }
            ++out[bin];
          }
        }
    
        // Put missing values in the last position
        out.push_back(nmissing);
        return out;
      }
    ')
    
    x8 <- runif(1e8)
    system.time(bin3(x8, 1/100))
    #   user  system elapsed 
    #  1.373   0.000   1.373 
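
    As a cross-check, the same binning rule can be sketched in base R. This is not part of the original answer: bin_r is a hypothetical helper name, and it assumes the same width/origin meaning as bin3, with the NA count appended in the last position. It is slower than the Rcpp version, but results should agree for data at or above origin.

    ```r
    # Hypothetical base-R counterpart of bin3 (needs no compiler).
    bin_r <- function(x, width, origin = 0) {
      nmissing <- sum(is.na(x))                      # counts NA and NaN
      idx <- floor((x[!is.na(x)] - origin) / width)  # 0-based bin index
      idx <- idx[idx >= 0]                           # drop values below origin
      counts <- tabulate(idx + 1)                    # tabulate() expects 1-based bins
      c(counts, nmissing)                            # NA count goes last
    }

    x_small <- c(0.05, 0.15, 0.17, 0.95, NA, -0.2)
    bin_r(x_small, width = 0.1)
    # [1] 1 2 0 0 0 0 0 0 0 1 1
    ```

    Comparing bin_r and bin3 on a small vector like this is a cheap way to confirm the bin edges before running on 1e8 values.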
    

    That said, hist is pretty fast here too:

    system.time(hist(x8, breaks = 100, plot = FALSE))
    #   user  system elapsed 
    #  7.281   1.362   8.669 
    

    It's straightforward to use bin3 to make a histogram or frequency polygon:

    # First we create some sample data, and bin each column
    
    library(reshape2)
    library(ggplot2)
    
    df <- as.data.frame(replicate(5, runif(1e6)))
    bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))
    
    # Next we match up the bins with the breaks
    binsdf <- data.frame(
      # Left edge of each 0.01-wide bin (0, 0.01, ..., 0.99); NA labels the missing-value count
      breaks = c((0:99) / 100, NA),
      bins)
    
    # Then melt and plot
    binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
    # qplot() is deprecated as of ggplot2 3.4.0; this is the equivalent ggplot() call
    ggplot(binsm, aes(breaks, value, colour = variable)) + geom_line()
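
    If compiling Rcpp code is not an option, a similar overlay can be sketched with base hist() alone, binning every vector on one shared set of breaks. This is an illustration, not the original answer's method: vecs, breaks and binned are made-up names, and the three random vectors stand in for the real data.

    ```r
    library(ggplot2)

    set.seed(1)
    # Stand-in data; replace with the real vectors
    vecs <- list(a = rnorm(1e5), b = rnorm(1e5, mean = 1), c = runif(1e5, -3, 3))

    # One common grid of 100 bins spanning the range of all vectors
    rng <- range(unlist(lapply(vecs, range, na.rm = TRUE)))
    breaks <- seq(rng[1], rng[2], length.out = 101)

    # Bin each vector without plotting, keeping only midpoints and counts
    binned <- do.call(rbind, lapply(names(vecs), function(nm) {
      h <- hist(vecs[[nm]], breaks = breaks, plot = FALSE)
      data.frame(variable = nm, mid = h$mids, count = h$counts)
    }))

    # Overlay the frequency polygons, one colour per vector
    p <- ggplot(binned, aes(mid, count, colour = variable)) + geom_line()
    print(p)
    ```

    Only the binned counts (100 rows per vector) reach ggplot2, so the plotting step stays cheap however long the input vectors are.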
    

    FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)
