Question
I have a tibble in R with dimensions 15,000,000 x 140. Size-wise it's about 6 GB.
I want to check whether any of columns 11-40 in a given row starts with one of a specific list of prefixes. The result should be a vector of 1s and 0s that is 15,000,000 long.
I can do this using the following:
subResult <- apply(rawData[,11:40], c(1,2), function(x){substring(x,1,3) %in% c("295", "296", "297", "298", "299")})
result <- apply(subResult, 1, sum)
Problem is that this is way too slow -- it would take over 1 day to do just for the first line.
Is there any way to do this faster -- perhaps directly through dplyr or data.table?
Thank you!
Here's a sampling of the data trimmed to just columns 11-40.
> head(rawData)
# A tibble: 6 x 30
  X1      X2     X3    X4     X5    X6    X7    X8     X9    X10    X11   X12   X13
  <chr>   <chr>  <chr> <chr>  <chr> <chr> <chr> <chr>  <chr> <chr>  <chr> <chr> <chr>
1 39402   39451  3fv3i 19593  fk20  14p4  59304 329fj2 NA    NA     NA    NA    NA
2 39422   f203ff vmio2 vo2493 19149 59833 13404 394034 43920 349304 59302 1934  34834
3 3432f32 fe493  43943 H2344  53049 V602  3124  K148   K13   NA     NA    NA    NA
# ... with 17 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
# X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>,
# X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>
Answer 1:
Based on the description, this can be done either with tidyverse:
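library(tidyverse)
prefixes <- c("295", "296", "297", "298", "299")
rawData %>%
  select(11:40) %>%               # select the columns
  # convert to logical columns (across() needs dplyr >= 1.0.0;
  # older versions used mutate_all(funs(...)), which is now deprecated)
  mutate(across(everything(), ~ substring(.x, 1, 3) %in% prefixes)) %>%
  reduce(`+`) %>%                 # get the row-wise sum
  mutate(rawData, newcol = .)     # assign a new column to the original data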
Or with data.table: convert the 'data.frame' to a 'data.table' by reference with setDT(rawData), specify the columns of interest in .SDcols, loop over those columns with lapply, convert each one to a logical vector using the OP's condition, sum across the columns per row with Reduce, and assign the result (:=) to 'newCol':
library(data.table)
setDT(rawData)[, newCol := Reduce('+', lapply(.SD, function(x)
substring(x, 1, 3) %chin% c("295", "296", "297", "298", "299"))),
.SDcols = 11:40]
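Both versions above produce a per-row count of matching columns, while the question asks for a vector of 1s and 0s. A minimal sketch of that last step, assuming a made-up data frame `toy` standing in for columns 11-40 (the values and names are illustrative only, and base R is used so it runs without extra packages):

```r
# Toy stand-in for columns 11-40 of rawData (hypothetical values).
toy <- data.frame(
  X1 = c("29512", "39402", NA),
  X2 = c("2990",  "f203ff", "29"),
  X3 = c("1234",  "2971",   NA),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")

# One logical vector per column, added element-wise: matches per row.
counts <- Reduce(`+`, lapply(toy, function(x) substring(x, 1, 3) %in% prefixes))

# Collapse the count into the 1/0 indicator the question asks for.
flag <- as.integer(counts > 0)
```

Missing values need no special handling here, since `NA %in% prefixes` is FALSE.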
Answer 2:
My comments:
- apply converts your data to a matrix
- a data frame is above all a list, not a matrix
- substring() is a vectorized function (%in% too)
So, I would do:
sapply(rawData[11:40], function(var) {
substring(var, 1, 3) %in% c("295", "296", "297", "298", "299")
})
and then use rowSums() instead of apply(subResult, 1, sum).
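Put together, the column-wise version might look like the sketch below. The data frame `toy` and its values are invented for illustration; the cell-by-cell `apply` version from the question is run alongside only to check that both give the same counts on a small input.

```r
# Toy stand-in for rawData[11:40] (hypothetical values).
toy <- data.frame(
  X1 = c("29634", "39451", "K148"),
  X2 = c(NA,      "2981",  "V602"),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")

# Column-wise: one vectorized call per column; sapply returns a logical matrix.
hits <- sapply(toy, function(var) substring(var, 1, 3) %in% prefixes)
result <- rowSums(hits)

# The cell-by-cell approach from the question, for comparison only.
slow <- apply(
  apply(toy, c(1, 2), function(x) substring(x, 1, 3) %in% prefixes),
  1, sum
)
```

The column-wise form calls the function once per column (30 calls for columns 11-40) instead of once per cell, which is where the speed-up should come from.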
Answer 3:
Try the Rcpp package.
Here is a simple C++ function that takes two string vectors and checks whether the first 3 characters of each element of the first vector equal each element of the second. It returns a logical matrix of size length(first vector) x length(second vector).
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
LogicalMatrix IndicatorMatrix(std::vector<std::string> target, std::vector<std::string> tocheck) {
int nrows = target.size();
int ncols = tocheck.size();
LogicalMatrix ind(nrows, ncols);
for(int r=0; r<nrows; r++) {
for(int c=0; c<ncols; c++) {
bool found = target[r].substr(0,3) == tocheck[c];
ind(r,c) = found;
}
}
return ind;
}
After that you can source this file into R and call IndicatorMatrix as if it were an ordinary R function.
library(Rcpp)
sourceCpp("C:/Users/Desktop/indicatorMatrix.cpp")
rep("123456", 15000000) -> x
df <- data.frame(x,x,x,x,x,x,x,x, stringsAsFactors=FALSE)
y <- c("123", "124", "345", "231", "675", "344", "222")
t1 <- Sys.time()
out <- lapply(1:length(df), function(col) {
res <- IndicatorMatrix(unlist(df[,col]), y)
res
})
t2 <- Sys.time()
t2-t1
The program searched for the 7 three-character strings in y across an 8-column data frame with 15 million rows in about 100 seconds. So this could be the right direction for you.
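The list `out` holds one logical matrix per column, so it still has to be collapsed into the single per-row vector the question asks for. A sketch of that step, assuming a hypothetical pure-R stand-in `indicator_matrix_r()` (built on `outer()`, not part of the answer) so the example runs without compiling the C++ code; the toy `df` is also invented:

```r
# Hypothetical pure-R stand-in for the compiled IndicatorMatrix():
# rows = elements of target, columns = elements of tocheck.
indicator_matrix_r <- function(target, tocheck) {
  outer(substr(target, 1, 3), tocheck, "==")
}

# Small illustrative input (made-up values).
df <- data.frame(
  a = c("123456", "999999", "345000"),
  b = c("000000", "888000", "222111"),
  stringsAsFactors = FALSE
)
y <- c("123", "124", "345", "231", "675", "344", "222")

# One logical matrix per column, as in the timing code above.
out <- lapply(df, function(col) indicator_matrix_r(col, y))

# Matches per row within each column, then added up across columns.
counts <- Reduce(`+`, lapply(out, rowSums))
flag <- as.integer(counts > 0)
```

With the compiled function, the same last two lines should apply to `out` unchanged. Note that the stand-in yields NA for missing inputs, so real data with NAs may need `rowSums(..., na.rm = TRUE)`.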
Source: https://stackoverflow.com/questions/49645059/running-apply-command-on-a-very-large-data-frame