As a relatively inexperienced user of the data.table package in R, I\'ve been trying to process one text column into a large number of indicator columns (dummy variables), w
Here's another solution, that constructs a sparse matrix object instead of what you have. This shaves off a lot of time AND memory.
It produces ordered results and even with conversion to data.table it's faster than GT3 with 0L and 1L and without reordering (this could be because I use a different method for arriving at the required coordinates - I didn't go through the GT3 algo), however if you don't convert and keep it as a sparse matrix it's about 10-20x faster than GT3 (and has a much smaller memory footprint).
library(Matrix)
strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")
tmp = data.table(n = seq_along(strings), each = unlist(lapply(strings, length)))
rows = tmp[, rep(n, each = each), by = n][, V1]
cols = element.map[J(unlist(strings))][,n]
dt.sparse = sparseMatrix(rows, cols, x = 1,
dims = c(max(rows), length(elements_list)))
# optional, should be avoided until absolutely necessary
dt = cbind(dt, as.data.table(as.matrix(dt.sparse)))
setnames(dt, c('id', 'messy_string', elements_list))
The idea is to split to strings, then use a data.table as a map object to map each substring to its correct column position. From there on it's just a matter of figuring out the rows correctly and filling in the matrix.