Split a string column into several dummy variables


醉梦人生 2020-12-01 22:13

As a relatively inexperienced user of the data.table package in R, I've been trying to process one text column into a large number of indicator columns (dummy variables), w

6 answers

感动是毒 (original poster)

2020-12-01 22:38
Here's another solution that constructs a sparse matrix object instead of the dense indicator columns you have. This saves a lot of both time and memory.

It produces ordered results, and even with the conversion to a data.table it is faster than GT3 with 0L and 1L and without reordering (possibly because I use a different method to arrive at the required coordinates; I didn't go through the GT3 algorithm). If you skip the conversion and keep the result as a sparse matrix, it is about 10-20x faster than GT3 and has a much smaller memory footprint.
```
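library(data.table)
library(Matrix)

# split each messy string on the literal "$" separator
strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
# keyed lookup table: element name -> column position in elements_list
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")

# row index: repeat each row number once per substring found in that row
rows = rep(seq_along(strings), lengths(strings))
# column index: look up each substring's position via a keyed join
cols = element.map[J(unlist(strings))][, n]

# size the matrix by the number of input rows, so trailing empty strings keep their row
dt.sparse = sparseMatrix(rows, cols, x = 1,
                         dims = c(length(strings), length(elements_list)))

# optional, should be avoided unless absolutely necessary
dt = cbind(dt, as.data.table(as.matrix(dt.sparse)))
setnames(dt, c('id', 'messy_string', elements_list))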
```
The idea is to split the strings, then use a keyed data.table as a map object that maps each substring to its correct column position. From there on, it's just a matter of working out the row index for each substring and filling in the matrix.
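To make the coordinate mapping concrete, here is a minimal sketch of the same approach on toy data. The three-row table and the `elements_list` values are made up for illustration, and passing `dimnames` to `sparseMatrix` is my addition so the columns stay labelled without converting to a dense table.

```r
library(data.table)
library(Matrix)

# Hypothetical toy input; column names mirror the ones used in the answer
dt = data.table(id = 1:3, messy_string = c("b$a", "c", "a$c"))
elements_list = c("a", "b", "c")

# split on the literal "$" and build the keyed name -> position map
strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")

# one (row, col) coordinate per substring
rows = rep(seq_along(strings), lengths(strings))
cols = element.map[J(unlist(strings))][, n]

dt.sparse = sparseMatrix(rows, cols, x = 1,
                         dims = c(length(strings), length(elements_list)),
                         dimnames = list(NULL, elements_list))

# summaries work directly on the sparse object, no dense conversion needed
counts = colSums(dt.sparse)
```

With this input the coordinates are rows 1, 1, 2, 3, 3 and columns 2, 1, 3, 1, 3, so `counts` holds 2, 1 and 2 for `a`, `b` and `c`. Note that a substring missing from `elements_list` would produce an `NA` column index, so the sketch assumes every substring is listed.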