Manipulate char vectors inside a data.table object in R

Posted by 空扰寡人 on 2020-01-06 19:34:11


I'm still fairly new to data.table and to all its subtleties. I've looked through the documentation and other examples on SO but couldn't find what I want, so please help!

I have a data.table which is basically a char vector (each entry being a sentence)

library(data.table)
DT <- data.table(text = c("I love you", "she loves me"))

# > DT
#            text
# 1:   I love you
# 2: she loves me

What I'd like to do is perform some basic string operations inside the DT object. For example, add a new column holding a char vector in which each entry is a WORD from the string in the "text" column.

So I'd like to have, for example, a new column charvec where

> DT[1]$charvec
[1] "I"    "love" "you"

Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files larger than 1 GB, and with more complex and computation-heavy functions. So no use of apply, lapply or mapply.

My closest attempt so far is as follows:

myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
#            text      charvec
# 1:   I love you   I,love,you
# 2: she loves me she,loves,me

For example, to make a function which removes the first word of each sentence, I did this

myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
#            text  charvec
# 1:   I love you love,you
# 2: she loves me loves,me

The trouble is that the column charvec holds a list and not a vector...

> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"

1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the char vector, applying some hash to it, etc.

2) BTW, can I get to DU2 or DV2 in one line instead of two?

3) I don't understand the data.table syntax well. Why does the column V1 vanish when list() is used inside the [..]?

4) On another thread, I read a bit about the function cSplit. Is it any good? Is it a function adapted to data.table objects?

thanks very much


Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer. I have a huge file of 10,000,000 sentences stored as strings. As a first step for that project, I want to make a hash of the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split the data into 10 files of 1,000,000 sentences each, that is, around 10 files of about 1 GB. The following code takes several minutes on my laptop for just a single file.

library(data.table); library(digest)
num_row <- 1000000  # assumed: one file of 1,000,000 sentences
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
colnames(DT) <- "text"
rawdata <- DT

hash2 <- function(word){ # using library(digest)
        # Assumption: the function body is missing from this copy of the post;
        # a plain digest() call stands in for it here.
        digest(word)
}

print(system.time({
        colnames(rawdata) <- "sentence"
        rawdata <- lapply(rawdata,strsplit," ")

        sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
        hash_list <- sapply(sentences_begin,hash2)
        # remove(rawdata)
})) ## end of print system.time for loading the data

I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features... hence all my questions.

Here is an implementation excluding lapply, but it's actually slower!

print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]

myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]

rebuildsentence <- function(S){
        paste(S,collapse=" ") }

myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}

DW1 <- DV2[,myfun3(charvec),by=text]

})) #end of system.time

This data.table implementation uses no lapply, so I hoped the hashing would be faster. However, because every column holds a list instead of a char vector, this may slow the whole thing down significantly (?).

Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java etc. do a similar job in a few seconds.

Of course, another road would be to find a faster hash function but I assumed the one in digest package was already optimized.


I'm not really sure what you're after, but you can try cSplit_l from my "splitstackshape" package to get to your list column:

DU <- cSplit_l(DT, "DT", " ")
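
If pulling in another package is not an option, a similar list column can be sketched with data.table alone. This assumes the sentences sit in a column named text, as in the question, and charvec is simply the column name the question proposed. strsplit() is already vectorised over the whole column and returns a list, so no by or lapply should be needed:

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# strsplit() returns one character vector per row, so the new column is a list column
DT[, charvec := strsplit(text, " ", fixed = TRUE)]

# Each cell is a list element; [[ ]] extracts the plain character vector
first_words <- DT$charvec[[1]]
```

Note that DT[1]$charvec (single bracket) still returns a list of length one, which is why str() reports "List of 1". That is how data.table stores vectors of varying length in a single column.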

Then, you can write a function like the following to remove values from the list column:

RemovePos <- function(inList, pos = 1) {
  lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}

Example usage:

DU[, list(RemovePos(DT_list, 1)), by = DT]
#              DT       V1
# 1:   I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
#              DT     V1
# 1:   I love you  I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
#              DT  V1
# 1:   I love you you
# 2: she loves me  me
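
One edge case is worth checking before running this on real data: if every position in pos lies beyond the length of a vector, the index becomes x[-integer(0)], which in R selects nothing and returns an empty vector rather than leaving x untouched. A minimal sketch of a guarded variant follows. The name RemovePosSafe is hypothetical and is not part of splitstackshape.

```r
# Hypothetical guarded variant of RemovePos (not from any package)
RemovePosSafe <- function(inList, pos = 1) {
  lapply(inList, function(x) {
    drop <- pos[pos <= length(x)]            # keep only positions that exist in x
    if (length(drop) == 0L) x else x[-drop]  # nothing to drop: return x unchanged
  })
}

words <- list(c("I", "love", "you"), c("hello"))
res <- RemovePosSafe(words, 2)
# res[[1]] is c("I", "you"); res[[2]] is still "hello" because it has no second word
```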


Based on your loathing of `lapply`, maybe you can try something like the following:

## make a copy of your "text" column
DT[, vals := text]

## Use `cSplit` to create a "long" dataset. 
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
#            text  vals ind
# 1:   I love you     I   1
# 2:   I love you  love   2
# 3:   I love you   you   3
# 4: she loves me   she   1
# 5: she loves me loves   2
# 6: she loves me    me   3

## Now, you can extract values easily
DTL[ind == 1]
#            text vals ind
# 1:   I love you    I   1
# 2: she loves me  she   1
DTL[ind %in% c(1, 3)]
#            text vals ind
# 1:   I love you    I   1
# 2:   I love you  you   3
# 3: she loves me  she   1
# 4: she loves me   me   3
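
The long form also makes it fairly easy to rebuild a shortened sentence. The sketch below assumes only data.table: it builds the long table with strsplit() instead of cSplit so that it is self-contained, and the column name firstTwo is made up for the illustration. Grouping by text assumes the sentences are unique; with duplicated sentences a row id would probably be a safer grouping key.

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# One row per word, plus the word's position within its sentence
DTL <- DT[, .(vals = unlist(strsplit(text, " ", fixed = TRUE))), by = text]
DTL[, ind := seq_len(.N), by = text]

# Keep the first two words of each sentence and paste them back together
firstTwo <- DTL[ind <= 2, .(firstTwo = paste(vals, collapse = " ")), by = text]
```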

Update 2

I don't know what type of timings you are getting, but as I mentioned in a comment, you can perhaps try using regular expressions so that you don't have to split and then paste the string back together.

Here's a sample....

Set up some data to play with:

DT <- data.table(
  text = c("This is a sentence with a lot of words.",
           "This is a sentence with some more words.",
           "Words and words and even some more words.",
           "But, I don't know how you want to deal with punctuation...",
           "Just one more sentence, for easy multiplication."))

DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))

Test the gsub pattern to extract 5 words from each sentence....

## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"

## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#    0.03    0.00    0.03 
system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
#    user  system elapsed 
#       3       0       3 
head(temp2)
# [1] "This is a sentence with"     "This is a sentence with"     "Words and words and even"   
# [4] "But, I don't know how"       "Just one more sentence, for" "This is a sentence with" 
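
To see what the pattern does on edge cases, here is a small base-R sketch. It passes perl = TRUE, which is my addition rather than part of the timings above, and uses sub() since the anchored pattern can match at most once per string. A sentence with fewer than five words does not match at all and comes back unchanged, which may or may not be what you want.

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"

x <- c("This is a sentence with a lot of words.",
       "But, I don't know how you want to deal with punctuation...",
       "Too short")

# The replacement keeps only the captured group:
# four "word + whitespace" repeats followed by a fifth word
firstFive <- sub(patt, "\\1", x, perl = TRUE)
# firstFive[3] stays "Too short" because the pattern needs at least five words
```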

My guess at what you're looking to do....

## I'm assuming you want something like this....
## Takes about a minute on my system. 
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
  , firstFive := gsub(patt, "\\1", text)][
  , firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
#    user  system elapsed 
#   62.14    0.05   62.20 

head(out)
#                                                          text                   firstFive firstFiveHash
# 1:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
# 2:                   This is a sentence with some more words.     This is a sentence with    4179639471
# 3:                  Words and words and even some more words.    Words and words and even    2556713080
# 4: But, I don't know how you want to deal with punctuation...       But, I don't know how    3765680401
# 5:           Just one more sentence, for easy multiplication. Just one more sentence, for     298317689
# 6:                    This is a sentence with a lot of words.     This is a sentence with    4179639471
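
Since the per-row hashing appears to dominate the run time (about 3 seconds for the regex alone versus about a minute with hashing), and the output above shows the same five-word prefix recurring, one option is to hash each distinct prefix once and map the results back. The sketch below is base R. hash_once is a hypothetical helper, and toy_hash is only a stand-in so the example runs without extra packages; in practice hash2 or any other function returning one number per string would be passed instead. This only pays off when prefixes actually repeat.

```r
# Hypothetical helper: apply a scalar hash function once per distinct string
hash_once <- function(x, hash_fun) {
  u <- unique(x)                                           # distinct strings only
  h <- vapply(u, hash_fun, numeric(1), USE.NAMES = FALSE)  # one call per distinct string
  h[match(x, u)]                                           # map hashes back to every row
}

# Stand-in for a real hash function (NOT a good hash; for illustration only)
toy_hash <- function(s) as.numeric(sum(utf8ToInt(s)))

prefixes <- c("This is a sentence with", "Words and words and even",
              "This is a sentence with")
hashes <- hash_once(prefixes, toy_hash)
```

With the data above, the call would look something like DT3[, firstFiveHash := hash_once(firstFive, hash2)], assuming hash2 returns a single numeric value per string.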

