Question
I'm still fairly new to data.table and don't yet understand all its subtleties. I've looked in the documentation and at other examples on SO but couldn't find what I want, so please help!
I have a data.table which is basically a character vector (each entry being a sentence):
DT=c("I love you","she loves me")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
# > DT
# text
# 1: I love you
# 2: she loves me
What I'd like to do is perform some basic string operations inside the DT object. For example, add a new column holding a character vector in which each entry is a word from the string in the "text" column.
So I'd like to have, for example, a new column charvec where
> DT[1]$charvec
[1] "I" "love" "you"
Of course, I would like to do it the data.table way, ultra-fast, because I need to do this kind of thing on files larger than 1 GB, with more complex and computation-heavy functions. So no use of apply, lapply, or mapply.
My closest attempt at something like this is as follows:
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
# > DU2
# text charvec
# 1: I love you I,love,you
# 2: she loves me she,loves,me
For example, to write a function that removes the first word of each sentence, I did this:
myfun2 <- function(l){l[[1]][-1]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
# > DV2
# text charvec
# 1: I love you love,you
# 2: she loves me loves,me
The trouble is that the column charvec holds a list and not a vector:
> str(DU2[1]$charvec)
# List of 1
# $ : chr [1:3] "I" "love" "you"
1) How can I do what I want? Other kinds of functions I'm thinking of using include subsetting the character vector, applying a hash to it, etc.
2) By the way, can I get to DU2 or DV2 in one line instead of two?
3) I don't understand the data.table syntax well. Why is it that with list() inside the [..], the column V1 vanishes?
4) In another thread, I read a bit about the function cSplit. Is it any good? Is it a function adapted to data.table objects?
Thanks very much.
UPDATE
Thanks to @Ananda Mahto. Perhaps I should make my ultimate objective clearer. I have a huge file of 10,000,000 sentences stored as strings. As a first step for that project, I want to make a hash of the first 5 words of each sentence. 10,000,000 sentences wouldn't even fit in my memory, so I first split the file into 10 files of 1,000,000 sentences each, which would be around 10 files of 1 GB. The following code takes several minutes on my laptop just for a single file.
library(data.table); library(digest);
num_row=1000000
DT <- fread("sentences.txt",nrows=num_row,header=FALSE,sep="\t",colClasses="character")
DT=as.data.table(DT)
colnames(DT) <- "text"
setkey(DT,text)
rawdata <- DT
hash2 <- function(word){ #using library(digest)
as.numeric(paste("0x",digest(word,algo="murmur32"),sep=""))
}
Then:
print(system.time({
colnames(rawdata) <- "sentence"
rawdata <- lapply(rawdata,strsplit," ")
sentences_begin <- lapply(rawdata$sentence,function(x){x[2:6]})
hash_list <- sapply(sentences_begin,hash2)
# remove(rawdata)
})) ## end of print system.time for loading the data
I know I'm pushing R to its limits here, but I'm struggling to find faster implementations, and I was thinking about data.table features, hence all my questions.
Here is an implementation without lapply, but it's actually slower!
print(system.time({
myfun1 <- function(sentence){strsplit(sentence," ")}
DU1 <- DT[,myfun1(text),by=text]
DU2 <- DU1[,list(charvec=list(V1)),by=text]
myfun2 <- function(l){l[[1]][2:6]}
DV1 <- DU2[,myfun2(charvec),by=text]
DV2 <- DV1[,list(charvec=list(V1)),by=text]
rebuildsentence <- function(S){
paste(S,collapse=" ") }
myfun3 <- function(l){hash2(rebuildsentence(l[[1]]))}
DW1 <- DV2[,myfun3(charvec),by=text]
})) #end of system.time
This implementation uses a data.table and no lapply, so I hoped the hashing would be faster. However, because every column holds a list instead of a character vector, this may slow the whole thing down significantly (?).
Using the first code above (with lapply/sapply) took more than 1 hour on my laptop. I hoped to speed that up with a more efficient data structure. People using Python, Java, etc. do a similar job in a few seconds.
Of course, another road would be to find a faster hash function, but I assumed the one in the digest package was already optimized.
Answer 1:
I'm not really sure what you're after, but you can try cSplit_l from my "splitstackshape" package to get to your list column:
library(splitstackshape)
DU <- cSplit_l(DT, "DT", " ")
Then, you can write a function like the following to remove values from the list column:
RemovePos <- function(inList, pos = 1) {
lapply(inList, function(x) x[-c(pos[pos <= length(x)])])
}
Example usage:
DU[, list(RemovePos(DT_list, 1)), by = DT]
# DT V1
# 1: I love you love,you
# 2: she loves me loves,me
DU[, list(RemovePos(DT_list, 2)), by = DT]
# DT V1
# 1: I love you I,you
# 2: she loves me she,me
DU[, list(RemovePos(DT_list, c(1, 2))), by = DT]
# DT V1
# 1: I love you you
# 2: she loves me me
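As a side note on getting the list column in a single step with data.table alone, here is a minimal sketch. It assumes the column is named "text" as in the question, and the column name "rest" is made up for the illustration. Because strsplit() is vectorised and already returns a list with one element per row, wrapping it in .() assigns it as a list column:

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# strsplit() returns a list with one character vector per row;
# wrapping it in .() tells := to store that list as a single list column.
DT[, charvec := .(strsplit(text, " ", fixed = TRUE))]

# Drop the first word of every row ("rest" is a hypothetical column name).
DT[, rest := .(lapply(charvec, function(x) x[-1]))]

# [[ ]] returns the character vector itself rather than a list of length 1.
DT$charvec[[1]]
```

Each cell of a list column is necessarily a list element, which is why DU2[1]$charvec shows up as "List of 1"; indexing with [[1]] gets at the underlying character vector.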
Update
Based on your loathing of lapply, maybe you can try something like the following:
## make a copy of your "text" column
DT[, vals := text]
## Use `cSplit` to create a "long" dataset.
## Add a column to indicate the word's position in the text.
DTL <- cSplit(DT, "vals", " ", "long")[, ind := sequence(.N), by = text][]
DTL
# text vals ind
# 1: I love you I 1
# 2: I love you love 2
# 3: I love you you 3
# 4: she loves me she 1
# 5: she loves me loves 2
# 6: she loves me me 3
## Now, you can extract values easily
DTL[ind == 1]
# text vals ind
# 1: I love you I 1
# 2: she loves me she 1
DTL[ind %in% c(1, 3)]
# text vals ind
# 1: I love you I 1
# 2: I love you you 3
# 3: she loves me she 1
# 4: she loves me me 3
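If installing splitstackshape is not an option, roughly the same "long" layout can be sketched with data.table alone. This assumes every sentence in "text" is unique, since grouping by text would otherwise merge duplicated sentences into one group. The name "firstTwo" is only for illustration:

```r
library(data.table)

DT <- data.table(text = c("I love you", "she loves me"))

# One row per word, grouped by the original sentence.
DTL <- DT[, .(vals = unlist(strsplit(text, " ", fixed = TRUE))), by = text]

# Position of each word within its sentence.
DTL[, ind := seq_len(.N), by = text]

# Rebuild a string from the first two words of each sentence.
first_two <- DTL[ind <= 2, .(firstTwo = paste(vals, collapse = " ")), by = text]
first_two
```

The long layout keeps every column atomic, so filtering on ind and pasting back together stay ordinary grouped operations.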
Update 2
I don't know what type of timings you are getting, but as I mentioned in a comment, you can perhaps try using regular expressions so that you don't have to split and then paste the string back together.
Here's a sample....
Set up some data to play with:
library(data.table)
DT <- data.table(
text = c("This is a sentence with a lot of words.",
"This is a sentence with some more words.",
"Words and words and even some more words.",
"But, I don't know how you want to deal with punctuation...",
"Just one more sentence, for easy multiplication.")
)
DT2 <- rbindlist(replicate(10000/nrow(DT), DT, FALSE))
DT3 <- rbindlist(replicate(1000000/nrow(DT), DT, FALSE))
Test the gsub pattern to extract 5 words from each sentence....
## Regex to extract first five words -- this should work....
patt <- "^((?:\\S+\\s+){4}\\S+).*"
## Check out some of the timings
system.time(temp <- DT2[, gsub(patt, "\\1", text)])
# user system elapsed
# 0.03 0.00 0.03
system.time(temp2 <- DT3[, gsub(patt, "\\1", text)])
# user system elapsed
# 3 0 3
head(temp)
# [1] "This is a sentence with" "This is a sentence with" "Words and words and even"
# [4] "But, I don't know how" "Just one more sentence, for" "This is a sentence with"
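One behaviour of the pattern worth checking before relying on it: when a sentence has fewer than five words the pattern does not match, so the string is returned unchanged rather than truncated or dropped. The short sketch below uses made-up test sentences and passes perl = TRUE only to make the regex engine explicit:

```r
patt <- "^((?:\\S+\\s+){4}\\S+).*"

x <- c("This is a sentence with a lot of words.",  # more than five words
       "exactly five words right here",            # exactly five words
       "too short")                                # fewer than five words

# sub() is enough here because the pattern is anchored at the start.
res <- sub(patt, "\\1", x, perl = TRUE)
res
```

The first element is cut to its first five words, while the other two come back as they went in.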
My guess at what you're looking to do....
## I'm assuming you want something like this....
## Takes about a minute on my system.
## ... but note the system time for the creation of "temp2" (without digest)
## Not sure if I interpreted your hash requirement correctly....
system.time(out <- DT3[
, firstFive := gsub(patt, "\\1", text)][
, firstFiveHash := hash2(firstFive), by = 1:nrow(DT3)][])
# user system elapsed
# 62.14 0.05 62.20
head(out)
# text firstFive firstFiveHash
# 1: This is a sentence with a lot of words. This is a sentence with 4179639471
# 2: This is a sentence with some more words. This is a sentence with 4179639471
# 3: Words and words and even some more words. Words and words and even 2556713080
# 4: But, I don't know how you want to deal with punctuation... But, I don't know how 3765680401
# 5: Just one more sentence, for easy multiplication. Just one more sentence, for 298317689
# 6: This is a sentence with a lot of words. This is a sentence with 4179639471
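Since almost all of the time above goes into calling digest() once per row, one possible refinement is to group by the extracted string so that each distinct value is hashed only once. This is a sketch under the assumption that many rows share the same first five words, as they do in the sample data; with mostly unique sentences the gain would be small. It reuses the hash2 definition from the question:

```r
library(data.table)
library(digest)

# hash2 as defined in the question.
hash2 <- function(word) {
  as.numeric(paste0("0x", digest(word, algo = "murmur32")))
}

DT <- data.table(firstFive = c("This is a sentence with",
                               "Words and words and even",
                               "This is a sentence with"))

# Inside j, a grouping column has length 1, so digest() runs once per
# distinct string and the result is recycled to every row of the group.
DT[, firstFiveHash := hash2(firstFive), by = firstFive]
DT
```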
Source: https://stackoverflow.com/questions/33785594/manipulate-char-vectors-inside-a-data-table-object-in-r