Feature hashing in R for Text classification

Posted by 佐手、 on 2020-02-24 09:14:49

Question


I'm trying to implement feature hashing in R to help me with a text classification problem, but I'm not sure I'm doing it the right way. Part of my code is based on this post: Hashing function for mapping integers to a given range?.

My code:

random.data = function(n = 200, wlen = 40, ncol = 10){

  random.word = function(n){
    paste0(sample(c(letters, 0:9), n, TRUE), collapse = '')
  } 
  matrix(replicate(n, random.word(wlen)), ncol = ncol)   
}

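feature_hash = function(doc, N){

  doc = as.matrix(doc)
  library(digest)

  # Hash every word with md5, keep the last 5 hex digits, convert them to an
  # integer and reduce modulo (N + 1), giving an index in 0..N per word
  idx = matrix(strtoi(substr(sapply(doc, digest), 28, 32), 16L) %% (N + 1), ncol = ncol(doc))
  # For each row, count the words whose index equals r, for r = 1..N
  sapply(1:N, function(r)apply(idx, 1, function(v)sum(v == r)))
}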

set.seed(1)
doc = random.data(50, 16, 5)
feature_hash(doc, 3)

       [,1] [,2] [,3]
 [1,]    2    0    1
 [2,]    2    1    1
 [3,]    2    0    1
 [4,]    0    2    1
 [5,]    1    1    1
 [6,]    1    0    1
 [7,]    1    2    0
 [8,]    2    0    0
 [9,]    3    1    0
[10,]    2    1    0

So, I'm basically converting the strings to integers using the last 5 hex digits of the MD5 hash returned by digest. Questions:

1 - Is there any package that can do this for me? I haven't found any.
2 - Is it a good idea to use digest as the hash function? If not, what can I do?
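
For reference, here is a minimal base-R sketch of the bucket-indexing step. In the code above, `%% (N + 1)` produces indices from 0 to N while only columns 1 to N are counted, so any word hashed to 0 is dropped; that would explain why some rows of the printed matrix sum to less than 5. The sketch maps hashes to 1..N instead. The names `string_hash` and `hash_counts` are hypothetical, and the polynomial rolling hash is only an assumed stand-in for digest so that the example has no dependencies.

```r
# Hypothetical helper: polynomial rolling hash of a string. The modulus keeps
# every intermediate value far below 2^53, so double arithmetic stays exact.
string_hash = function(s, base = 31, mod = 2147483647) {
  h = 0
  for (code in utf8ToInt(s)) {
    h = (h * base + code) %% mod
  }
  h
}

# Count, for every row of doc, how many of its words fall into each of N buckets.
hash_counts = function(doc, N) {
  doc = as.matrix(doc)
  h = vapply(doc, string_hash, numeric(1))
  idx = matrix(h %% N + 1, ncol = ncol(doc))   # bucket index in 1..N
  out = matrix(0L, nrow = nrow(idx), ncol = N)
  for (j in seq_len(ncol(idx))) {
    # one (row, bucket) cell per word of column j
    cell = cbind(seq_len(nrow(idx)), idx[, j])
    out[cell] = out[cell] + 1L
  }
  out
}

doc = matrix(c("red", "green", "blue", "red", "red", "cyan"), ncol = 3)
counts = hash_counts(doc, 4)
rowSums(counts)  # every word is counted once, so each row sums to ncol(doc)
```

Because no word is discarded, the row sums equal the number of words per row, which is a cheap sanity check on any hashing scheme of this kind.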

PS: I should test whether it works before posting, but my files are quite big and take a long time to process, so I think it's smarter to have someone point me in the right direction, because I'm sure I'm doing it wrong!

Thanks for any help on this!


Answer 1:


I don't know of any existing CRAN package for this.

However, I wrote a package for myself to do feature hashing. The source code is here: https://github.com/wush978/FeatureHashing, but the API is different.

In my case, I use it to convert a data.frame to a CSRMatrix, a customized sparse matrix class in the package. I also implemented a helper function to convert the CSRMatrix to a Matrix::dgCMatrix. For text classification, I guess a sparse matrix will be more suitable.
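
To illustrate why a sparse matrix fits here, the following is a minimal sketch that builds hashed counts directly as a Matrix::dgCMatrix. It does not use the FeatureHashing API; `toy_hash` and `hashed_sparse` are hypothetical names, and the character-code sum is only an assumed placeholder for a real hash function.

```r
library(Matrix)

# Hypothetical stand-in for a hash function: sum of the character codes.
# It only keeps the sketch free of extra dependencies.
toy_hash = function(s) as.numeric(sum(utf8ToInt(s)))

hashed_sparse = function(doc, N) {
  doc = as.matrix(doc)
  # bucket index in 1..N for every word, in column-major order
  bucket = unname(vapply(doc, toy_hash, numeric(1))) %% N + 1
  # sparseMatrix() adds up x for repeated (i, j) pairs, which yields counts
  sparseMatrix(i = as.vector(row(doc)), j = bucket,
               x = rep(1, length(bucket)), dims = c(nrow(doc), N))
}

doc = matrix(c("red", "green", "blue", "red", "red", "cyan"), ncol = 3)
m = hashed_sparse(doc, 8)
class(m)    # a dgCMatrix
rowSums(m)  # each row sums to ncol(doc)
```

Only the non-zero cells are stored, so memory use grows with the number of words rather than with the number of buckets.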

If you want to try it, please check the test script here: https://github.com/wush978/FeatureHashing/blob/master/tests/test-conver-to-dgCMatrix.R

Note that I have only used it on Ubuntu, so I don't know whether it works on Windows or Mac. Please feel free to ask me any questions about the package at https://github.com/wush978/FeatureHashing/issues.



Source: https://stackoverflow.com/questions/26446728/feature-hashing-in-r-for-text-classification
