R: constructing a document-term matrix that matches dictionaries whose values consist of whitespace-separated phrases

好久不见. Submitted on 2019-12-01 12:03:05

Question


When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. As in Chinese, English also has fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they take on entirely different meanings. I want to know how to match a pre-defined dictionary whose values consist of whitespace-separated terms, for example one containing "semantic distance" and "machine learning". If a document is "we could use machine learning method to calculate the words semantic distance", then applying the dictionary ["semantic distance", "machine learning"] to this document should return a 1x2 matrix: [semantic distance, 1; machine learning, 1].


Answer 1:


It's possible to do this with quanteda, although it requires constructing a dictionary with an entry for each phrase, and then pre-processing the text to convert the phrases into tokens. To become a "token", the words of a phrase need to be joined by something other than whitespace -- here, the "_" character.

Here are some example texts, including the phrase in the OP. I added two additional texts for the illustration -- below, the first row of the document-feature matrix produces the requested answer.

library(quanteda)

txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")

The pattern argument of tokens_compound() accepts the phrases as, among other things, a dictionary or a collocations object. Here we will make it a dictionary:

mydict <- dictionary(list(machine_learning = "machine learning", 
                          semantic_distance = "semantic distance"))

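Then we pre-process the text so that each dictionary phrase is joined into a single token, using the default "_" concatenator (note that the original case is kept at this stage, e.g. "Machine_learning"):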

toks <- tokens(txt) %>%
    tokens_compound(mydict)
toks
# tokens from 3 documents.
# text1 :
# [1] "We"                "could"             "use"               "machine_learning" 
# [5] "method"            "to"                "calculate"         "the"              
# [9] "words"             "semantic_distance" "."                
# 
# text2 :
# [1] "Machine_learning" "is"               "the"              "best"            
# [5] "sort"             "of"               "learning"         "."               
# 
# text3 :
# [1] "The"               "distance"          "between"           "semantic_distance"
# [5] "and"               "machine_learning"  "is"                "machine"          
# [9] "driven"            "."    
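
As an aside, the same compounding can be sketched without building a dictionary. This assumes a quanteda release that provides phrase(), which marks whitespace-separated strings as multi-token sequences; the object names phr and toks_alt are illustrative only and not part of the original answer.

```r
library(quanteda)

txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.")

# Assumption: phrase() is available in the installed quanteda version.
# It tells the matcher that each string is a sequence of tokens, not one token.
phr <- phrase(c("machine learning", "semantic distance"))

# Join every matched sequence with the default "_" concatenator
toks_alt <- tokens_compound(tokens(txt), pattern = phr)
toks_alt
```

Matching is case-insensitive by default, so "Machine learning" in the second text is compounded as well.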

Finally, we can construct the document-feature matrix, keeping all phrases using the default "glob" pattern match for any feature that includes the underscore character:

mydfm <- dfm(toks, select = "*_*")
mydfm
## Document-feature matrix of: 3 documents, 2 features.
## 3 x 2 sparse Matrix of class "dfm"
##        features
## docs    machine_learning semantic_distance
##   text1                1                 1
##   text2                1                 0
##   text3                1                 1
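
If only the phrase counts are needed, a possible alternative is to look the dictionary up directly rather than compounding first. The following is a minimal sketch, assuming a quanteda version in which tokens_lookup() matches multi-word dictionary values; the names toks_dict and dfm_dict are illustrative.

```r
library(quanteda)

txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")

mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))

# Replace each matched phrase by its dictionary key; unmatched tokens are dropped
toks_dict <- tokens_lookup(tokens(txt), dictionary = mydict)

# Count the keys per document
dfm_dict <- dfm(toks_dict)
dfm_dict
```

With this approach the features are the dictionary keys themselves, so no glob selection on "_" is required afterwards.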

(Answer updated for quanteda >= v0.9.9)
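
A note for newer quanteda releases: the select argument of dfm() was deprecated around v3.0, so the last step may need dfm_select() instead. Below is a minimal sketch of the whole pipeline under that assumption, written without the pipe operator so that it does not depend on %>% being available; the name mydfm_new is illustrative.

```r
library(quanteda)

txt <- c("We could use machine learning method to calculate the words semantic distance.",
         "Machine learning is the best sort of learning.",
         "The distance between semantic distance and machine learning is machine driven.")

mydict <- dictionary(list(machine_learning = "machine learning",
                          semantic_distance = "semantic distance"))

# Join the dictionary phrases into single "_"-connected tokens
toks <- tokens_compound(tokens(txt), pattern = mydict)

# Build the dfm (lowercases by default), then keep only features containing "_"
mydfm_new <- dfm_select(dfm(toks), pattern = "*_*", valuetype = "glob")
mydfm_new
```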



Source: https://stackoverflow.com/questions/36732659/r-construct-document-term-matrix-how-to-match-dictionaries-whose-values-consist
