quanteda | 易学教程

R: constructing a document-term matrix, and how to match dictionaries whose values consist of white-space separated phrases

Question: When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into single words, they take on totally different meanings. I want to know how to match predefined dictionaries whose values consist of white-space separated terms, such as "semantic distance" and "machine learning". If a document is "we could
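One possible way to approach this kind of phrase matching, sketched here under assumptions rather than taken from the original thread, is to join each multi-word phrase into a single token before building the matrix. The example document and the variable names `txt`, `phrases`, `toks`, and `dtm` are hypothetical; `tokens()`, `tokens_compound()`, `phrase()`, `dfm()`, and `featnames()` are quanteda functions.

```r
library(quanteda)

# Hypothetical example document and phrase list (not from the original question)
txt <- c(d1 = "machine learning can estimate semantic distance between documents")
phrases <- c("machine learning", "semantic distance")

# Tokenize, then join each multi-word phrase into one token,
# e.g. "machine learning" becomes "machine_learning"
toks <- tokens(txt)
toks <- tokens_compound(toks, pattern = phrase(phrases), concatenator = "_")

# Build the document-feature (document-term) matrix from the compounded tokens
dtm <- dfm(toks)
featnames(dtm)
```

With this approach the phrases appear as single features in the matrix, alongside the remaining single words.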

How to calculate proximity of words to a specific term in a document

I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text: song <- "Far over the misty mountains cold To dungeons deep and caverns old We must away ere break of day To seek the pale enchanted gold. The dwarves of yore made mighty spells, While hammers fell like ringing bells In places deep, where dark things sleep, In hollow halls beneath the fells. For
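As a rough illustration of what such a calculation could look like, the base R sketch below measures proximity as the number of token positions between each word and the nearest occurrence of the target term. That definition of proximity, the choice of "mountains" as the target, and the names `target`, `dist_to_target`, and `avg_proximity` are assumptions made for illustration, not something stated in the question; only the first two lines of the song are used.

```r
# First two lines of the quoted song; the target term is a hypothetical choice
txt <- "Far over the misty mountains cold To dungeons deep and caverns old"
target <- "mountains"

# Lowercase, strip punctuation, and split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "\\s+")[[1]]
target_pos <- which(words == target)

# Distance (in tokens) from each word position to the nearest target occurrence
dist_to_target <- sapply(seq_along(words),
                         function(i) min(abs(i - target_pos)))

# Average distance per distinct word, excluding the target itself
keep <- words != target
avg_proximity <- tapply(dist_to_target[keep], words[keep], mean)
sort(avg_proximity)
```

If the target term does not occur in the text, `target_pos` is empty and the distances come out as `Inf`, so a real implementation would want to check for that case first.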

R: LIME returns error on different feature numbers when it's not the case

Question: I'm building a text classifier of Clinton & Trump tweets (data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package: library(readr) library(dplyr) library(stringr) library(lubridate) library(quanteda) library(lime) #data prep tweet_csv <- read_csv("tweets.csv") tweet_data <- tweet_csv %>% select(author = handle, text, retweet_count, favorite_count, source_url, timestamp = time) %>% mutate(date = as_date(str_sub(timestamp, 1, 10)), hour = hour(hms(str_sub(timestamp, 12, 19))), tweet_num = row_number())
