quanteda

R: constructing a document-term matrix and matching dictionaries whose values consist of whitespace-separated phrases

好久不见. Submitted on 2019-12-01 12:03:05
Question: When doing text mining in R, after preprocessing the text data, we need to create a document-term matrix for further exploration. As in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning"; if you segment them into individual words, they take on entirely different meanings. I want to know how to match pre-defined dictionaries whose values consist of whitespace-separated terms, such as "semantic distance" and "machine learning". If a document is "we could
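One way to approach the phrase-matching problem described above is sketched here with quanteda. The example document and the dictionary keys (`semantic_distance`, `machine_learning`) are hypothetical and do not come from the original question; the sketch relies on `tokens_lookup()` matching multi-word dictionary values as token sequences.

```r
library(quanteda)

# Hypothetical example document (not taken from the original question)
txt <- c(doc1 = "semantic distance is a common feature in machine learning models")

# Dictionary whose values are whitespace-separated phrases
dict <- dictionary(list(
  semantic_distance = "semantic distance",
  machine_learning  = "machine learning"
))

# Tokenize first, so that phrases can be matched as sequences of tokens
toks <- tokens(txt)

# Replace each matched phrase with its dictionary key; unmatched tokens are dropped
matched <- tokens_lookup(toks, dictionary = dict)

# Document-term matrix with one column per dictionary key
dtm <- dfm(matched)
print(dtm)
```

If the remaining single words should stay in the matrix alongside the phrases, `tokens_compound(toks, pattern = phrase(c("semantic distance", "machine learning")))` may be a better fit, since it joins each phrase into a single token instead of discarding everything else.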

How to calculate proximity of words to a specific term in a document

落花浮王杯 Submitted on 2019-12-01 10:55:54
I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text: song <- "Far over the misty mountains cold To dungeons deep and caverns old We must away ere break of day To seek the pale enchanted gold. The dwarves of yore made mighty spells, While hammers fell like ringing bells In places deep, where dark things sleep, In hollow halls beneath the fells. For
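A minimal base R sketch of one possible reading of the question follows. It assumes "proximity" means the distance in tokens to the nearest occurrence of the target term, uses only the opening of the quoted text, and picks `mountains` as a purely hypothetical target; none of these choices come from the original post.

```r
# Opening of the text quoted in the question
song <- "Far over the misty mountains cold To dungeons deep and caverns old"

# Lowercase, then split on any run of non-letter characters
words <- strsplit(tolower(song), "[^a-z]+")[[1]]
words <- words[words != ""]

# Hypothetical target term (an assumption for illustration)
target <- "mountains"
target_pos <- which(words == target)

# Distance in tokens from each word to the nearest occurrence of the target
dist_to_target <- sapply(seq_along(words),
                         function(i) min(abs(i - target_pos)))

# Average distance per distinct word, excluding the target itself
keep <- words != target
avg_proximity <- tapply(dist_to_target[keep], words[keep], mean)

print(sort(avg_proximity))
```

With a longer text, a word that occurs several times gets the mean of its distances, which is one way to read "average proximity (by word)". The sketch assumes the target occurs at least once; otherwise `min()` returns `Inf` with a warning.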

R: LIME returns error on different feature numbers when it's not the case

拟墨画扇 Submitted on 2019-12-01 10:43:24
Question: I'm building a text classifier of Clinton & Trump tweets (data can be found on Kaggle). I'm doing EDA and modelling using the quanteda package:

    library(dplyr)
    library(stringr)
    library(readr)      # provides read_csv()
    library(lubridate)  # provides as_date(), hour(), hms()
    library(quanteda)
    library(lime)

    # data prep
    tweet_csv <- read_csv("tweets.csv")
    tweet_data <- tweet_csv %>%
      select(author = handle, text, retweet_count, favorite_count,
             source_url, timestamp = time) %>%
      mutate(date = as_date(str_sub(timestamp, 1, 10)),
             hour = hour(hms(str_sub(timestamp, 12, 19))),
             tweet_num = row_number())

How to calculate proximity of words to a specific term in a document

心已入冬 Submitted on 2019-12-01 08:54:23
Question: I am trying to figure out a way to calculate word proximities to a specific term in a document as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need or even points me somewhere helpful. So let's say I have the following text: song <- "Far over the misty mountains cold To dungeons deep and caverns old We must away ere break of day To seek the pale enchanted gold. The dwarves of yore made mighty spells, While hammers