term-document-matrix

How can I tell Solr to return the hit search terms per document?

血红的双手。 Submitted on 2019-11-27 14:05:31
I have a question about queries in Solr. When I perform a query with multiple search terms that are all logically linked by OR (e.g. q=content:(foo OR bar OR foobar)), Solr returns a list of documents that each match at least one of these terms. But what Solr does not return is which documents were hit by which term(s). So in the example above, what I want to know is which documents in my result list contain the term foo, and so on. Given this information I would be able to create a term-document matrix. So my question is: how can I tell Solr to give me that missing piece of information? I'm sure it is …
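The excerpt ends mid-sentence, but one common approach (not from the post itself) is Solr's highlighting component: with hl=true, each returned document carries snippets in which the matched query terms are wrapped in <em> tags, which is enough to reconstruct a term-document matrix. A minimal R sketch, assuming a local Solr instance with a core named mycore and a content field (both names are placeholders):

```r
library(httr)
library(jsonlite)

# Query Solr with highlighting enabled so each hit reports its matched terms.
resp <- GET(
  "http://localhost:8983/solr/mycore/select",
  query = list(
    q       = "content:(foo OR bar OR foobar)",
    wt      = "json",
    hl      = "true",
    "hl.fl" = "content"
  )
)
res <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# res$highlighting is keyed by document id; each snippet wraps the matched
# term(s) in <em> tags, so parsing those tags tells you which of foo, bar,
# foobar hit each document -- the cells of a term-document matrix.
str(res$highlighting)
```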

list of word frequencies using R

匆匆过客 Submitted on 2019-11-27 12:09:50
I have been using the tm package to run some text analysis. My problem is with creating a list of words and the frequencies associated with them. This is my code so far:

```r
library(tm)
library(RWeka)

txt <- read.csv("HW.csv", header = T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"

myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'), "originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

# building the TDM with trigram tokens
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
```
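The excerpt stops once the term-document matrix is built. For the frequency list itself, a minimal sketch of the usual tm idiom, assuming the myTdm object from the code above:

```r
# Row sums of the term-document matrix give each term's total frequency
# across the corpus; sorting yields the frequency list the question asks for.
freq <- sort(rowSums(as.matrix(myTdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq, row.names = NULL)
head(wordFreq, 10)
```

For very large matrices, slam::row_sums(myTdm) computes the same totals without densifying the sparse matrix first.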

findAssocs for multiple terms in R

被刻印的时光 ゝ Submitted on 2019-11-27 03:41:24
In R I used the tm package for building a term-document matrix from a corpus of documents. My goal is to extract word associations from all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking for a variable that holds all row.names of the matrix, so that the function findAssocs() can do its job. This is my code so far:

```r
library(tm)
library(RWeka)

txtData <- read.csv("file.csv", header = T, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData$text))  # "text" column assumed
```
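The excerpt cuts off before the matrix is built, but for the step the question actually asks about, here is a minimal sketch assuming a finished term-document matrix txtTdm (the object name and the 0.3 correlation cutoff are illustrative):

```r
# Terms() returns every term (i.e. every row name) of the matrix, and
# findAssocs() accepts a whole vector of terms, returning a named list
# of associations for each.
allTerms <- Terms(txtTdm)
assocs   <- findAssocs(txtTdm, terms = allTerms, corlimit = 0.3)

# findAssocs() sorts each element by decreasing correlation,
# so head() per element keeps the top three associations per term.
topThree <- lapply(assocs, head, 3)
```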
