term-document-matrix

R : Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix in R

Submitted by 僤鯓⒐⒋嵵緔 on 2020-01-21 19:40:26
Question: I have a corpus of 39 text files named by year - 1945.txt, 1978.txt, ..., 2013.txt. I've imported them into R and created a Document Term Matrix using the tm package. I'm trying to investigate how the words associated with the term 'fraud' have changed over the years from 1945 to 2013. The desired output would be a 39 × 10 (or 39 × 5) matrix with years as row titles and the top 10 or 5 terms as columns. Any help would be greatly appreciated. Thanks in advance. Structure of my TDM:

> str(ytdm)
List of 6
 $ i : int [1:6791 …
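
A minimal sketch of one way to get the desired shape, assuming ytdm is the terms × documents matrix from the question; the ranking logic and the findAssocs call are illustrative, not the asker's final code:

library(tm)

m <- as.matrix(ytdm)                        # rows = terms, columns = years
top10 <- t(apply(m, 2, function(counts) {
  names(sort(counts, decreasing = TRUE))[1:10]
}))                                         # 39 x 10: top terms per year
findAssocs(ytdm, "fraud", corlimit = 0.5)   # terms correlated with 'fraud' across years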

Frequency Per Term - R TM DocumentTermMatrix

Submitted by ⅰ亾dé卋堺 on 2020-01-13 11:33:15
Question: I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package; it has the term frequencies and the terms inside it, but I cannot figure out how to access them. Ideally, I would like:

Term   #
"the"  200
"is"   400
"a"    200

Currently my code is:

library(tm)
common.words <- c("amp", "@RT", "I", "http", "https", stopwords("english"), "you")
x <- Corpus(VectorSource(results))
x <- tm_map(x, stripWhitespace)
x <- tm_map(x, removeNumbers)
x <- …
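
A minimal sketch of the usual access pattern, assuming dtm is the DocumentTermMatrix built from the cleaned corpus x (names beyond the question's own are illustrative):

dtm  <- DocumentTermMatrix(x)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)   # total count per term
head(data.frame(Term = names(freq), Count = freq, row.names = NULL), 10)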

Find frequency of a custom word in R TermDocumentMatrix using TM package

Submitted by 喜欢而已 on 2020-01-05 04:28:10
Question: I turned about 50,000 rows of varchar data into a corpus, and then proceeded to clean said corpus using the tm package, getting rid of stopwords, punctuation, and numbers. I then turned it into a TermDocumentMatrix and used the functions findFreqTerms and findMostFreqTerms to run text analysis. findMostFreqTerms returns the common words and the number of times each shows up in the data. However, I want to use a function that searches for "word" and returns how many times "word" appears in …
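
A minimal sketch, assuming tdm is the TermDocumentMatrix described in the question and "word" stands in for the custom term:

m <- as.matrix(tdm)   # dense view is fine at this scale; terms are rows in a TDM
sum(m["word", ])      # total occurrences of "word" across all documents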

Twitter data: error in TermDocumentMatrix

Submitted by 五迷三道 on 2019-12-25 03:43:20
Question:

# search for a term in twitter
rdmTweets <- searchTwitteR("machine learning", n=500, lang="en")
dtm.control <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  removestopWords = TRUE,
  stemming = TRUE,   # false for sentiment
  wordLengths = c(3, "inf"))
# create a dataframe around the results
df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
# Here are the columns
names(df)
# And some example content
head(df, 10)
counts = table(df$screenName)
barplot(counts)
# Plot the …
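
Two likely culprits in that control list, shown corrected below as a sketch (the option names are tm's documented ones; this is not the thread's accepted answer): tm has no removestopWords option - the valid one is stopwords - and the upper word length must be the numeric Inf, not the string "inf".

dtm.control <- list(
  tolower = TRUE,
  removePunctuation = TRUE,
  removeNumbers = TRUE,
  stopwords = TRUE,          # the documented tm option name
  stemming = TRUE,
  wordLengths = c(3, Inf))   # numeric Inf, not "inf"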

How to read and write a TermDocumentMatrix in R?

Submitted by 自古美人都是妖i on 2019-12-24 06:36:20
Question: I made a wordcloud from a CSV file in R, using the TermDocumentMatrix method in the tm package. Here is my code:

csvData <- read.csv("word", encoding = "UTF-8", stringsAsFactors = FALSE)
Encoding(csvData$content) <- "UTF-8"
# useSejongDic() - KoNLP package
nouns <- sapply(csvData$content, extractNoun, USE.NAMES = F)
# create Corpus
myCorpus <- Corpus(VectorSource(nouns))
myCorpus <- tm_map(myCorpus, removePunctuation)
# remove numbers
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove StopWord …
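
For the read-and-write part of the title, a minimal sketch using base R serialization (the file name is illustrative):

tdm <- TermDocumentMatrix(myCorpus)
saveRDS(tdm, "tdm.rds")       # write the TermDocumentMatrix to disk
tdm2 <- readRDS("tdm.rds")    # read it back with structure intact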

R - slowly working lapply with sort on ordered factor

Submitted by 痴心易碎 on 2019-12-23 15:56:48
Question: Based on the question "More efficient means of creating a corpus and DTM" I've prepared my own method for building a Term Document Matrix from a large corpus which (I hope) does not require Terms x Documents memory.

sparseTDM <- function(vc){
  id = unlist(lapply(vc, function(x){x$meta$id}))
  content = unlist(lapply(vc, function(x){x$content}))
  out = strsplit(content, "\\s", perl = T)
  names(out) = id
  lev.terms = sort(unique(unlist(out)))
  lev.docs = id
  v1 = lapply(
    out,
    function(x, lev) {
      sort(as …
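
A minimal alternative sketch that avoids the slow per-document sort on an ordered factor by building the sparse matrix in one shot with the Matrix package (the toy out list mirrors the tokenized structure the function computes):

library(Matrix)

out <- list(doc1 = c("a", "b", "a"), doc2 = c("b", "c"))   # tokens per document
lev.terms <- sort(unique(unlist(out)))

i <- match(unlist(out), lev.terms)       # term index of every token
j <- rep(seq_along(out), lengths(out))   # document index of every token
tdm <- sparseMatrix(i = i, j = j, x = 1,
                    dims = c(length(lev.terms), length(out)),
                    dimnames = list(lev.terms, names(out)))
# duplicate (i, j) pairs are summed, so entries become term counts per document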

How to build a Term-Document-Matrix from a set of texts and a specific set of terms (tags)?

Submitted by 眉间皱痕 on 2019-12-20 14:41:54
Question: I have two sets of data: a set of tags (single words like php, html, etc.) and a set of texts. I now wish to build a Term-Document-Matrix representing the number of occurrences of the tag elements in the text elements. I have looked into the R library tm and the TermDocumentMatrix function, but I do not see the possibility to specify the tags as input. Is there a way to do that? I am open to any tool (R, Python, other), although using R would be great. Let's set the data as:

TagSet <- data.frame(c("c", …
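
tm supports exactly this through the dictionary control option, which restricts the matrix to a supplied term list; a minimal sketch with toy texts (the sample sentences are illustrative):

library(tm)

tags  <- c("c", "php", "html")
texts <- c("php and html appear here", "only c appears here")
corp  <- VCorpus(VectorSource(texts))
tdm   <- TermDocumentMatrix(corp,
           control = list(dictionary = tags,
                          wordLengths = c(1, Inf)))   # lowered so the one-letter tag "c" survives tm's default 3-character minimum
inspect(tdm)   # rows are limited to the tag set, with counts per text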

Does tm automatically ignore very short strings?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-20 07:26:42
Question: Here is my code.

Example 1:

a <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = T))
inspect(a2)

The result is:

          Docs
Terms      1 2
  12v      0 1
  a23      0 1
  alkalin  0 1
  batteri  0 1
  energ    0 1

Looks like the first string in a is ignored.

Example 2:

a <- c("abcd cde de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = T))
inspect(a2)

The result is:

          Docs …
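
Yes - tm's default token filter is wordLengths = c(3, Inf), so one- and two-character tokens such as "ab", "cd", and "de" are dropped before counting. A minimal sketch that keeps them (lowering the bound is the only change to the question's call):

library(tm)

a  <- c("ab cd de", "ENERGIZER A23 12V ALKALINE BATTERi")
a1 <- VCorpus(VectorSource(a))
a2 <- TermDocumentMatrix(a1, control = list(stemming = TRUE, wordLengths = c(1, Inf)))
inspect(a2)   # the short tokens from the first string now appear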

How can I tell Solr to return the hit search terms per document?

Submitted by 白昼怎懂夜的黑 on 2019-12-17 09:47:13
Question: I have a question about queries in Solr. When I perform a query with multiple search terms that are all logically linked by OR (e.g. q=content:(foo OR bar OR foobar)), Solr returns a list of documents that each match any of these terms. But what Solr does not return is which documents were hit by which term(s). So in the example above, what I want to know is which documents in my result list contain the term foo, and so on. Given this information I would be able to create a term-document …
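
One common approach (a sketch, not the thread's accepted answer) is Solr's standard highlighting feature: adding the parameters below makes the response include, for each hit, snippets of the content field with the matching terms wrapped in <em> tags, from which a per-document list of hit terms can be assembled.

q=content:(foo OR bar OR foobar)&hl=true&hl.fl=content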