text-mining

“RTextTools” create_matrix got an error

冷暖自知 submitted on 2019-12-03 12:04:35
Question: I was running the RTextTools package to build a text classification model. When I prepared the prediction dataset and tried to transform it into a matrix, I got this error:

Error in if (attr(weighting, "Acronym") == "tf-idf") weight <- 1e-09 : argument is of length zero

My code is as follows:

    table <- read.csv("traintest.csv", header = TRUE)
    dtMatrix <- create_matrix(table["COMMENTS"])
    container <- create_container(dtMatrix, table$LIKELIHOOD_TO_RECOMMEND, trainSize = 1:5000, testSize = 5001:10000, virgin…
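A workaround that is often suggested for this error (the assumption here is that the failure comes from a case mismatch between RTextTools and newer versions of the tm package, where the weighting attribute is spelled "acronym" rather than "Acronym") is to patch create_matrix in place with base R's trace():

    trace("create_matrix", edit = TRUE)
    # In the editor that opens, locate the line
    #   if (attr(weighting, "Acronym") == "tf-idf")
    # change "Acronym" to "acronym", then save and close.
    # After that, re-run create_matrix() on the prediction data.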

How to find the closest word to a vector using word2vec

被刻印的时光 ゝ submitted on 2019-12-03 11:00:39
Question: I have just started using word2vec and I was wondering how to find the word closest to a given vector. I have this vector, which is the average of a set of word vectors: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32). Is there a straightforward way to find the word in my training data that is most similar to this vector? Or is the only solution to calculate the cosine similarity between this vector and the vector of each word in my training data, then select the closest…
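The brute-force route the asker describes is easy to write by hand if you have the raw embedding matrix. A minimal sketch in R, assuming emb is a numeric matrix with one row per vocabulary word (row names are the words) and query is the averaged vector; both object names are invented for illustration:

    # cosine similarity between the query vector and every row of the embedding matrix
    cosine_to_all <- function(emb, query) {
      dots  <- as.numeric(emb %*% query)
      denom <- sqrt(rowSums(emb^2)) * sqrt(sum(query^2))
      dots / denom
    }
    sims <- cosine_to_all(emb, query)
    names(sims) <- rownames(emb)
    head(sort(sims, decreasing = TRUE), 5)   # the five closest words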

Counting syllables

北城以北 submitted on 2019-12-03 10:29:15
I'm looking to assign some different readability scores, such as Flesch-Kincaid, to text in R. Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves, just a count. So, for instance: x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle') would yield: 1, 1, 2, 2, 1, 3, each number corresponding to the number of syllables in the word.

Ari B. Friedman: Some tools for NLP are available here: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html. The task is non-trivial, though. More hints (including an algorithm you…
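As a rough baseline only (a heuristic sketch, not the algorithm referenced in the answer): counting maximal runs of vowels approximates English syllable counts. It happens to get all six example words right, but it miscounts words with silent endings such as "cake", so a dictionary- or rule-based package is preferable for real readability scoring:

    # crude syllable estimate: count maximal runs of vowels, treating y as a vowel
    count_syllables <- function(words) {
      sapply(gregexpr("[aeiouy]+", tolower(words)), function(m) sum(m > 0))
    }
    count_syllables(c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle'))
    # 1 1 2 2 1 3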

How to identify ideas and concepts in a given text

柔情痞子 submitted on 2019-12-03 09:41:31
Question: I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic or idea is mentioned in a body of text. For instance, if the text contained: "Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph?" It would be great to be able to detect that the person has asked for a photograph of Mr Jones. I could take a really naïve approach and…
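The "really naïve approach" the question alludes to usually amounts to keyword or pattern matching. A minimal sketch in R (the pattern is invented for illustration; detecting concepts robustly would need something like topic modelling or a trained classifier):

    text <- "It would also be useful if I could have a description of his appearance, or even better a photograph?"
    asked_for_photo <- grepl("photo(graph)?|picture|image", text, ignore.case = TRUE)
    asked_for_photo
    # TRUE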

Arabic text mining using R [closed]

那年仲夏 submitted on 2019-12-03 09:16:33
I am a new user and I just want some help with my work in R. I am doing Arabic text mining and would love help from anyone with experience in this field. So far I have failed to normalize the Arabic text, and R doesn't even print the Arabic characters in the console. I am stuck now and I don't know whether it would be better to switch tools, for example doing the mining in Weka, or to go some other way. Can anyone advise me whether anyone has achieved anything in mining Arabic text using R? By the way, I am working on analysis of an Arabic tweets data set. It took me one month to fetch the data, and I don't know how long will…
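Arabic not showing in the console is usually a locale/encoding problem rather than a limitation of R's text-mining packages. A minimal sketch, under the assumption that the tweets are stored as UTF-8 in a file called tweets.csv with a column named text (both names are invented for illustration):

    Sys.setlocale("LC_CTYPE", "Arabic")   # Windows; on Linux/macOS use a UTF-8 locale such as "en_US.UTF-8"
    tweets <- read.csv("tweets.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
    Encoding(tweets$text)                 # check how R has tagged the strings
    tweets$text <- enc2utf8(tweets$text)  # force UTF-8 before passing the text to tm or similar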

How to scrape web content and then count frequencies of words in R?

血红的双手。 submitted on 2019-12-03 08:51:53
This is my code:

    library(XML)
    library(RCurl)
    url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
    blog <- getURL(url.link)
    blog <- htmlParse(blog, encoding = "UTF-8")
    titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
    traverse_each_page <- function(x){
      tmp <- htmlParse(x)
      xpathApply(tmp, '//div[@id="mainContent"]')
    }
    pages <- lapply(titles[2:3], traverse_each_page)

Here is the pseudocode:
1. Take an xml document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link
3. Parse the html content of each link
4. Extract the text inside div id="mainContent"
5. Count the frequencies of each word that…
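A sketch of the remaining steps 4 and 5, staying with the XML/RCurl approach from the question (keeping only the subset titles[2:3] and stopping at the 20 most frequent words are illustrative choices, not part of the original code):

    get_text <- function(x) {
      tmp  <- htmlParse(getURL(x), encoding = "UTF-8")
      node <- xpathSApply(tmp, '//div[@id="mainContent"]', xmlValue)
      paste(node, collapse = " ")
    }
    texts <- sapply(titles[2:3], get_text)
    # lower-case, split on anything that is not a letter or apostrophe, drop empties
    words <- unlist(strsplit(tolower(paste(texts, collapse = " ")), "[^a-z']+"))
    words <- words[nchar(words) > 0]
    freq  <- sort(table(words), decreasing = TRUE)
    head(freq, 20)   # 20 most frequent words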

Wordcloud with a specific shape [closed]

☆樱花仙子☆ submitted on 2019-12-03 07:52:20
Question: [Closed as off-topic; not currently accepting answers. Closed 2 years ago.] Suppose I have a data frame which contains some words with their frequencies. I want to create a wordcloud in R with the words inside the shape of a logo, for example the Twitter logo, just like this:

Answer 1: You can use the wordcloud2 package for that. It allows you to use any image as the mask. Just put in the…
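A minimal sketch of that suggestion, assuming df is a data frame whose first column holds the words and second column the frequencies, and that twitter.png is a local silhouette image of the logo (the object name and the file are assumptions for illustration):

    library(wordcloud2)
    wordcloud2(df, figPath = "twitter.png", size = 1.5, color = "skyblue")
    # renders as an HTML widget in the RStudio Viewer or a browser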

Row sum for large term-document matrix / simple_triplet_matrix ?? {tm package}

空扰寡人 submitted on 2019-12-03 07:06:20
Question: So I have a very large term-document matrix:

    > class(ph.DTM)
    [1] "TermDocumentMatrix"    "simple_triplet_matrix"
    > ph.DTM
    A term-document matrix (109996 terms, 262811 documents)
    Non-/sparse entries: 3705693/28904453063
    Sparsity            : 100%
    Maximal term length : 191
    Weighting           : term frequency (tf)

How do I get the row sum (frequency) of each term? I tried:

    > apply(ph.DTM, 1, sum)
    Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
    In addition: Warning message:
    In nr * nc : NAs produced by…
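apply() coerces the sparse matrix to a dense one, which is what fails at this size. A minimal sketch of the usual alternative, using the slam package that already backs the simple_triplet_matrix class:

    library(slam)
    term_freq <- row_sums(ph.DTM)             # one total frequency per term, computed on the sparse representation
    head(sort(term_freq, decreasing = TRUE))  # most frequent terms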

Python or Java for text processing (text mining, information retrieval, natural language processing) [closed]

两盒软妹~` submitted on 2019-12-03 05:17:31
Question: [Closed as opinion-based; not currently accepting answers. Closed 6 years ago.] I'm soon to start on a new project where I am going to do lots of text-processing tasks like searching, categorization/classification, clustering, and so on. There is going to be a huge number of documents to process, probably millions of documents. After the…

Are there APIs for text analysis/mining in Java? [closed]

痴心易碎 submitted on 2019-12-03 03:43:43
Question: [Closed 6 years ago as not a good fit for the Q&A format; not currently accepting answers.] I want to know if there is an API for text analysis in Java, something that can extract all the words in a text, separate words,…