text-mining

URL path similarity/string similarity algorithm

巧了我就是萌 submitted on 2019-12-09 18:55:51
Question: My problem is that I need to compare URL paths and deduce whether they are similar. Below I provide example data to process:

# GROUP 1
/robots.txt

# GROUP 2
/bot.html

# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php

# GROUP 4
//phpMyAdmin/

I tried Levenshtein
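The question breaks off at the Levenshtein attempt. As a hedged sketch of why plain character-level Levenshtein struggles here (the long shared prefix dominates the score) and of a segment-wise alternative, here is a minimal Python example; splitting on "/" and using difflib.SequenceMatcher are my assumptions, not the asker's code:

from difflib import SequenceMatcher

def path_similarity(a, b):
    # Compare URL paths segment by segment rather than character by
    # character, so a version-suffix change inside one segment barely
    # moves the score while a different file name moves it a lot.
    seg_a = [s for s in a.split("/") if s]
    seg_b = [s for s in b.split("/") if s]
    if not seg_a or not seg_b:
        return 1.0 if seg_a == seg_b else 0.0
    total = sum(SequenceMatcher(None, x, y).ratio()
                for x, y in zip(seg_a, seg_b))
    # Dividing by the longer length penalises paths of different depth.
    return total / max(len(seg_a), len(seg_b))

print(path_similarity("/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
                      "/phpMyAdmin-2.5.6-rc2/scripts/setup.php"))  # ~0.98
print(path_similarity("/robots.txt", "/bot.html"))                 # ~0.56

With a threshold around 0.9, the GROUP 3 paths cluster together while the single-file groups stay apart.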

How to count the number of sentences in a text in R?

佐手、 submitted on 2019-12-09 16:51:56
Question: I read a text into R using the readChar() function. I aim to test the hypothesis that the sentences of the text contain as many occurrences of the letter "a" as of the letter "b". I recently discovered the {stringr} package, which helped me a great deal to do useful things with my text, such as counting the number of characters and the total number of occurrences of each
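The question is R-specific, but the missing step (splitting running text into sentences before counting letters per sentence) is language-neutral. A minimal Python sketch with a naive regex splitter, offered as an assumption of one workable approach (real sentence boundaries, e.g. abbreviations, need more care):

import re

text = "Dr. Smith arrived. A cab waited! Was it late?"
# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
# Note it wrongly splits after "Dr." -- a known limitation of the rule.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
print(len(sentences))
for s in sentences:
    print(s.count("a"), s.count("b"), s)

In R the equivalent idea is strsplit() on the same pattern with perl = TRUE.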

data frame of tfidf with python

天涯浪子 submitted on 2019-12-09 16:37:47
Question: I have to classify some sentiments. My data frame looks like this:

Phrase                   Sentiment
is it good movie         positive
wooow is it very goode   positive
bad movie                negative

I did some preprocessing (tokenisation, stop words, stemming, etc.) and I get:

Phrase                          Sentiment
[good, movie]                   positive
[wooow, is, it, very, good]     positive
[bad, movie]                    negative

I finally need to get a data frame in which the rows are the texts, the values are the tf-idf scores, and the columns are the words, like that: good movie wooow very bad
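The question trails off at the desired column list. A hedged sketch of the standard route with scikit-learn's TfidfVectorizer plus pandas; the toy data mirrors the question, while the vectorizer settings are my assumptions (older scikit-learn versions spell the last call get_feature_names()):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "Phrase": ["is it good movie", "wooow is it very goode", "bad movie"],
    "Sentiment": ["positive", "positive", "negative"],
})

vec = TfidfVectorizer()               # tokenisation and lowercasing built in
X = vec.fit_transform(df["Phrase"])   # sparse document-term matrix
tfidf = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(tfidf.round(2))

Each row is a phrase, each column a word, and each cell a tf-idf weight, which is the shape the question asks for; the Sentiment column then serves as the classifier label.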

R tm package create matrix of N most frequent terms

岁酱吖の submitted on 2019-12-09 11:10:42
Question: I have a TermDocumentMatrix created using the tm package in R. I'm trying to create a matrix/data frame that has the 50 most frequently occurring terms. When I try to convert to a matrix I get this error:

> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb

So I tried converting to a sparse matrix using the Matrix package:

> A <- as(mydata.dtm, "sparseMatrix")
Error in as(from, "CsparseMatrix") :
  no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
> B
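The question is about R's tm, but the memory lesson is general: never densify the full matrix; compute per-term totals on the sparse representation and keep only the top-N columns. A hedged Python sketch of that idea with scikit-learn and numpy (the toy data and top_n = 2 stand in for the 50 terms in the question):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "a dog ran"]
vec = CountVectorizer()
X = vec.fit_transform(docs)               # sparse documents x terms matrix

# Sum counts per term on the sparse matrix -- no dense conversion needed.
freqs = np.asarray(X.sum(axis=0)).ravel()
terms = vec.get_feature_names_out()

top_n = 2
top_idx = np.argsort(freqs)[::-1][:top_n]
print(dict(zip(terms[top_idx], freqs[top_idx])))

# Densify only the N selected columns, which is small by construction.
top_matrix = X[:, top_idx].toarray()

On the tm side, the analogous trick is findFreqTerms() or slam::row_sums() on the TermDocumentMatrix before any as.matrix() call.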

How to scrape web content and then count frequencies of words in R?

吃可爱长大的小学妹 submitted on 2019-12-09 07:33:31
Question: This is my code:

library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
traverse_each_page <- function(x){
  tmp <- htmlParse(x)
  xpathApply(tmp, '//div[@id="mainContent"]')
}
pages <- lapply(titles[2:3], traverse_each_page)

Here is the pseudocode:
1. Take an xml document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link
3. Parse the html
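The pseudocode is cut off; the remaining steps are presumably extracting the text and counting word frequencies. A hedged Python sketch of the whole pipeline under that assumption, using only the standard library (the sitemap URL comes from the question; the crude regex tag-stripping is my stand-in for the //div[@id="mainContent"] XPath):

import re
from collections import Counter
from urllib.request import urlopen
from xml.etree import ElementTree

# Step 1: fetch the sitemap and collect the <loc> URLs.
sitemap = urlopen("http://www.jamesaltucher.com/sitemap.xml").read()
root = ElementTree.fromstring(sitemap)
links = [el.text for el in root.iter() if el.tag.endswith("loc")]

# Steps 2-4: fetch each page, strip markup, tally word frequencies.
counts = Counter()
for url in links[1:3]:                    # mirrors titles[2:3] in the R code
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", html)  # naive tag stripping (assumption)
    counts.update(re.findall(r"[a-z']+", text.lower()))

print(counts.most_common(10))

Note that R's titles[2:3] is 1-indexed (elements 2 and 3), hence links[1:3] in Python.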

What is the meaning of 'cut-off' and 'iteration' for trainings in OpenNLP?

泪湿孤枕 submitted on 2019-12-08 16:53:38
Question: What is the meaning of cut-off and iteration for training in OpenNLP, or, for that matter, in natural language processing? I just need a layman's explanation of these terms. As far as I can tell, iteration is the number of times the algorithm is repeated, and cut-off is a value such that if a text scores above this cut-off for some specific category, it will get mapped to that category. Am I right?

Answer 1: Correct, the term iteration refers to the general notion of iterative algorithms, where one sets
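The answer breaks off, but the two knobs are easy to show in miniature. A hedged toy sketch in Python: cutoff is a minimum feature frequency (features seen fewer times than the cutoff are discarded before training, which is the role the parameter plays in OpenNLP's maxent trainer), and iterations is how many optimisation passes run. The update rule below is a deliberately dumb placeholder, not OpenNLP's algorithm:

from collections import Counter

def train(events, cutoff=5, iterations=100):
    # cutoff: keep only features that occur at least `cutoff` times.
    feature_counts = Counter(f for feats, _ in events for f in feats)
    vocab = {f for f, c in feature_counts.items() if c >= cutoff}

    weights = {f: 0.0 for f in vocab}
    # iterations: number of passes the iterative optimiser makes over
    # the data, each pass nudging the weights a little further.
    for _ in range(iterations):
        for feats, label in events:
            for f in feats:
                if f in weights:
                    weights[f] += 0.01 if label == "pos" else -0.01
    return weights

events = [(["good", "movie"], "pos"), (["bad", "movie"], "neg")]
print(train(events, cutoff=1, iterations=10))

So the asker's guess about iteration is right, while in OpenNLP the cutoff is a frequency threshold on features, not a score threshold on categories.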

How can I process Chinese/Japanese characters with R [closed]

▼魔方 西西 submitted on 2019-12-08 12:56:28
Question: I would like to be able to use a tm-like package to split and identify non-English characters (mainly Japanese/Thai/Chinese) with R. What I would like to do is convert the text into some sort of matrix-like format and then run a Random Forest/logistic regression for text classification. Is there any
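The question asks about R, but the core difficulty is language-neutral: CJK text has no spaces, so whitespace tokenisers produce nothing usable. A hedged Python sketch of one dictionary-free workaround, character n-grams feeding a document-term matrix (the two toy sentences are my own examples):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["我喜欢这部电影", "这部电影很糟糕"]  # toy Chinese sentences

# With no spaces to split on, use character bigrams as "terms".
vec = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vec.fit_transform(docs)
dtm = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(dtm)

The resulting matrix can be fed straight into a random forest or logistic regression, which is the classification step the question mentions.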

How to extract word frequency from document-term matrix?

旧时模样 submitted on 2019-12-08 12:38:53
Question: I am doing LDA analysis with Python, and I used the following code to create a document-term matrix:

corpus = [dictionary.doc2bow(text) for text in texts]

Is there an easy way to count the word frequency over the whole corpus? Since I do have the dictionary, which is a term-id list, I think I can match the word frequency with the term-id.

Answer 1: You can use nltk in order to count word frequency in string texts:

from nltk import FreqDist
import nltk
texts = 'hi there hello there'
words = nltk
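Since each doc2bow() document is just a list of (token_id, count) pairs, the corpus-wide frequency is a sum over those pairs followed by a lookup back through the dictionary. A hedged sketch assuming the gensim Dictionary API:

from collections import Counter
from gensim.corpora import Dictionary

texts = [["hi", "there"], ["hello", "there"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Sum the (token_id, count) pairs across all documents.
totals = Counter()
for doc in corpus:
    for token_id, count in doc:
        totals[token_id] += count

# Map token ids back to words via the dictionary.
print({dictionary[token_id]: n for token_id, n in totals.items()})
# {'hi': 1, 'there': 2, 'hello': 1}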

Remove stopwords and tolower function slow on a Corpus in R

纵饮孤独 submitted on 2019-12-08 06:37:36
Question: I have a corpus with roughly 75 MB of data. I am trying to use the following commands:

tm_map(doc.corpus, removeWords, stopwords("english"))
tm_map(doc.corpus, tolower)

These two functions alone take at least 40 minutes to run. I am looking to speed up the process, as I use the tdm matrix for my model. I have tried commands like gc() and memory.limit(10000000) very frequently, but I am not able to speed things up. I have a system with 4 GB RAM and am running a local database to read the
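The question is about R's tm, but the general speed lesson transfers: lowercase each document once and test words against a hash set, not a list. A hedged Python sketch of that pattern (the six-word stoplist is a placeholder for a full English list):

import re

stopwords = {"the", "and", "is", "a", "of", "to"}   # placeholder stoplist

def clean(doc):
    # Lowercase once, tokenise, and filter with O(1) set membership;
    # repeated per-word scans over a stopword list are what get slow.
    return [w for w in re.findall(r"[a-z']+", doc.lower())
            if w not in stopwords]

print(clean("The movie IS good and the plot of it is tight."))

On the R side, a common speed-up with tm is wrapping base functions via content_transformer() and processing large corpora in chunks rather than leaning on memory.limit().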

Do we need to use Stopwords filtering before POS Tagging?

﹥>﹥吖頭↗ submitted on 2019-12-08 05:55:22
Question: I am new to text mining and NLP-related stuff. I am working on a small project where I am trying to extract information out of a few documents. I am basically doing POS tagging and then using a chunker to find patterns based on the tagged words. Do I need to remove stopwords before doing this POS tagging? Will using stopwords affect my POS tagger's accuracy?

Answer 1: Let's use this as an example to train/test a tagger. First get the corpus and stoplist:

>>> import nltk
>>> nltk.download(
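The answer is cut off, but the point it is building toward can be shown directly: a POS tagger relies on the context that stopwords provide, so filtering them first tends to hurt tagging accuracy; filter after tagging if you must. A hedged NLTK sketch (it assumes the averaged_perceptron_tagger resource is downloaded; the two-word stop set is my own):

import nltk
# One-time setup (resource name as in recent NLTK releases, an assumption):
# nltk.download("averaged_perceptron_tagger")

sent = "The can is empty".split()
print(nltk.pos_tag(sent))        # context lets "can" be tagged as a noun

no_stop = [w for w in sent if w.lower() not in {"the", "is"}]
print(nltk.pos_tag(no_stop))     # with the context gone, "can" tends to
                                 # be tagged as a modal verb instead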