text-mining

R Text Mining: Counting the number of times a specific word appears in a corpus?

╄→гoц情女王★ Submitted on 2019-11-29 23:27:01
Question: I have seen this question answered in other languages but not in R. [Specifically for R text mining.] I have a set of frequent phrases that was obtained from a corpus. Now I would like to search for the number of times these phrases appear in another corpus. Is there a way to do this in the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple of thousand sub-texts. I want to find out how many…
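A minimal sketch of one way this could be done (not from the original thread): flatten each document of CorpusB to plain text and count fixed-string matches of every phrase with stringr. The tags and corpora below are illustrative stand-ins.

library(tm)
library(stringr)

# Illustrative data; in practice 'tags' would come from CorpusA and CorpusB is the large corpus
tags    <- c("text mining", "machine learning")
CorpusB <- VCorpus(VectorSource(c("text mining in R",
                                  "more text mining and machine learning")))

# Flatten each document to a single character string
texts  <- vapply(CorpusB, function(d) paste(as.character(d), collapse = " "), character(1))

# Total number of hits per phrase across all documents
counts <- vapply(tags, function(p) sum(str_count(texts, fixed(p))), numeric(1))
counts   # text mining: 2, machine learning: 1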

Use R to convert PDF files to text files for text mining

允我心安 Submitted on 2019-11-29 22:24:21
I have nearly one thousand PDF journal articles in a folder. I need to text-mine the abstracts of all the articles in that folder. Right now I am doing the following:

dest <- "~/A1.pdf"
# set path to pdftotext.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)
# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

With this I convert one PDF file to one .txt file, then copy the abstract into another .txt file and compile everything manually. This work is…
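A hedged sketch of how the same pdftotext call could simply be looped over every PDF in the folder (the executable path and folder name are illustrative and must be adjusted); the pdftools package's pdf_text() is a pure-R alternative that avoids the external executable.

exe  <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"    # adjust to your install
pdfs <- list.files("~/articles", pattern = "\\.pdf$", full.names = TRUE)  # folder of PDFs
for (dest in pdfs) {
  # writes e.g. A1.txt next to A1.pdf
  system(paste0('"', exe, '" "', dest, '"'), wait = TRUE)
}
txts <- sub("\\.pdf$", ".txt", pdfs)   # names of the generated text files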

How to determine the (natural) language of a document?

北战南征 Submitted on 2019-11-29 22:24:12
I have a set of documents in two languages: English and German. There is no usable meta-information about these documents; a program can look at the content only. Based on that, the program has to decide which of the two languages each document is written in. Is there any "standard" algorithm for this problem that can be implemented in a few hours? Or, alternatively, a free .NET library or toolkit that can do this? I know about LingPipe, but it is Java and not free for "semi-commercial" usage. This problem seems to be surprisingly hard. I checked out the Google AJAX Language API (which I found…
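One commonly cited baseline is to count language-specific function words (or character n-grams) and pick the language with more hits. Purely as an illustration of that idea, here is a toy sketch in R (the digest's main language), not a .NET solution; the word lists are deliberately tiny.

english <- c("the", "and", "of", "to", "is", "in", "that", "it")
german  <- c("der", "die", "das", "und", "ist", "nicht", "ein", "zu")

guess_language <- function(text) {
  # split on non-letters, lower-case, and count matches against each word list
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  if (sum(words %in% english) >= sum(words %in% german)) "English" else "German"
}

guess_language("Das ist ein Test und nicht mehr")   # "German"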

Save and reuse TfidfVectorizer in scikit learn

左心房为你撑大大i Submitted on 2019-11-29 18:24:22
Question: I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object so I can reuse it later. I tried to use pickle, but it gave the following error.

loc = open('vectorizer.obj', 'w')
pickle.dump(self.vectorizer, loc)
*** TypeError: can't pickle instancemethod objects

I tried using joblib from sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later? Here is my full object:

class changeToMatrix(object):

Extract text from search result URLs using R

人盡茶涼 Submitted on 2019-11-29 17:35:45
I know R a bit, but I am not a pro. I am working on a text-mining project using R. I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results has the URL: https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation. This page has 10 search results (10 URLs). I want to write code in R that will 'read' the page corresponding to each of those 10 URLs and extract the text from those web pages to .txt files. My only input is the URL mentioned above. I appreciate your help. If there is any similar older post, please…
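A rough sketch of what the loop could look like with rvest. The CSS selector below ("a") is only a placeholder: the real selector for the 10 result links has to be found by inspecting the search page, and the URL filtering step is an assumption.

library(rvest)

search_url <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"
page <- read_html(search_url)

# Placeholder link extraction; refine the selector/filter for the actual result list
links <- html_attr(html_elements(page, "a"), "href")
links <- links[grepl("^https?://", links)]

for (i in seq_along(links)) {
  txt <- html_text2(read_html(links[i]))        # visible text of the result page
  writeLines(txt, paste0("result_", i, ".txt"))
}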

Adding custom stopwords in R tm

白昼怎懂夜的黑 Submitted on 2019-11-29 17:10:48
Question: I have a corpus in R using the tm package. I am applying the removeWords function to remove stopwords:

tm_map(abs, removeWords, stopwords("english"))

Is there a way to add my own custom stop words to this list?

Answer 1: stopwords just provides you with a character vector of words; just c()ombine your own ones with it.

tm_map(abs, removeWords, c(stopwords("english"), "my", "custom", "words"))

Answer 2: Save your custom stop words in a csv file (e.g. word.csv).

library(tm)
stopwords <- read.csv("word.csv", header =…
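Answer 2 is cut off above; a plausible completion, assuming word.csv holds one word per line with no header, would be something like:

library(tm)
custom <- read.csv("word.csv", header = FALSE, stringsAsFactors = FALSE)$V1  # one word per line
abs <- tm_map(abs, removeWords, c(stopwords("english"), custom))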

How to extract textual contents from a web page? [closed]

泪湿孤枕 Submitted on 2019-11-29 15:51:58
Question: I'm developing an application in Java which can take textual information from different web pages and summarize it into one page. For example, suppose I have news about the same story on different web pages like Hindu, Times of…

Removing overly common words (occur in more than 80% of the documents) in R

南楼画角 Submitted on 2019-11-29 14:55:40
Question: I am working with the 'tm' package in R to create a corpus. I have done most of the preprocessing steps. The remaining step is to remove overly common words (terms that occur in more than 80% of the documents). Can anybody help me with this?

dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, otherWords1)
dsc <- tm_map(dsc, removeWords, otherWords2)
dsc <- tm_map(dsc, removeWords,…
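One way to do this (a sketch, not the thread's accepted answer) is to build a document-term matrix, compute each term's document frequency, and feed the overly common terms back into removeWords; dsc is assumed to be the corpus from the snippet above.

library(tm)
dtm        <- DocumentTermMatrix(dsc)
doc_freq   <- colSums(as.matrix(dtm) > 0)          # number of documents containing each term
too_common <- names(doc_freq)[doc_freq / nDocs(dtm) > 0.8]
dsc        <- tm_map(dsc, removeWords, too_common)
# Note: as.matrix() can be memory-hungry for very large corpora.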

R Regular Expression Lookbehind

倾然丶 夕夏残阳落幕 Submitted on 2019-11-29 13:52:44
I have a vector filled with strings of the following format: <year1><year2><id1><id2>. The first entries of the vector look like this:

199719982001
199719982002
199719982003
199719982003

For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001. I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721. I have tried doing this with the stringr package and created the following regex: "^\\d{4}|\\d{1}(?<=\\d{3}$)" to pull out year1 and id1; however, when using the lookbehind I…
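Because the fields are fixed-width, an alternative sketch avoids the lookbehind entirely and just takes substrings with stringr (the positions assume the <year1><year2><id1><id2> layout described above):

library(stringr)

x     <- c("199719982001", "199719982002", "199719982003")
year1 <- str_sub(x, 1, 4)                     # "1997"
id1   <- str_sub(x, 9, 9)                     # "2"
id2   <- sub("^0+", "", str_sub(x, 10, 12))   # drop leading zeros: "001" -> "1"
paste0(year1, id1, id2)                       # "199721" "199722" "199723"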

Which NLP toolkit to use in JAVA? [closed]

孤街浪徒 Submitted on 2019-11-29 10:34:21
I'm working on a project that consists of a website that connects to NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results. I'm using Java for the text mining and AJAX with ICEfaces for the development of the website. What I have: a list of articles returned from a search, where each article has an ID and an abstract. The idea is to get keywords from each abstract text, then compare all the keywords from all the abstracts and find the ones that are the most repeated, so as to then show in the…