text-mining

Really fast word ngram vectorization in R

Deadly · submitted on 2019-11-28 18:25:14
edit: The new package text2vec is excellent, and solves this problem (and many others) really well.

text2vec on CRAN
text2vec on github
vignette that illustrates ngram tokenization

I have a pretty large text dataset in R, which I've imported as a character vector:

    # Takes about 15 seconds
    system.time({
      set.seed(1)
      samplefun <- function(n, x, collapse) {
        paste(sample(x, n, replace = TRUE), collapse = collapse)
      }
      words <- sapply(rpois(10000, 3) + 1, samplefun, letters, '')
      sents1 <- sapply(rpois(1000000, 5) + 1, samplefun, words, ' ')
    })

I can convert this character data to a bag-of-words …
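A minimal sketch of the ngram vectorization the edit points to, using the text2vec functions its vignette documents (itoken, create_vocabulary, create_dtm); sents1 is the character vector generated above:

    library(text2vec)

    # Tokenize and build a unigram + bigram vocabulary
    it <- itoken(sents1, tokenizer = word_tokenizer)
    vocab <- create_vocabulary(it, ngram = c(1L, 2L))

    # Iterators are consumed once used, so recreate before building the DTM
    it <- itoken(sents1, tokenizer = word_tokenizer)
    dtm <- create_dtm(it, vocab_vectorizer(vocab))

The result is a sparse document-term matrix, which is what keeps this fast even on a million short sentences.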

R text file and text mining…how to load data

寵の児 · submitted on 2019-11-28 17:58:25
I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand the documentation on how to load a text file and create the necessary objects to start using features such as:

    stemDocument(x, language = map_IETF(Language(x)))

So assume that this is my doc: "this is a test for R load". How do I load the data for text processing and create the object x?

Like @richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document-term matrix:

    library(tm)
    # load text …
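Since the excerpt cuts off there, here is a minimal sketch of one common route from raw text to a document-term matrix with tm, using the single test sentence in place of a real file (readLines() could supply the vector from disk):

    library(tm)

    # One document as a character vector; VectorSource wraps it for tm
    doc <- "this is a test for R load"
    corpus <- VCorpus(VectorSource(doc))

    # The x in stemDocument(x, ...) is an individual document from the corpus
    x <- corpus[[1]]

    dtm <- DocumentTermMatrix(corpus)
    inspect(dtm)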

How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

别等时光非礼了梦想. · submitted on 2019-11-28 17:23:53
Question: So essentially I'm looking for a specific 4-digit code within two angle brackets in a text file. I know that I need to open the text file and then parse it line by line, but I am not sure of the best way to structure my code after checking "for line in file". I think I can somehow split it, strip it, or partition it, but I also wrote a regex which I compiled, and if that returns a match object I don't think I can use it with those string-based operations. Also I'm …
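A minimal sketch under the stated setup, assuming the codes look like <1234> ("input.txt" is a placeholder file name); findall on a compiled pattern sidesteps the match-object issue by returning plain strings:

    import re

    # Capture exactly four digits between angle brackets
    pattern = re.compile(r"<(\d{4})>")

    codes = []
    with open("input.txt") as f:
        for line in f:
            # findall returns the captured groups as strings, one per match
            codes.extend(pattern.findall(line))

    print(codes)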

Make dataframe of top N frequent terms for multiple corpora using tm package in R

烈酒焚心 · submitted on 2019-11-28 17:06:18
I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

    corpus1   corpus2
    "beach"   "city"
    "sand"    "sidewalk"
    ...       ...
    [10th most frequent word]

By definition, findFreqTerms(corpus1, N) returns all of the terms which appear N times or more. To do this by hand I could change N until I got about 10 terms back, but the output of findFreqTerms is listed alphabetically, so unless I picked exactly the right N, I wouldn't actually know which were the top 10. I suspect that …
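A sketch that sidesteps findFreqTerms entirely, assuming tdm1 and tdm2 are the existing TermDocumentMatrix objects (the names are placeholders):

    library(tm)
    library(slam)  # tm stores its matrices in slam's sparse format

    # Sum each term's frequency across documents, sort, keep the top n names
    top_terms <- function(tdm, n = 10) {
      freqs <- sort(row_sums(tdm), decreasing = TRUE)
      names(head(freqs, n))
    }

    data.frame(corpus1 = top_terms(tdm1), corpus2 = top_terms(tdm2))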

What is CoNLL data format?

独自空忆成欢 · submitted on 2019-11-28 16:19:38
I am new to text mining. I am using an open-source jar (Mate Parser) which gives me output in the CoNLL 2009 format after dependency parsing. I want to use the dependency-parsing results for information extraction, but while I can understand some of the output, I am not able to comprehend the CoNLL data format. Can anyone help me understand the CoNLL data format? Any kind of pointers would be appreciated.

There are many different CoNLL formats, since CoNLL is a different shared task each year. The format for CoNLL 2009 is described here. Each line represents a single word with a …
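An illustration of the layout, based on the CoNLL-2009 shared-task description rather than actual Mate Parser output: each token gets one tab-separated line with columns ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED plus one APRED column per predicate, and sentences are separated by blank lines. Abridged to a few columns, a parsed sentence might look like:

    # ID  FORM    LEMMA   POS  HEAD  DEPREL
    1     John    john    NNP  2     SBJ
    2     sleeps  sleep   VBZ  0     ROOT

HEAD gives the ID of each token's syntactic head (0 for the root), and DEPREL names the dependency relation, which is usually the pair you need for information extraction.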

tweepy Streaming API: full text

元气小坏坏 · submitted on 2019-11-28 13:53:42
I am using the tweepy streaming API to get the tweets containing a particular hashtag. The problem I am facing is that I am unable to extract the full text of the tweet from the streaming API: only 140 characters are available, and after that it gets truncated. Here is the code:

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    def analyze_status(text):
        if 'RT' in text[0:3]:
            return True
        else:
            return False

    class MyStreamListener(tweepy.StreamListener):
        def on_status(self, status):
            if not analyze_status(status …
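A sketch of the usual fix: in the v1.1 streaming API, tweets longer than 140 characters carry their text in extended_tweet['full_text'] rather than text, so check for that attribute inside on_status:

    def get_full_text(status):
        # Extended (280-character) tweets expose the untruncated text here;
        # older-style short tweets only have the plain text attribute
        if hasattr(status, 'extended_tweet'):
            return status.extended_tweet['full_text']
        return status.text

Retweets may additionally need the same check on status.retweeted_status, since the RT wrapper itself is truncated.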

Extract text from search result URLs using R

流过昼夜 · submitted on 2019-11-28 12:26:44
Question: I know R a bit, but I'm not a pro. I am working on a text-mining project using R. I searched the Federal Reserve website with a keyword, say 'inflation'. The second page of the search results has the URL: https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation. This page has 10 search results (10 URLs). I want to write code in R which will 'read' the page corresponding to each of those 10 URLs and extract the text from those web pages into .txt files. My only …
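A sketch with rvest, under two assumptions: the result links can be narrowed with the right CSS selector (the bare "a" below is a placeholder; inspect the page to find the real one), and the hrefs are absolute URLs:

    library(rvest)

    search_url <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"
    page <- read_html(search_url)

    # Placeholder selector: replace "a" with the selector for result links
    links <- html_attr(html_elements(page, "a"), "href")

    for (i in seq_along(links)) {
      txt <- html_text2(read_html(links[i]))
      writeLines(txt, sprintf("result_%02d.txt", i))
    }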

findAssocs for multiple terms in R

半世苍凉 · submitted on 2019-11-28 10:34:58
In R I used the tm package for building a term-document matrix from a corpus of documents. My goal is to extract word associations from all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking for a variable that holds all row.names of the matrix, so the function findAssocs() can do its job. This is my code so far:

    library(tm)
    library(RWeka)

    txtData <- read.csv("file.csv", header = T, sep = ",")
    txtCorpus <- Corpus(VectorSource(txtData$text))

    # ...further preprocessing

    # Tokenizer for n-grams, passed on to the term-document matrix …
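A sketch of the last step, assuming tdm is the finished bigram TermDocumentMatrix (the name is a placeholder): rownames(tdm) holds every bigram, and findAssocs() accepts that whole vector at once:

    # All bigrams in the matrix
    terms <- rownames(tdm)

    # Named list with one entry of correlated terms per bigram;
    # corlimit = 0.2 is an arbitrary threshold, tune as needed
    assocs <- findAssocs(tdm, terms, corlimit = 0.2)

    # Keep the top three associations per term (findAssocs sorts decreasingly)
    top3 <- lapply(assocs, head, 3)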

R Regular Expression Lookbehind

喜夏-厌秋 · submitted on 2019-11-28 08:03:20
Question: I have a vector filled with strings of the following format: <year1><year2><id1><id2>. The first entries of the vector look like this:

    199719982001
    199719982002
    199719982003
    199719982003

For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001. I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721. I have tried doing this with the stringr package, and created the …
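A base-R sketch of one such regex (the question mentions stringr; regexec/regmatches is the base-R equivalent), assuming the zeros to drop in id2 are always leading zeros:

    x <- c("199719982001", "199719982002", "199719982003")

    # Capture year1, id1, and id2 with its leading zeros stripped
    m <- regmatches(x, regexec("^(\\d{4})\\d{4}(\\d)0*(\\d+)$", x))
    sapply(m, function(g) paste0(g[2], g[3], g[4]))
    # "199721" "199722" "199723"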

How to create a good NER training model in OpenNLP?

霸气de小男生 · submitted on 2019-11-28 07:32:29
I have just started with OpenNLP. I need to create a simple training model to recognize named entities. Reading the docs here https://opennlp.apache.org/docs/1.8.0/apidocs/opennlp-tools/opennlp/tools/namefind I see this simple text to train the model:

    <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
    Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
    <START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British …
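A sketch of training on that annotated format, based on the OpenNLP 1.8 namefind API the question links to ("train.txt" is a placeholder for a file of such sentences, one per line, with blank lines between documents):

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.*;

    public class TrainNer {
        public static void main(String[] args) throws Exception {
            // Read annotated lines and parse the <START:person> ... <END> spans
            ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("train.txt")),
                StandardCharsets.UTF_8);
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            // Train a "person" finder with default parameters
            TokenNameFinderModel model = NameFinderME.train(
                "en", "person", samples,
                TrainingParameters.defaultParams(),
                new TokenNameFinderFactory());

            // Persist the model for later use with NameFinderME
            try (OutputStream out = new BufferedOutputStream(
                     new FileOutputStream("en-ner-person.bin"))) {
                model.serialize(out);
            }
        }
    }

Note that the tiny three-sentence sample above is only a format illustration; a usable model needs a much larger annotated corpus.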