text-mining

R text file and text mining…how to load data

被刻印的时光 ゝ submitted on 2019-11-27 10:58:13
Question: I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand from the documentation how to load a text file and create the necessary objects to start using features such as stemDocument(x, language = map_IETF(Language(x))). So assume that this is my doc: "this is a test for R load". How do I load the data for text processing and create the object x?

Answer 1: Like @richiemorrisroe I found this poorly documented. Here
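The answer above is cut off; as a minimal sketch of the usual loading path (assuming the tm and SnowballC packages, and a character vector rather than a file on disk):

library(tm)
library(SnowballC)  # supplies the stemmer that stemDocument uses

doc <- "this is a test for R load"
x <- VCorpus(VectorSource(doc))           # one document per element of the vector
x <- tm_map(x, content_transformer(tolower))
x <- tm_map(x, stemDocument, language = "english")
inspect(x)

To load files from a directory instead, DirSource("path/to/texts") can replace VectorSource.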

Make dataframe of top N frequent terms for multiple corpora using tm package in R

旧巷老猫 submitted on 2019-11-27 10:10:52
Question: I have several TermDocumentMatrix objects created with the tm package in R. I want to find the 10 most frequent terms in each set of documents, to ultimately end up with an output table like:

corpus1    corpus2
"beach"    "city"
"sand"     "sidewalk"
...        ...
[10th most frequent word]

By definition, findFreqTerms(corpus1, N) returns all of the terms which appear N times or more. To do this by hand I could change N until I got 10 or so terms returned, but the output for findFreqTerms is listed alphabetically so
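Since the answer is cut off, a sketch of one common approach (an assumption here, not the thread's accepted answer): total each term's counts across documents and sort, rather than hunting for the right N:

library(tm)
library(slam)

top_terms <- function(tdm, n = 10) {
  freqs <- slam::row_sums(tdm)               # total count of each term across all documents
  names(sort(freqs, decreasing = TRUE))[seq_len(n)]
}

# with hypothetical matrices tdm1, tdm2 for the two corpora:
# data.frame(corpus1 = top_terms(tdm1), corpus2 = top_terms(tdm2))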

What is CoNLL data format?

折月煮酒 submitted on 2019-11-27 09:50:14
Question: I am new to text mining. I am using an open source jar (Mate Parser) which gives me output in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction. I am able to understand some of the output but cannot comprehend the CoNLL data format. Can anyone help me understand the CoNLL data format? Any kind of pointers would be appreciated.

Answer 1: There are many different CoNLL formats since CoNLL is a different shared
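The answer is truncated; for orientation, CoNLL 2009 is a tab-separated, one-token-per-line format with a blank line between sentences. A hedged R sketch for loading it (the file name is hypothetical; column names follow the CoNLL-2009 shared task description):

# 14 fixed columns, then one APRED column per predicate in the sentence;
# preallocate generous column names because the APRED count varies.
conll <- read.delim("output.conll", header = FALSE, quote = "",
                    fill = TRUE, blank.lines.skip = TRUE,
                    col.names = paste0("V", 1:20), stringsAsFactors = FALSE)
names(conll)[1:14] <- c("ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS",
                        "FEAT", "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL",
                        "FILLPRED", "PRED")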

Recognize PDF table using R

北慕城南 submitted on 2019-11-27 08:39:00
I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize and extract only tables?

Awesome question, I wondered about the same thing recently, thanks! I did it with tabulizer '0.2.2', as @hrbrmstr suggests too. If you are using R version 3.5.2, I'm providing the following solution. Install the three packages in a specific order: # install.packages("rJava") # library(rJava) # load and attach 'rJava' now # install
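The answer breaks off mid-installation; a sketch of the typical tabulizer call once rJava, tabulizer, and its jar dependency are installed (the PDF path is hypothetical, and a working Java setup is assumed):

library(rJava)
library(tabulizer)

tables <- extract_tables("report.pdf")                 # one matrix per table detected
df <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)
head(df)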

tweepy Streaming API: full text

╄→гoц情女王★ submitted on 2019-11-27 07:56:28
Question: I am using the tweepy streaming API to get the tweets containing a particular hashtag. The problem I am facing is that I am unable to extract the full text of the tweet from the Streaming API; only 140 characters are available, and after that the text gets truncated. Here is the code:

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

def analyze_status(text):
    # flag retweets, which start with "RT"
    return 'RT' in text[0:3]
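No answer survives the truncation; the usual fix (a sketch assuming tweepy 3.x and Twitter's extended-tweet payload; the listener class name is hypothetical, and api is reused from the question's code) is to read extended_tweet['full_text'] when a status is truncated:

import tweepy

class FullTextListener(tweepy.StreamListener):
    def on_status(self, status):
        # truncated statuses carry the untruncated text in extended_tweet
        if hasattr(status, 'extended_tweet'):
            text = status.extended_tweet['full_text']
        else:
            text = status.text
        print(text)

stream = tweepy.Stream(auth=api.auth, listener=FullTextListener())
stream.filter(track=['#example'])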

Count common words in two strings

时光总嘲笑我的痴心妄想 submitted on 2019-11-27 06:14:42
Question: I have two strings:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

I am looking to get a count of the common words between these strings. The answer should be 3, with "Roy", "travels", and "Africa" being the common words. This is what I tried:

stra <- as.data.frame(t(read.table(textConnection(a), sep = " ")))
strb <- as.data.frame(t(read.table(textConnection(b), sep = " ")))

Taking unique values to avoid repeat counting:

stra_unique <- as.data.frame(unique(stra$V1))
strb
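The question is cut off before its answers; a much shorter route (a sketch, not the thread's posted solution) is to split on spaces and intersect the word sets, since intersect() already drops duplicates:

a <- "Roy lives in Japan and travels to Africa"
b <- "Roy travels Africa with this wife"

common <- intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]])
common          # "Roy" "travels" "Africa"
length(common)  # 3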

Visualise distances between texts

*爱你&永不变心* submitted on 2019-11-27 05:44:45
Question: I'm working on a research project for school. I've written some text mining software that analyzes legal texts in a collection and spits out a score that indicates how similar they are. I ran the program to compare each text with every other text, and I have data like this (although with many more points):

codeofhammurabi.txt crete.txt 0.570737
codeofhammurabi.txt iraqi.txt 1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt 1.25546
crete.txt iraqi.txt 0.329545
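One standard way to turn pairwise distances into a picture (a sketch, not necessarily what the thread's answers proposed) is classical multidimensional scaling with cmdscale, after reshaping the long-format pairs into a symmetric matrix:

d <- read.table(stringsAsFactors = FALSE, text = "
codeofhammurabi.txt crete.txt 0.570737
codeofhammurabi.txt iraqi.txt 1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt 1.25546
crete.txt iraqi.txt 0.329545")

# build a symmetric distance matrix (the real data covers every pair;
# this excerpt leaves some cells at 0)
texts <- union(d$V1, d$V2)
m <- matrix(0, length(texts), length(texts), dimnames = list(texts, texts))
m[cbind(d$V1, d$V2)] <- d$V3
m <- m + t(m)

coords <- cmdscale(as.dist(m), k = 2)   # project the distances onto two dimensions
plot(coords, type = "n", xlab = "", ylab = "")
text(coords, labels = rownames(coords))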

bigrams instead of single words in term-document matrix using R and RWeka

孤者浪人 submitted on 2019-11-27 03:59:57
I've found a way to use bigrams instead of single tokens in a term-document matrix. The solution was posted on Stack Overflow here: findAssocs for multiple terms in R. The idea goes something like this:

library(tm)
library(RWeka)
data(crude)

# tokenizer for n-grams, passed to the term-document matrix constructor
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

However, the final line gives me the error:

Error in rep(seq_along(x), sapply(tflist, length)) : invalid 'times'
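A workaround often reported for this error (stated here as an assumption, since the answer isn't shown: tm's parallel mclapply backend does not get along with RWeka's Java-based tokenizer) is to force single-threaded processing first:

options(mc.cores = 1)  # keep tm from calling the Java tokenizer from forked workers
txtTdmBi <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi[1:5, 1:5])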

findAssocs for multiple terms in R

被刻印的时光 ゝ submitted on 2019-11-27 03:41:24
Question: In R I used the tm package for building a term-document matrix from a corpus of documents. My goal is to extract word associations from all bigrams in the term-document matrix and return the top three or so for each. Therefore I'm looking for a variable that holds all row.names from the matrix so the function findAssocs() can do its job. This is my code so far:

library(tm)
library(RWeka)
txtData <- read.csv("file.csv", header = T, sep = ",")
txtCorpus <- Corpus(VectorSource(txtData
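The code is cut off above; a sketch of the looping idea on tm's built-in crude data (an illustration, not the thread's accepted answer): Terms() returns every row name of the matrix, and findAssocs() accepts that whole vector at once:

library(tm)
data(crude)
tdm <- TermDocumentMatrix(crude)

all_terms <- Terms(tdm)                             # all row.names of the matrix
assocs <- findAssocs(tdm, all_terms, corlimit = 0.7)
top3 <- lapply(assocs, head, 3)                     # keep the top three per term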

Keep document ID with R corpus

做~自己de王妃 submitted on 2019-11-27 02:02:19
Question: I have searched Stack Overflow and the web and can only find partial solutions, or some that don't work due to changes in tm or qdap. The problem: I have a dataframe with columns ID and Text (a simple document id/name and then some text). I have two issues:

Part 1: How can I create a TDM or DTM and maintain the document name/id? It only shows "character(0)" on inspect(tdm).

Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus,
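The question is truncated, but both parts have standard answers worth sketching (assuming tm >= 0.7, where DataframeSource expects columns named doc_id and text; the sample data here is made up):

library(tm)

df <- data.frame(doc_id = c("doc1", "doc2"),
                 text   = c("apples and oranges", "oranges and bananas"),
                 stringsAsFactors = FALSE)

corpus <- VCorpus(DataframeSource(df))   # Part 1: IDs carried over from doc_id
keep <- c("apples", "oranges")           # Part 2: whitelist instead of stopwords
dtm <- DocumentTermMatrix(corpus, control = list(dictionary = keep))
inspect(dtm)                             # rows doc1/doc2, columns only the kept terms

Note that the dictionary option filters at the matrix-construction stage rather than inside the corpus itself, which is usually sufficient for downstream analysis.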