tm

How to save an R Corpus to disk

强颜欢笑 submitted on 2019-12-11 04:09:35

Question: I have a large R Corpus object, built with the tm package, made up of millions of small documents. How do I save it to disk as a single text file for use with other programs (such as word2vec)? I tried writeCorpus(myCorpus), but that wrote out a million tiny text files that blew up my Mac! I'm not very proficient in R, so any help on how to do this would be much, much appreciated. Thank you!

Answer 1: Try:

    writeLines(as.character(mycorpus), con = "mycorpus.txt")

But I don't know whether it will be efficient.
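A hedged sketch of a one-document-per-line variant of that answer, which keeps document boundaries intact for tools like word2vec (the corpus contents and output file name are illustrative):

```r
library(tm)

# Small stand-in for the corpus of millions of documents
my.corpus <- VCorpus(VectorSource(c("first small document",
                                    "second small document")))

# Flatten each document to a single string, then write one document per line
texts <- sapply(my.corpus, function(d) paste(as.character(d), collapse = " "))
writeLines(texts, con = "mycorpus.txt")
```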

R tm removeWords stopwords is not removing stopwords

▼魔方 西西 submitted on 2019-12-11 04:09:16

Question: I'm using the R tm package, and find that almost none of the tm_map functions that remove elements of text are working for me. By 'working' I mean, for example, that I'll run:

    d <- tm_map(d, removeWords, stopwords('english'))

but then when I run:

    ddtm <- DocumentTermMatrix(d, control = list(weighting = weightTfIdf, minWordLength = 2))
    findFreqTerms(ddtm, 10)

I still get:

    [1] the this

...etc., and a bunch of other stopwords. I see no error indicating something has gone wrong. Does anyone know what
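A common cause of this symptom (not confirmed by the truncated question) is case: stopwords('english') contains only lower-case forms, so capitalized tokens like "The" and "This" survive removeWords. A minimal sketch of the usual fix, lower-casing before removal:

```r
library(tm)

d <- VCorpus(VectorSource(c("The cat sat on the mat", "This is The end")))

# stopwords("english") is all lower-case, so lower-case the corpus first;
# content_transformer() wraps tolower so the result stays a valid document
d <- tm_map(d, content_transformer(tolower))
d <- tm_map(d, removeWords, stopwords("english"))
```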

R tm stemCompletion generates NA value

时间秒杀一切 submitted on 2019-12-11 02:16:14

Question: When I try to apply stemCompletion to a corpus, the function generates NA values. This is my code:

    my.corpus <- tm_map(my.corpus, removePunctuation)
    my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

(one result of this is: [[2584]] zoning plan)

The next step is stemming the corpus:

    my.corpus <- tm_map(my.corpus, stemDocument, language = "english")
    my.corpus <- tm_map(my.corpus, stemCompletion, dictionary = my.corpus_copy, type = "first")

but the result is: [[2584]] NA plant
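The NA typically appears because stemCompletion() expects a character vector of individual stems, not whole documents. A sketch of a per-word completion wrapper (the helper name is illustrative, and the fall-back behaviour for unmatched stems is an assumption to adapt):

```r
library(tm)

# Complete one stemmed document word by word against a dictionary
stemCompleteDoc <- function(doc, dictionary) {
  words <- unlist(strsplit(as.character(doc), "\\s+"))
  completed <- stemCompletion(words, dictionary = dictionary, type = "prevalent")
  # Keep the original stem when no completion is found, instead of NA
  missing <- is.na(completed) | completed == ""
  completed[missing] <- words[missing]
  PlainTextDocument(paste(completed, collapse = " "))
}

# Usage sketch against the question's objects:
# my.corpus <- tm_map(my.corpus, stemCompleteDoc, dictionary = my.corpus_copy)
```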

How to complete a stemmed corpus from a dictionary using stemCompletion function (tm package)

若如初见. submitted on 2019-12-10 18:42:42

Question: I am having trouble with the tm package in R (version 0.6.2). The following question (two different errors) has already been answered here and here, but I am still getting an error after applying the posted solutions. Please click here to download the dataset (93 rows only); it is a reproducible example. The two errors are below:

    (Resolved) Error in UseMethod("meta", x) :
      no applicable method for 'meta' applied to an object of class "character"

    Error: inherits(doc, "TextDocument") is not TRUE

tm
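Both errors usually mean a transformation returned bare character strings instead of TextDocument objects, which tm >= 0.6 requires. A minimal sketch of the standard workaround, wrapping plain functions in content_transformer() (corpus contents are illustrative):

```r
library(tm)

corpus <- VCorpus(VectorSource(c("running runner runs")))

# Custom character-level transformations must be wrapped so each element
# stays a TextDocument; this avoids the
# 'inherits(doc, "TextDocument") is not TRUE' error
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stemDocument, language = "english")
```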

R tm: reloading a 'PCorpus' backend filehash database as corpus (e.g. in restarted session/script)

心不动则不痛 submitted on 2019-12-10 15:58:14

Question: Having learned loads from answers on this site (thanks!), it's finally time to ask my own question. I'm using R (the tm and lsa packages) to create, clean, simplify, and then run LSA (latent semantic analysis) on a corpus of about 15,000 text documents. I'm doing this in R 3.0.0 under Mac OS X 10.6. For efficiency (and to cope with having too little RAM), I've been trying to use either the 'PCorpus' option in tm (a database backend supported by the 'filehash' package), or the newer 'tm
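A hedged sketch of working with the on-disk filehash database behind a PCorpus across sessions; the file name is illustrative, and the exact reload behaviour should be verified against your tm version:

```r
library(tm)
library(filehash)

# First session: create a persistent corpus backed by a filehash database
pc <- PCorpus(VectorSource(c("doc one", "doc two")),
              dbControl = list(dbName = "my_corpus.db", dbType = "DB1"))

# Later session: re-attach and inspect the saved database directly
db <- dbInit("my_corpus.db", type = "DB1")
dbList(db)  # keys, one per stored document
```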

NLP process for combining common collocations

不问归期 submitted on 2019-12-10 09:27:35

Question: I have a corpus that I'm using the tm package on in R (and I am also mirroring the same script in NLTK in Python). I'm working with unigrams, but would like a parser of some kind to combine words that commonly co-occur so they are treated as a single word; i.e., I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and instead see that pair represented as "New York" as if it were a single word, alongside the other unigrams. What is this process called, of transforming
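This is commonly called collocation extraction (or multiword-expression merging). A minimal pre-tokenization sketch, assuming the collocations are already known (the list and helper name here are illustrative):

```r
# Join known collocations with an underscore so downstream unigram tools
# treat them as single tokens
collocations <- c("New York" = "New_York", "ice cream" = "ice_cream")

combine_collocations <- function(text, pairs) {
  for (phrase in names(pairs)) {
    text <- gsub(phrase, pairs[[phrase]], text, fixed = TRUE)
  }
  text
}

combine_collocations("I moved to New York for ice cream", collocations)
# "I moved to New_York for ice_cream"
```

For discovering the collocations themselves (rather than merging a known list), statistical measures such as pointwise mutual information over bigram counts are the usual route.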

how to read text in a table from a csv file

心已入冬 submitted on 2019-12-10 08:20:53

Question: I am new to the tm package. I want to read a CSV file containing one column with 2,000 texts and a second column with a yes/no factor variable into a Corpus. My intention is to convert the text to a matrix and use the factor variable as the target for prediction. I would also need to divide the corpus into train and test sets. I have read several documents, such as tm.pdf, and found the documentation relatively limited. This is my attempt, following another thread on the same subject: TexTest<
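A hedged sketch of the whole pipeline: CSV in, document-term matrix out, with a train/test split that keeps the factor target aligned with the rows (the data frame here stands in for the real CSV):

```r
library(tm)

# Illustrative data standing in for: df <- read.csv("mydata.csv",
#                                                  stringsAsFactors = FALSE)
df <- data.frame(text   = c("good product", "bad service", "good service"),
                 target = factor(c("yes", "no", "yes")),
                 stringsAsFactors = FALSE)

corpus <- VCorpus(VectorSource(df$text))
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)

# 70/30 split; the target factor is subset with the same indices
set.seed(1)
idx <- sample(nrow(m), size = floor(0.7 * nrow(m)))
train_x <- m[idx, , drop = FALSE];  train_y <- df$target[idx]
test_x  <- m[-idx, , drop = FALSE]; test_y  <- df$target[-idx]
```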

A lemmatizing function using a hash dictionary does not work with tm package in R

佐手、 submitted on 2019-12-09 23:45:13

Question: I would like to lemmatize Polish text using a large external dictionary (in a format like the txt variable below). Unfortunately, the popular text mining packages do not offer a Polish option. The answer https://stackoverflow.com/a/45790325/3480717 by @DmitriySelivanov works well with a simple vector of texts (I have also removed the Polish diacritics from both the dictionary and the corpus), but unfortunately it does not work with the corpus format generated by tm.
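A sketch of adapting a vector-based lemmatizer to tm by wrapping it in content_transformer(); the tiny named-vector dictionary below is an illustrative stand-in for the large external one:

```r
library(tm)

# Illustrative word -> lemma lookup; a named character vector acts as a hash
lemma_dict <- c("koty" = "kot", "psy" = "pies")

lemmatize_text <- function(text) {
  words <- unlist(strsplit(text, "\\s+"))
  hits <- lemma_dict[words]                 # NA where no dictionary entry
  words[!is.na(hits)] <- hits[!is.na(hits)] # replace only matched words
  paste(words, collapse = " ")
}

corpus <- VCorpus(VectorSource(c("koty i psy")))
# content_transformer() adapts the plain character function to tm documents
corpus <- tm_map(corpus, content_transformer(lemmatize_text))
```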

How can I cluster thousands of documents using the R tm package?

…衆ロ難τιáo~ submitted on 2019-12-09 23:14:27

Question: I have about 25,000 documents that need to be clustered, and I was hoping to use the R tm package. Unfortunately, I run out of memory at about 20,000 documents. The following function shows what I am trying to do, using dummy data. I run out of memory when I call the function with n = 20 on a Windows machine with 16 GB of RAM. Are there any optimizations I could make? Thank you for any help.

    make_clusters <- function(n) {
      require(tm)
      require(slam)
      docs <- unlist(lapply(letters
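Memory usually blows up when the sparse document-term matrix is densified via as.matrix() or dist(). A hedged sketch of keeping everything sparse with slam until the (much smaller) document-by-document similarity matrix; the sparsity threshold and clustering choice are illustrative:

```r
library(tm)
library(slam)

corpus <- VCorpus(VectorSource(c("alpha beta", "beta gamma", "gamma delta")))
dtm <- DocumentTermMatrix(corpus)

# Drop rare terms while the matrix is still sparse
dtm <- removeSparseTerms(dtm, sparse = 0.99)

# Cosine similarity computed on the sparse representation via slam,
# avoiding as.matrix() and dist() on the full DTM
norms <- sqrt(row_sums(dtm^2))
sim <- tcrossprod_simple_triplet_matrix(dtm) / outer(norms, norms)
hc <- hclust(as.dist(1 - sim), method = "ward.D2")
```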

convert corpus into data.frame in R

女生的网名这么多〃 submitted on 2019-12-09 13:11:38

Question: I'm using the tm package to apply stemming, and I need to convert the resulting data into a data frame. A solution for this can be found here: R tm package vcorpus: Error in converting corpus to data frame. But in my case the content of the corpus prints as:

    [[2195]]
    i was very impress

instead of:

    [[2195]]
    "i was very impress"

and because of this, if I apply:

    data.frame(text = unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors = FALSE)

the result will be <NA>. Any help is much appreciated!
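A sketch of pulling the text out with content() (or as.character()) instead of list indexing, which sidesteps the <NA> result (the corpus contents are illustrative):

```r
library(tm)

mycorpus <- VCorpus(VectorSource(c("i was very impress", "another text")))

# content() extracts the raw character content of each document,
# regardless of how the corpus happens to print
df <- data.frame(text = sapply(mycorpus, content),
                 stringsAsFactors = FALSE)
```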