tm

Lemmatization using txt file with lemmes in R

假装没事ソ Submitted on 2019-12-04 19:47:56
I would like to use an external txt file with Polish lemmas, structured as follows (source for lemma lists for many other languages: http://www.lexiconista.com/datasets/lemmatization/ ):

Abadan Abadanem
Abadan Abadanie
Abadan Abadanowi
Abadan Abadanu
abadańczyk abadańczycy
abadańczyk abadańczyka
abadańczyk abadańczykach
abadańczyk abadańczykami
abadańczyk abadańczyki
abadańczyk abadańczykiem
abadańczyk abadańczykom
abadańczyk abadańczyków
abadańczyk abadańczykowi
abadańczyk abadańczyku
abadanka abadance
abadanka abadanek
abadanka abadanką
abadanka abadankach
abadanka abadankami

What packages and with …
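A minimal base-R sketch of how such a two-column file can drive lemmatization. The sample lines below are an illustrative stand-in for the downloaded file, assumed here to be tab-separated with the lemma first:

```r
# Toy stand-in for the lemma file: each line is "lemma<TAB>inflected form"
lemma_lines <- c("Abadan\tAbadanem", "Abadan\tAbadanie",
                 "abadańczyk\tabadańczyka")
# In practice: lemma_lines <- readLines("lemmatization-pl.txt", encoding = "UTF-8")
parts <- strsplit(lemma_lines, "\t")
lemma_map <- setNames(vapply(parts, `[`, "", 1),   # values: lemmas
                      vapply(parts, `[`, "", 2))   # names: inflected forms

# Replace every known inflected form with its lemma; leave unknown tokens alone
lemmatize <- function(tokens) {
  hit <- tokens %in% names(lemma_map)
  tokens[hit] <- lemma_map[tokens[hit]]
  tokens
}

lemmatize(c("Abadanie", "kot"))  # "Abadan" "kot"
```

The resulting `lemmatize()` can then be applied to a tokenized corpus, e.g. inside a `tm_map` transformation.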

How can I cluster thousands of documents using the R tm package?

醉酒当歌 Submitted on 2019-12-04 19:14:12
I have about 25,000 documents that need to be clustered, and I was hoping to be able to use the R tm package. Unfortunately I run out of memory at about 20,000 documents. The following function shows what I am trying to do using dummy data; I run out of memory when I call it with n = 20 on a Windows machine with 16 GB of RAM. Are there any optimizations I could make? Thank you for any help.

make_clusters <- function(n) {
  require(tm)
  require(slam)
  docs <- unlist(lapply(letters[1:n], function(x) rep(x, 1000)))
  tdf <- TermDocumentMatrix(Corpus(VectorSource(docs)), control = list…
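One common mitigation, sketched here on a toy dense matrix (the general idea behind tm's removeSparseTerms, which does the same on the sparse representation), is to drop rare terms before clustering, which shrinks the matrix dramatically:

```r
# Toy document-term matrix: 20 "documents" x 10 "terms" of random counts
set.seed(42)
dtm <- matrix(rpois(200, 1), nrow = 20)

# Keep only terms that occur in at least 10% of documents
doc_freq <- colSums(dtm > 0)
keep <- doc_freq / nrow(dtm) >= 0.10
dtm_small <- dtm[, keep, drop = FALSE]

# Cluster the reduced matrix
cl <- kmeans(dtm_small, centers = 3)$cluster
```

For 25k real documents the reduction step should be done on the sparse matrix itself (via slam) so a dense copy is never materialized.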

R: add title to wordcloud graphics / png

╄→гoц情女王★ Submitted on 2019-12-04 17:57:27
I have some working R code that generates a tag cloud from a term-document matrix. Now I want to create a whole bunch of tag clouds from many documents and inspect them visually at a later time. To know which document(s)/corpus a tag-cloud picture belongs to, I'd like to add a title to the generated graphic. How do I do that? Maybe this is obvious, but I'm still a beginner with R graphics. My own corpus is too big to list here, but the code from this SO question (combined with the …
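wordcloud() itself has no title argument, so a common workaround is to reserve a strip above the plot region with layout() and draw the title yourself. A base-graphics sketch (the wordcloud() call is commented out so the sketch stays self-contained; the title text is illustrative):

```r
png("cloud.png", width = 480, height = 520)
layout(matrix(1:2, nrow = 2), heights = c(1, 6))  # thin title strip + cloud area
par(mar = rep(0, 4))
plot.new()
text(0.5, 0.5, "Corpus: 2012 keywords", cex = 1.6, font = 2)  # the title
# wordcloud(words, freqs)  # the cloud would render in this second panel
plot.new()                 # placeholder second panel for this sketch
dev.off()
```

Since the title is drawn into the same device, it is baked into the png, which is what you want when reviewing many clouds later.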

Compute ngrams for each row of text data in R

六眼飞鱼酱① Submitted on 2019-12-04 17:47:59
I have a data column of the following format:

Text
Hello world
Hello
How are you today
I love stackoverflow
blah blah blahdy

I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the ngrams for the entire column. How can I apply this function to each observation in my data separately?

Is this what you're after?

library("RWeka")
library("tm")
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' …
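A base-R sketch of the per-row idea (avoiding the Java-backed RWeka tokenizer so the example stays self-contained): build the trigrams for one string, then lapply over the rows:

```r
# Return all word trigrams of a single string
trigrams <- function(s) {
  w <- strsplit(tolower(s), "\\s+")[[1]]
  if (length(w) < 3) return(character(0))
  vapply(seq_len(length(w) - 2),
         function(i) paste(w[i:(i + 2)], collapse = " "), "")
}

texts <- c("Hello world Hello", "How are you today")
lapply(texts, trigrams)  # one vector of trigrams per row
```

The same lapply pattern works with textcnt() or NGramTokenizer() in place of the hand-rolled function: apply it to each element rather than to the whole column at once.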

Does tm package itself provide a built-in way to combine document-term matrices?

风流意气都作罢 Submitted on 2019-12-04 17:42:18
Does the tm package itself provide a built-in way to combine document-term matrices? I generated four document-term matrices on the same corpus, one each for 1-, 2-, 3-, and 4-grams. They are all really big (200k × 10k), so converting them into data frames and then cbinding them is out of the question. I know I could write a program that records the non-zero elements in each of the matrices and builds a sparse matrix, but that is a lot of trouble. It just seems natural for the tm package to provide this functionality. So if it does, I don't want to rebuild something that has already been built. If it doesn't, is there any handier …
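tm stores its matrices as slam simple_triplet_matrix objects, so one route (a sketch, assuming slam is installed; it is a tm dependency) is slam's own cbind, which joins the matrices without densifying, provided the documents/rows line up:

```r
library(slam)
# Tiny stand-ins for the 1-gram and 2-gram DTMs (same documents in each)
m1 <- as.simple_triplet_matrix(matrix(1:4, nrow = 2))
m2 <- as.simple_triplet_matrix(matrix(5:8, nrow = 2))
m12 <- cbind(m1, m2)   # still sparse; dim(m12) is 2 x 4
```

Only the non-zero triplets are concatenated, so the combined 200k × 40k matrix stays roughly the size of the four inputs put together.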

Split delimited strings into distinct columns in R dataframe

妖精的绣舞 Submitted on 2019-12-04 17:39:15
I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame:

data <- data.frame(id = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"))

(please note the different delimiters among columns)

The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative). I need two data frames like these:

tok1.occurrences:

+----+---+---+---+---+---+
| id | a | b | c | d | e |
+----+---+---+---+---+---+
|  1 | 1 | 1 | …
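For a single column with a known delimiter, a base-R sketch that builds the occurrence counts shown above:

```r
data <- data.frame(id = c(1, 2, 3),
                   tok1 = c("a, b, c", "a, a, d", "b, d, e"),
                   stringsAsFactors = FALSE)

toks <- strsplit(data$tok1, ",\\s*")        # split each row on the delimiter
lvls <- sort(unique(unlist(toks)))          # full token set: a b c d e
occ  <- t(sapply(toks, function(x) table(factor(x, levels = lvls))))
occurrences <- cbind(data["id"], occ)       # id plus one count column per token
```

Repeating this per string column (with its own delimiter pattern, e.g. `"\\|"` for tok2) yields one occurrence frame per column.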

Error in UseMethod(“meta”, x) : no applicable method for 'try-error' applied to an object of class “character”

守給你的承諾、 Submitted on 2019-12-04 17:36:23
I am using the tm package in R to do stemming on my corpus. However, I get an error when I run DocumentTermMatrix:

Error in UseMethod("meta", x) : no applicable method for 'try-error' applied to an object of class "character"

Here is my workflow:

library(tm)
myCorpus <- Corpus(VectorSource(training$FullDescription))
myCorpus <- tm_map(myCorpus, content_transformer(tolower), lazy = TRUE)
myCorpus <- tm_map(myCorpus, removePunctuation, lazy = TRUE)
myCorpus <- tm_map(myCorpus, removeNumbers, lazy = TRUE)
myStopwords <- c(stopwords('english'), "available", "via")
myCorpus <- tm_map(myCorpus, removeWords, …
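In tm 0.6 and later, the usual cause of this class of error is a tm_map step whose function returns a bare character vector instead of a TextDocument; wrapping plain functions in content_transformer() avoids that. A sketch of a repaired pipeline (assuming tm >= 0.6; the sample text is illustrative):

```r
library(tm)
myCorpus <- Corpus(VectorSource(c("An example DESCRIPTION, with numbers 123.")))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))   # plain function: wrap it
myCorpus <- tm_map(myCorpus, removePunctuation)              # tm transformations need no wrapper
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removeWords,
                   c(stopwords("english"), "available", "via"))
dtm <- DocumentTermMatrix(myCorpus)
```

Dropping `lazy = TRUE` while debugging also helps, since lazy mapping can defer the failure to the DocumentTermMatrix call and obscure which step broke.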

tm Package error: Error defining Document Term Matrix

∥☆過路亽.° Submitted on 2019-12-04 17:21:56
I am analyzing the Reuters 21578 corpus (all the Reuters news articles from 1987) using the "tm" package. After importing the XML files into an R data file, I clean the text: convert to plain text, convert to lower case, remove stop words, etc. (as seen below). Then I try to convert the corpus to a document-term matrix, but receive an error message:

Error in UseMethod("Content", x) : no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work correctly up until the document-term matrix. I created a non-random subset of the corpus (with 4,000 documents) and …
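A debugging sketch for this class of error: elements that lost their document class during preprocessing are plain character strings, and those are what the 'Content'/'meta' methods choke on. A toy list stands in for the corpus here:

```r
# Toy "corpus": one proper document, one element degraded to a bare string
corpus_like <- list(structure("ok doc", class = "PlainTextDocument"),
                    "this one lost its class")   # the troublemaker
bad <- which(!vapply(corpus_like, inherits, logical(1), "PlainTextDocument"))
bad  # 2
```

Running the same `which(...)` check on the real 4,000-document subset pinpoints which documents (and therefore which preprocessing step) degraded to bare characters.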

Making a wordcloud, but with combined words?

空扰寡人 Submitted on 2019-12-04 16:55:41
I am trying to make a word cloud of publication keywords, for example: Educational data mining; collaborative learning; computer science; etc. My current code is the following:

KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year == 2012)))
KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)
# added tolower
KeywordsCorpus <- tm_map(KeywordsCorpus, tolower)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeWords, stopwords("english"))
# moved stripWhitespace
KeywordsCorpus <- tm_map(KeywordsCorpus, …
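One way to keep multi-word keywords intact (a base-R sketch; the wordcloud() call is left commented out so the sketch stays self-contained) is to split on the ";" delimiter instead of whitespace, so each whole phrase becomes one "word":

```r
# Illustrative keyword strings, one per publication
keys <- c("Educational data mining; collaborative learning",
          "collaborative learning; computer science")
phrases <- trimws(unlist(strsplit(keys, ";")))    # one element per whole keyword
freq <- sort(table(tolower(phrases)), decreasing = TRUE)
# wordcloud(names(freq), as.numeric(freq))  # phrases stay whole in the cloud
```

Passing the phrase frequencies directly to wordcloud() bypasses the whitespace tokenization that tm's pipeline would otherwise apply.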

Exploring tx-lcn

a 夏天 Submitted on 2019-12-04 16:06:15
1. Description:
   With the arrival of distributed services, distributed transactions inevitably become a focal point of distributed systems; hence this survey of tx-lcn.
2. The interaction between tx-lcn's TC and TM:
   Note: TC-A is the transaction initiator, TC-B and TC-C are transaction participants, and TM is the transaction manager. (Interaction diagram omitted.)
   Interaction steps:
   1. TC-A sends a create-group request to TM.
   2. TC-B sends a join-group request to TM.
   3. TC-C sends a join-group request to TM.
   4. TC-B returns its call result to TC-A.
   5. TC-B sends its transaction state to TM.
   6. TC-C returns its call result to TC-A.
   7. TC-C sends its transaction state to TM.
   8. TC-A sends a transaction-completed notification to TM.
   9. Based on the overall transaction state, TM sends the transaction outcome to each of TC-A, TC-B, and TC-C.
3. When to use each annotation:
   1. @LCNTransaction (typically for MySQL or Oracle, i.e. where the datastore supports transactions):
      Proxies the local connection, and commits or rolls back the transaction using the database's own commit or rollback.
   2. @TCCTransaction (typically for Redis or Memcached):
      Does not proxy the local connection; the transaction is committed or rolled back through the confirm and cancel methods of the class designated via callback.
   3. @TXCTransaction (typically for MySQL or Oracle, in scenarios where transactions are not supported) …