tm

How to search for specific terms in a DTM

北城以北 submitted on 2019-12-02 17:28:32
Question: I have a dataset of 200+ PDFs that I converted into a corpus. I'm using the tm package for R for text pre-processing and mining. So far, I've successfully created the DTM (document-term matrix) and can find the x most frequently occurring terms. The goal of my research, however, is to check whether certain terms are used in the corpus. I'm not so much looking for the most frequent terms; I have my own list of terms and want to check whether they occur and, if so, how many times. So far, I've tried
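A minimal sketch of one way to do this, assuming the matrix has already been built as dtm and the hand-made term list is called my_terms (both names are illustrative): restrict the check to the terms that actually appear as columns of the DTM, then sum their counts across documents.

    library(tm)

    my_terms <- c("climate", "policy", "emission")   # hypothetical term list

    # Keep only the terms that actually occur as columns of the DTM
    found <- intersect(my_terms, Terms(dtm))

    # Total occurrences of each term across all documents
    counts <- colSums(as.matrix(dtm[, found]))
    counts

An alternative is to pass the term list as a dictionary when the matrix is built, as in the dictionary sketch further down this page.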

How to store Sparsity and Maximum term length of a Term document matrix from tm

霸气de小男生 submitted on 2019-12-02 12:14:28
How do I store the sparsity and maximum term length of a TermDocumentMatrix in separate variables in R while finding n-grams? library(tm) library(RWeka) #stdout <- vector('character') #con <- textConnection('stdout','wr',local = TRUE) #reading the csv file worklog <- read.csv("To_Kamal_WorkLogs.csv"); #removing the unwanted columns cols <- c("A","B","C","D","E","F"); colnames(worklog)<-cols; worklog2 <- worklog[c("F")] #removing non-ASCII characters z=iconv(worklog2, "latin1", "ASCII", sub="") #cleaning the data Removing Date and Time worklog2$F=gsub("[0-9]+/[0-9]+/[0-9]+ [0-9]+:[0-9]+:[0-9]+ [A
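Neither value is stored as a named slot on the matrix, but both can be recomputed from the object itself. A minimal sketch, assuming the term-document matrix has already been built as tdm (the variable names are illustrative):

    library(tm)

    # Sparsity: share of empty cells as a percentage (roughly what print() reports)
    n_cells  <- prod(dim(tdm))
    n_filled <- length(tdm$v)      # non-zero entries of the underlying simple triplet matrix
    sparsity <- round(100 * (1 - n_filled / n_cells))

    # Maximal term length: the longest term in the matrix
    max_term_length <- max(nchar(Terms(tdm)))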

install.packages("tm") -> "dependency 'slam' is not available"

岁酱吖の submitted on 2019-12-02 08:48:10
I'm trying to install the tm package on IBM's Data Science Experience (DSX): install.packages("tm") However, I'm hitting this issue: "dependency 'slam' is not available". This post suggests that R version 3.3.1 will resolve the issue; however, the R version on DSX is R version 3.3.0 (2016-05-03). How can I resolve this issue on IBM DSX? Note that you don't have root access on DSX. I've seen similar questions on Stack Overflow, but none of them ask how to fix the issue on IBM DSX, e.g. dependency 'slam' is not available when installing TM package. Update: install.packages("slam") Returns: Installing
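One common workaround is to install an older slam release from the CRAN source archive into the user library (which needs no root access) and then install tm. A hedged sketch only: the archive URL pattern is standard, but the exact version that still builds on R 3.3.0 is an assumption, so check https://cran.r-project.org/src/contrib/Archive/slam/ for a suitable release.

    # Install an archived slam release from source into the user library
    install.packages("https://cran.r-project.org/src/contrib/Archive/slam/slam_0.1-37.tar.gz",
                     repos = NULL, type = "source")

    # With slam in place, tm should install normally
    install.packages("tm")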

opencv&c++

雨燕双飞 submitted on 2019-12-02 06:50:57
system(): related functions: fork, execve, waitpid, popen. Header file: #include <stdlib.h>. Declared function: int system(const char * string); system("pause") freezes the screen, which makes it easier to inspect a program's output; system("CLS") clears the screen. Calling the color command changes the console's foreground and background colours; the parameters are explained below. For example, with system("color 0A"); the 0 after color is the background-colour code and A is the foreground-colour code. The colour codes are: 0=black 1=blue 2=green 3=aqua 4=red 5=purple 6=yellow 7=white 8=grey 9=light blue A=light green B=light aqua C=light red D=light purple E=light yellow F=bright white. FileStorage class. The time_t data type is an 8-byte signed integer. Include file: <time.h>. In the time.h header we can also see several functions that take time_t as a parameter type or return type: double difftime(time_t time1, time_t time0); time_t mktime(struct tm * timeptr); time_t time(time_t * timer); char * asctime(const struct tm *

Text mining with tm package in R: remove words starting with [http] or any other specific word

微笑、不失礼 submitted on 2019-12-02 04:55:14
I am new to R and text mining. I made a word cloud out of a Twitter feed related to some term. The problem I'm facing is that the word cloud shows http:... or htt... How do I deal with this issue? I tried using the metacharacter * but I still doubt I'm applying it right: tw.text = removeWords(tw.text,c(stopwords("en"),"rt","http\\*")) Could somebody into text mining please help me with this? If you are looking to remove URLs from your string, you may use: gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x) where x would be: x <- c("some text http://idontwantthis.com", "same problem again http:/
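removeWords matches whole words only, so a pattern like "http\\*" never touches a full URL. A minimal sketch of the gsub-based approach, assuming tw.text is the character vector of tweets; the pattern simply deletes anything from "http" up to the next whitespace, which also catches bare http:... fragments.

    library(tm)

    # Strip URLs first, then stop words and "rt"
    tw.text <- gsub("http[^[:space:]]*", "", tw.text)
    tw.text <- removeWords(tw.text, c(stopwords("en"), "rt"))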

Remove all punctuation from text including apostrophes for tm package

寵の児 submitted on 2019-12-02 04:18:03
I have a vector consisting of Tweets (just the message text) that I am cleaning for text-mining purposes. I have used removePunctuation from the tm package like so: clean_tweet_text = removePunctuation(tweet_text) This has resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching apostrophes are not registered. For example, one of my keywords is climate, but if a tweet has 'climate it won't be counted. How can I remove all the apostrophes/single quotes from my vector? Here is the header from dput for a
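A likely cause is that the tweets contain the Unicode curly apostrophe (U+2019), which the [:punct:] class used by removePunctuation may not match. A minimal sketch of one fix, assuming tweet_text is the raw vector: strip straight and curly apostrophes explicitly, then let removePunctuation handle the rest.

    library(tm)

    clean_tweet_text <- gsub("['\u2019]", "", tweet_text)   # straight and curly apostrophes
    clean_tweet_text <- removePunctuation(clean_tweet_text) # everything else

Depending on the tm version, removePunctuation(x, ucp = TRUE) may also switch to Unicode punctuation classes and achieve the same thing in one call.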

How can I manually set the document id in a corpus?

五迷三道 submitted on 2019-12-02 02:58:10
Question: I am creating a Corpus from a data frame. I pass it as a VectorSource, as there is only one column I want to be used as the text source. This works fine; however, I need the document ids within the corpus to match the document ids from the data frame. The document ids are stored in a separate column in the original data frame. df <- as.data.frame(t(rbind(c(1,3,5,7,8,10), c("text", "lots of text", "too much text", "where will it end", "give peas a chance","help")))) colnames(df) <- c("ids",
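One approach, sketched below under the assumption that tm >= 0.7 is available: use a DataframeSource instead of a VectorSource. It requires the columns to be named doc_id and text, and it takes the corpus document ids from the doc_id column.

    library(tm)

    df <- data.frame(doc_id = as.character(c(1, 3, 5, 7, 8, 10)),
                     text   = c("text", "lots of text", "too much text",
                                "where will it end", "give peas a chance", "help"),
                     stringsAsFactors = FALSE)

    corpus <- VCorpus(DataframeSource(df))
    meta(corpus[[1]], "id")    # should print "1"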

How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

假如想象 submitted on 2019-12-02 02:07:19
I have a huge corpus, and I'm interested only in the appearance of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included? I know I can subset the resultant TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to memory size constraints. eipi10: You can modify a corpus to keep only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the
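Another route, sketched here as an alternative rather than the answer quoted above: pass the term list as a dictionary in the control argument, so the matrix is only ever built over those terms and the full vocabulary is never materialised (corpus and my_terms are illustrative names).

    library(tm)

    my_terms <- c("term1", "term2", "term3")   # the handful of terms known up front

    tdm <- TermDocumentMatrix(corpus, control = list(dictionary = my_terms))
    inspect(tdm)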