text-analysis

Error using “TermDocumentMatrix” and “Dist” functions in R

Submitted by 柔情痞子 on 2019-12-06 13:39:30
I have been trying to replicate the example here, but I have had some problems along the way. Everything worked fine until here:

docsTDM <- TermDocumentMatrix(docs8)
Error in UseMethod("meta", x) :
  no applicable method for 'meta' applied to an object of class "character"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
  all scheduled cores encountered errors in user code

I was able to fix that error by modifying the previous step, changing this:

docs8 <- tm_map(docs7, tolower)

to this:

docs8 <- tm_map(docs7, content_transformer(tolower))

But then I got in
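A minimal sketch of the working step, assuming a tm VCorpus named docs7 as in the question (the corpus contents below are placeholders of mine):

```r
library(tm)

# Toy corpus standing in for the question's docs7 (hypothetical contents)
docs7 <- VCorpus(VectorSource(c("First Document Text", "Second Document Text")))

# tolower() returns a plain character vector, which destroys tm's document
# structure; content_transformer() wraps it so the result stays a corpus.
docs8 <- tm_map(docs7, content_transformer(tolower))

docsTDM <- TermDocumentMatrix(docs8)
inspect(docsTDM)
```

The same wrapping applies to any base-R string function (e.g. gsub) used inside tm_map; tm's own transformations such as removePunctuation or stripWhitespace do not need it.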

Why isn't Stanford Topic Modeling Toolbox producing lda-output directory?

Submitted by 我是研究僧i on 2019-12-06 12:24:54
I tried to run this code from GitHub (following the 1-2-3 steps), which identifies 30 topics in Sarah Palin's 14,500 emails. The topics discovered by the author are here. However, the Stanford Topic Modeling Toolbox is not producing an lda-output directory for me. It produced lda-86a58136-30-2b1a90a6, but the summary.txt in this folder only shows the initial assignment of topics, not the final one. Any idea how to produce the lda-output directory with the final summary of discovered topics? Thanks in

Create sentence (row) to POS tags counts (column) matrix from a dataframe

Submitted by 孤者浪人 on 2019-12-06 11:12:08
I am trying to build a matrix where the first row lists parts of speech and the first column lists sentences; the values in the matrix should show the number of occurrences of each POS tag in a sentence. I am creating the POS tags this way:

data = pd.read_csv(open('myfile.csv'), sep=';')
target = data["label"]
del data["label"]
data.sentence = data.sentence.str.lower()  # all strings in data frame to lowercase
for line in data.sentence:
    line_new = nltk.pos_tag(nltk.word_tokenize(line))
    print(line_new)

The output is:

[('together', 'RB'), ('with', 'IN'), ('the', 'DT'), ('6th', 'CD'), ('battalion', 'NN'), ('of', 'IN'), ('the',
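Once each sentence has been tagged, the counts matrix can be assembled from the (word, tag) pairs with the standard library alone; a sketch, where the tagged sentences are hypothetical stand-ins for nltk.pos_tag output:

```python
from collections import Counter

# Hypothetical pos_tag output for two sentences: lists of (word, tag) pairs
tagged_sentences = [
    [("together", "RB"), ("with", "IN"), ("the", "DT"), ("6th", "CD"), ("battalion", "NN")],
    [("the", "DT"), ("attack", "NN"), ("failed", "VBD")],
]

# Column order: every tag seen anywhere in the data, sorted for stability
all_tags = sorted({tag for sent in tagged_sentences for _, tag in sent})

# One row per sentence, one column per tag, values = tag counts
matrix = []
for sent in tagged_sentences:
    counts = Counter(tag for _, tag in sent)
    matrix.append([counts.get(tag, 0) for tag in all_tags])

print(all_tags)   # column headers
for row in matrix:
    print(row)
```

If a labelled frame is preferred, the rows can be wrapped in a pandas DataFrame with columns=all_tags.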

Quanteda package, Naive Bayes: How can I predict on different-featured test data?

Submitted by 点点圈 on 2019-12-06 09:21:06
I used quanteda::textmodel_NB to create a model that categorizes text into one of two categories. I fit the model on a training data set from last summer. Now I am trying to use it this summer to categorize new text we get here at work. I tried doing this and got the following error:

Error in predict.textmodel_NB_fitted(model, test_dfm) :
  feature set in newdata different from that in training set

The code in the function that generates the error can be found here, at lines 157 to 165.
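The usual fix is to conform the test dfm to the training feature set before predicting; a hedged sketch using the question's object names (dfm_match is available in quanteda >= 1.4; older versions used dfm_select with the training dfm as the pattern):

```r
library(quanteda)

# Align the test dfm's features with the training dfm's features:
# features unseen in training are dropped, missing ones are added as zero columns.
test_dfm_matched <- dfm_match(test_dfm, features = featnames(train_dfm))

predict(model, newdata = test_dfm_matched)
```

Here train_dfm is assumed to be the dfm the model was fitted on; after matching, the two feature sets are identical and in the same order, which is what predict() requires.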

Big Text Corpus breaks tm_map

Submitted by 老子叫甜甜 on 2019-12-06 00:30:26
I have been breaking my head over this one for the last few days. I searched the SO archives and tried the suggested solutions, but I just can't seem to get this to work. I have sets of txt documents in folders such as 2000 06, 1995-99, etc., and want to run some basic text mining operations, such as creating a document term matrix and a term document matrix, and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus,
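One common way to get a large corpus through tm's transformations is to process it in chunks, so that memory stays bounded and a failure in one batch is easy to localise; a hedged sketch, assuming the full corpus is a VCorpus named big_corpus (the name and chunk size are mine):

```r
library(tm)

chunk_size <- 1000
n <- length(big_corpus)            # big_corpus: the full VCorpus (assumed)
starts <- seq(1, n, by = chunk_size)

# Build one document-term matrix per chunk of documents
dtm_list <- lapply(starts, function(s) {
  chunk <- big_corpus[s:min(s + chunk_size - 1, n)]
  chunk <- tm_map(chunk, content_transformer(tolower))
  chunk <- tm_map(chunk, removePunctuation)
  DocumentTermMatrix(chunk)
})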

How to split a text into two meaningful words in R

Submitted by 你离开我真会死。 on 2019-12-05 22:19:24
This is the text in my dataframe df, which has a text column called 'problem_note_text':

SSCIssue: Note Dispenser Failure performed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn req Contact details - Olivia taber 01159063390 / 7am-11pm

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, "\\s+", " ")  # collapse runs of whitespace into a single space
df$problem_note_text = str
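The cleaning steps can be sketched on a single string in base R, so it runs without extra packages (the sample text below is shortened from the question):

```r
x <- "SSCIssue: Note Dispenser Failure / asked the stores 01159063390 / 7am-11pm"

x <- tolower(x)               # lower-case everything
x <- gsub("[0-9]+", "", x)    # drop digits (tm::removeNumbers does the same)
x <- gsub("\\s+", " ", x)     # collapse runs of whitespace to one space
x <- trimws(x)                # trim leading/trailing whitespace

x
```

With a column instead of a single string, the same calls apply element-wise, which is why the question's df$problem_note_text pipeline has the same shape.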

R Text Mining with quanteda

Submitted by 谁说胖子不能爱 on 2019-12-04 22:41:32
I have a data set of Facebook posts (via Netvizz) and I use the quanteda package in R. Here is my R code:

# Load the relevant dictionary (relevant for the analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")

# Read file
# Facebook posts can be generated by FB Netvizz:
# https://apps.facebook.com/netvizz
# Load FB posts as a .csv file from the .zip file
fbpost <- read.csv("D:/FB-com.csv", sep = ";")

# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message)  # one column with 2700 entries

# Define as corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# LIWC
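The LIWC counts can then be obtained by applying the dictionary when building the document-feature matrix; a sketch assuming the objects above (the dictionary = argument to dfm() is the API of the quanteda versions current at the time of this question; newer quanteda uses tokens() plus dfm_lookup()):

```r
# quanteda API of this era: apply the dictionary while building the dfm
fb_dfm <- dfm(fb_corp, dictionary = liwcdict)

# Current quanteda (>= 3.0) equivalent:
# fb_dfm <- dfm_lookup(dfm(tokens(fb_corp)), dictionary = liwcdict)

topfeatures(fb_dfm)   # most frequent LIWC categories
```

Each column of fb_dfm is then a LIWC category rather than a word, with the per-document counts of matching terms.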

tm Package error: Error defining Document Term Matrix

Submitted by ∥☆過路亽.° on 2019-12-04 17:21:56
I am analyzing the Reuters-21578 corpus, all the Reuters news articles from 1987, using the tm package. After importing the XML files into an R data file, I clean the text: convert to plain text, convert to lower case, remove stop words, etc. (as seen below). Then I try to convert the corpus to a document term matrix, but I receive an error message:

Error in UseMethod("Content", x) :
  no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps work correctly up until the document term matrix. I created a non-random subset of the corpus (with 4000 documents) and
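This "Content" error typically appears when a transformation has returned plain character vectors instead of document objects, so the corpus is no longer a corpus by the time DocumentTermMatrix() sees it. Wrapping base-R functions in content_transformer() avoids that; a minimal sketch with placeholder documents:

```r
library(tm)

# Placeholder documents standing in for the Reuters articles
reuters <- VCorpus(VectorSource(c("Oil prices rose in 1987.", "Grain exports fell.")))

reuters <- tm_map(reuters, content_transformer(tolower))  # not tm_map(reuters, tolower)
reuters <- tm_map(reuters, removeWords, stopwords("english"))
reuters <- tm_map(reuters, stripWhitespace)

dtm <- DocumentTermMatrix(reuters)
dim(dtm)
```

If the documents came in as XML, converting them first with tm_map(reuters, PlainTextDocument) (or the corpus reader's plain-text option) serves the same purpose of keeping every element a proper text document.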

String matching to estimate similarity

Submitted by 爷,独闯天下 on 2019-12-04 16:55:10
I want to analyse a field of 100-character length and estimate a similarity %. For example, for the same question, "What's your opinion on smartphones?":

Person A: "Best way to waste money"
Person B: "Amazing stuff. Lets you stay connected all the time"
Person C: "Instrument to waste money and time"

Of these, just by matching individual words, A and C sound similar. I am trying to do something like this to start with in R, and later on extend it to match combinations of words like "Best", "Best way", "Best way waste", etc. I am a newbie to text analysis and R and could not get the proper naming of these
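Word-overlap (Jaccard) similarity is one simple way to start; a base-R sketch over two of the example answers (the function name is mine):

```r
# Jaccard similarity over the sets of words in two strings
word_jaccard <- function(a, b) {
  wa <- unique(tolower(unlist(strsplit(a, "\\s+"))))
  wb <- unique(tolower(unlist(strsplit(b, "\\s+"))))
  length(intersect(wa, wb)) / length(union(wa, wb))
}

a <- "Best way to waste money"
c_ <- "Instrument to waste money and time"

word_jaccard(a, c_)   # shared words: "to", "waste", "money" -> 3/8 = 0.375
```

Extending to word combinations amounts to computing the same ratio over n-grams instead of single words; for character-level measures (edit distance and friends), the stringdist package offers ready-made functions.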
