text-mining

Counting n-grams with the tm package in R

冷暖自知 submitted on 2019-12-06 05:12:05
Question: I created a script that counts the frequency of words in a document using a DocumentTermMatrix object and a dictionary in R. The script works on individual words but not on compound terms, e.g. "foo", "bar", "foo bar". This is the code:

require(tm)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
inspect(DocumentTermMatrix(myCorpus, list(dictionary = c("foo", "bar", "foo bar"))))

But the result is:

      Terms
Docs   bar foo foo bar
   1     1   1       0

Instead, I would expect to find "foo bar" = 1. How
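A common workaround (not part of the original post, and assuming the RWeka package is available) is to build the matrix with a tokenizer that emits bigrams as well as single words, so that "foo bar" becomes a term in its own right; a minimal sketch:

library(tm)
library(RWeka)   # provides NGramTokenizer

my.docs  <- c("foo bar word1 word2")
myCorpus <- VCorpus(VectorSource(my.docs))   # VCorpus, so the custom tokenizer is honoured

UniBigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 1, max = 2))

dtm <- DocumentTermMatrix(myCorpus,
                          control = list(tokenize   = UniBigramTokenizer,
                                         dictionary = c("foo", "bar", "foo bar")))
inspect(dtm)   # the "foo bar" column should now contain 1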

How to find term frequency within a DTM in R?

拥有回忆 submitted on 2019-12-06 02:55:18
I've been using the tm package to create a DocumentTermMatrix as follows:

library(tm)
library(RWeka)
library(SnowballC)

src <- DataframeSource(data.frame(data3$JobTitle))

# create a corpus and transform data
# sets the default number of threads to use
options(mc.cores = 1)
c_copy <- c <- Corpus(src)
c <- tm_map(c, content_transformer(tolower), mc.cores = 1)
c <- tm_map(c, content_transformer(removeNumbers), mc.cores = 1)
c <- tm_map(c, removeWords, stopwords("english"), mc.cores = 1)
c <- tm_map(c, content_transformer(stripWhitespace), mc.cores = 1)

# make DTM
dtm <- DocumentTermMatrix(c, control = list
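For reference, the usual way to read term frequencies off a finished DTM is to sum its columns; a short sketch (assuming dtm is the matrix built above):

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 10)                    # the ten most frequent terms

findFreqTerms(dtm, lowfreq = 5)   # terms occurring at least 5 times, without densifying the matrix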

How to use OpenNLP to get POS tags in R?

核能气质少年 submitted on 2019-12-06 02:34:50
Question: Here is the R code:

library(NLP)
library(openNLP)

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

str <
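For readers trying the function out, a minimal usage sketch (assuming the openNLPmodels.en package that supplies the Maxent models is installed):

s <- "Pierre Vinken, 61 years old, will join the board as a nonexecutive director."
res <- tagPOS(s)
res$POStagged   # each token followed by its tag, e.g. "Pierre/NNP Vinken/NNP ,/, ..."
res$POStags     # the bare tag sequence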

String Distance Matrix in Python

家住魔仙堡 submitted on 2019-12-06 01:25:20
Question: How can I calculate a Levenshtein distance matrix of strings in Python?

       str1  str2  str3  str4  ...  strn
str1    0.8   0.4   0.6   0.1  ...   0.2
str2    0.4   0.7   0.5   0.1  ...   0.1
str3    0.6   0.5   0.6   0.1  ...   0.1
str4    0.1   0.1   0.1   0.5  ...   0.6
...
strn    0.2   0.1   0.1   0.6  ...   0.7

Using a distance function we can calculate the distance between 2 words. But here I have one list containing n strings. I want to calculate the distance matrix, and after that I want to cluster the words.

Answer 1:
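As a side note (not from the original thread), base R computes the same kind of matrix in one call: adist() returns pairwise Levenshtein distances, which can feed straight into hierarchical clustering. A sketch with made-up example strings:

words <- c("apple", "apply", "banana", "band", "bandana")
d <- adist(words)                        # pairwise Levenshtein (edit) distances
rownames(d) <- colnames(d) <- words
hc <- hclust(as.dist(d), method = "average")
cutree(hc, k = 2)                        # cluster label for each word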

Big Text Corpus breaks tm_map

老子叫甜甜 submitted on 2019-12-06 00:30:26
Question: I have been breaking my head over this one over the last few days. I searched all the SO archives and tried the suggested solutions, but just can't seem to get this to work. I have sets of txt documents in folders such as "2000 06", "1995 -99", etc., and want to run some basic text-mining operations, such as creating a document-term matrix and a term-document matrix and doing some operations based on co-locations of words. My script works on a smaller corpus; however, when I try it with the bigger corpus,
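For context, a minimal sketch of the kind of pipeline described, reading one of the year folders (the folder name "2000 06" is taken from the question) into a corpus and building both matrices; when tm_map fails only on larger corpora, forcing single-threaded mapping (mc.cores = 1, as in the DTM question above) is a frequently suggested workaround on older tm versions:

library(tm)

# read every .txt file in one folder into a volatile corpus
corp <- VCorpus(DirSource("2000 06", encoding = "UTF-8"),
                readerControl = list(language = "en"))

corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corp)   # documents x terms
tdm <- TermDocumentMatrix(corp)   # terms x documents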

Lucene Entity Extraction

落爺英雄遲暮 submitted on 2019-12-05 18:20:26
Given a finite dictionary of entity terms, I'm looking for a way to do entity extraction with intelligent tagging using Lucene. Currently I've been able to use Lucene for:

- Searching for complex phrases with some fuzziness
- Highlighting results

However, I'm not aware how to:

- Get accurate offsets of the matched phrases
- Do entity-specific annotations per match (not just tags for every single hit)

I have tried using the explain() method, but this only gives the terms in the query which got the hit, not the offsets of the hit within the original text. Has anybody faced a similar problem and

How to read text in a table from a CSV file

此生再无相见时 submitted on 2019-12-05 15:06:44
I am new to the tm package. I want to read a csv file which contains one column with 2000 texts and a second column with a yes/no factor variable into a Corpus. My intention is to convert the text to a matrix and use the factor variable as the target for prediction. I would also need to divide the corpus into train and test sets. I read several documents, such as tm.pdf, etc., and found the documentation relatively limited. This is my attempt, following another thread on the same subject:

TexTest <- read.csv("C:/Test.csv")
m <- list(Text = "Text", Clasification = "Classification")
corpus1 <- Corpus(x
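A common route from such a CSV to a modelling matrix plus a train/test split (a sketch added for reference, assuming the two columns are named Text and Classification as in the question) is:

library(tm)

d <- read.csv("C:/Test.csv", stringsAsFactors = FALSE)

corpus <- VCorpus(VectorSource(d$Text))
dtm    <- DocumentTermMatrix(corpus)

X <- as.matrix(dtm)              # one row per text, one column per term
y <- factor(d$Classification)    # the yes/no target

set.seed(1)
train   <- sample(nrow(X), 0.7 * nrow(X))   # 70/30 split
X_train <- X[train, ];  y_train <- y[train]
X_test  <- X[-train, ]; y_test  <- y[-train]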

Explicit Semantic Analysis

好久不见. submitted on 2019-12-05 08:50:31
I came across this term called 'Explicit Semantic Analysis', which uses Wikipedia as a reference, finds the similarity between documents and categorizes them into classes (correct me if I am wrong). The link I came across is here. I wanted to learn more about it. Please help me out with it! This Explicit Semantic Analysis works along similar lines to semantic similarity. I got hold of this link, which provides a clear example of ESA.

Source: https://stackoverflow.com/questions/8707624/explicit-semantic-analysis

Support Vector Machine works on Training-set but not on Test-set in R (using e1071)

ぃ、小莉子 submitted on 2019-12-05 07:00:13
Question: I'm using a support vector machine for my document classification task. It classifies all the articles in my training set, but fails to classify the ones in my test set. trainDTM is the document-term matrix of my training set; testDTM is the one for the test set. Here's my (not so beautiful) code:

# create data.frame with labelled sentences
labeled <- as.data.frame(read.xlsx("C:\\Users\\LABELED.xlsx", 1, header = T))

# create training set and test set
traindata <- as.data.frame(labeled[1:700, c(
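A frequent cause of this symptom (a hedged note, not taken from the original thread) is that the test DTM ends up with a different set of term columns than the training DTM, so the fitted model never sees matching features. Building the test matrix over exactly the training vocabulary keeps the columns aligned; a sketch with tm and e1071 (testCorpus, trainlabels and testlabels are placeholder names):

library(tm)
library(e1071)

# restrict the test matrix to the terms the model was trained on
testDTM <- DocumentTermMatrix(testCorpus,
                              control = list(dictionary = Terms(trainDTM)))

fit  <- svm(as.matrix(trainDTM), trainlabels, kernel = "linear")
pred <- predict(fit, as.matrix(testDTM))
table(pred, testlabels)   # confusion table on the test set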

Best way to compare meaning of text documents?

假如想象 submitted on 2019-12-05 05:10:59
Question: I'm trying to find the best way to compare two text documents using AI and machine-learning methods. I've used TF-IDF cosine similarity and other similarity measures, but these compare the documents at a word (or n-gram) level. I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?

Answer 1: You should start by reading about the word2vec model. Use gensim and get the pretrained model from Google. For vectorizing a document, use the Doc2Vec() function.