text-mining

PANDAS find exact given string/word from a column

只愿长相守 submitted on 2021-02-11 17:36:42
Question: So, I have a pandas column named Notes which contains a sentence or explanation of some event. I am trying to find certain given words in that column, and when I find such a word I add it to the next column as Type. The problem is that for some specific words, for example Liar and Lies, it also picks up words like familiar and families, because they contain liar and lies as substrings.

    Notes                                  Type
    2 families are living in the address   Lies
    He is a liar                           Liar
    We are not familiar with this          Liar

As you can see from
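The usual fix is a regex with word boundaries, so only whole words match. A minimal sketch, using hypothetical data mirroring the question's Notes column:

```python
import re
import pandas as pd

df = pd.DataFrame({"Notes": [
    "2 families are living in the address",
    "He is a liar",
    "We are not familiar with this",
]})

# \b anchors the match at word boundaries, so "liar" no longer matches
# inside "familiar", nor "lies" inside "families".
pattern = r"\b(liar|lies)\b"
df["Type"] = df["Notes"].str.extract(pattern, flags=re.IGNORECASE)[0].str.capitalize()
```

With the boundaries in place, the substring rows come back as NaN instead of a false Type, and only "He is a liar" is tagged Liar.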

Error in nchar(Terms(x), type = “chars”) : invalid multibyte string, element 204, when inspecting document term matrix

ε祈祈猫儿з submitted on 2021-02-11 13:46:34
Question: Here is the source code that I have used:

    MyData <- Corpus(DirSource("F:/Data/CSV/Data"), readerControl = list(reader = readPlain, language = "cn"))
    SegmentedData <- lapply(MyData, function(x) unlist(segmentCN(x)))
    temp <- Corpus(DataframeSource(SegmentedData), readerControl = list(reader = readPlain, language = "cn"))

    # Preprocessing data
    temp <- tm_map(temp, removePunctuation)
    temp <- tm_map(temp, removeNumbers)
    removeURL <- function(x) gsub("http[[:alnum:]]*", " ", x)
    temp <- tm_map(temp, removeURL)
    temp
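The error comes from R's nchar() hitting text that is not valid in the session's encoding; the usual remedy is to re-encode the corpus to UTF-8 and drop or replace the invalid bytes (in R, something like iconv(x, to = "UTF-8", sub = "byte") applied to the documents). The underlying problem, illustrated in Python for clarity:

```python
# Raw bytes containing one byte (0xb4) that is invalid in UTF-8,
# as a corpus file with mixed encodings might.
raw = b"term frequency \xb4 matrix"

# Strict decoding fails, exactly like nchar() on an invalid multibyte string.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacing (or dropping) the offending bytes yields a clean string
# that downstream term-matrix code can measure safely.
clean = raw.decode("utf-8", errors="replace")
```

Element 204 in the error message points at the specific term that carries such bytes, which helps locate the offending source file.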

Find similar texts based on paraphrase detection [closed]

眉间皱痕 submitted on 2021-02-08 10:32:21
Question: Closed. This question does not meet Stack Overflow guidelines and is not currently accepting answers. Closed 6 years ago. I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools which can do this? Preferably in Python.

Answer 1: I believe the tool you are looking for is Latent Semantic Analysis. Given that my post is going to
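Latent Semantic Analysis builds on a TF-IDF term-document matrix and reduces it with SVD. Before reaching for full LSA, a plain TF-IDF cosine-similarity baseline (standard library only) already ranks paraphrases of the same content above unrelated text; a minimal sketch with made-up sentences:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weight dict per document (whitespace tokenization)."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    df = Counter(w for t in toks for w in set(t))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(t).items()} for t in toks]

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",    # paraphrase of the first
    "stock prices fell sharply today", # unrelated
]
v = tfidf(docs)
```

Here cosine(v[0], v[1]) exceeds cosine(v[0], v[2]); LSA improves on this baseline by also matching synonyms that share no surface words.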

Structural Topic Modeling in R: group the topics deductively and estimate effect

核能气质少年 submitted on 2021-02-08 09:16:22
Question: The stm package in R allows the user to estimate the relationship between metadata and topics. I have a model M with 40 topics, and I want to explore how they change over time. In stm, it should be something like this (adapted from Molly Roberts et al., stm: R Package for Structural Topic Models):

    prep = estimateEffect(1:40 ~ s(day), M, meta = out$meta, uncertainty = "Global")

This command will return 40 pairs of relationships, each referring to one topic. However, upon reading the topics I
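estimateEffect is R-specific, but the deductive-grouping idea behind the question is generic: collapse each document's proportions over the topics assigned to a group, then estimate the group's trend over time. A hypothetical standard-library-only illustration with made-up topic proportions:

```python
# Hypothetical per-document topic proportions (one row per document/day).
# Columns 0-1 form deductive group "A", columns 2-3 group "B".
theta = [
    [0.10, 0.10, 0.40, 0.40],  # day 0
    [0.20, 0.15, 0.35, 0.30],  # day 1
    [0.30, 0.20, 0.30, 0.20],  # day 2
    [0.35, 0.25, 0.25, 0.15],  # day 3
]
days = [0, 1, 2, 3]
group_a = [0, 1]

# Collapse the grouped topics into one proportion per document.
a_share = [sum(row[k] for k in group_a) for row in theta]

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)

slope = ols_slope(days, a_share)  # positive slope: group A grows over time
```

This sketch ignores the measurement uncertainty that estimateEffect propagates; it only shows the "sum the group's proportions, then regress on time" shape of the analysis.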

Word substitution within tidy text format

江枫思渺然 submitted on 2021-02-07 20:22:08
Question: Hi, I'm working with the tidy text format and I am trying to substitute the strings "emails" and "emailing" with "email".

    set.seed(123)
    terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
    df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
    df
    str(df)
    df$sentence <- as.character(df$sentence)
    tidy_df <- df %>% unnest_tokens(word, sentence)
    tidy_df %>%
      count(word, sort = TRUE) %>%
      filter(n > 20) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n)
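In R this is typically a recode of the token column (e.g. dplyr::recode, or a mutate with a lookup) after unnest_tokens and before count. The same token-level substitution, sketched in Python on a hypothetical token list:

```python
from collections import Counter

# Map inflected forms to the canonical token; anything absent maps to itself.
canonical = {"emails": "email", "emailing": "email"}

tokens = ["emails", "are", "nice", "emailing", "is", "fun", "email"]
normalized = [canonical.get(t, t) for t in tokens]

counts = Counter(normalized)  # "email" now pools all three variants
```

An explicit lookup table like this is safer than a blanket substring replacement, which would also rewrite unrelated words that merely contain "email".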

Quotes and hyphens not removed by tm package functions while cleaning corpus

左心房为你撑大大i submitted on 2021-02-07 12:39:26
Question: I'm trying to clean the corpus and I've used the typical steps, as in the code below:

    docs <- Corpus(DirSource(path))
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, content_transformer(removeNumbers))
    docs <- tm_map(docs, content_transformer(removePunctuation))
    docs <- tm_map(docs, removeWords, stopwords('en'))
    docs <- tm_map(docs, stripWhitespace)
    docs <- tm_map(docs, stemDocument)
    dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, there are a few words that come with quotes, such
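The likely culprit is that tm's removePunctuation defaults to the ASCII [:punct:] class, so typographic (curly) quotes and en/em dashes survive; the usual fix is removePunctuation(x, ucp = TRUE), which switches to Unicode punctuation. The distinction, illustrated in Python:

```python
import string
import unicodedata

text = "\u201cbroken\u201d corpus \u2013 it's here"  # curly quotes and an en dash

# ASCII-only removal, like the default [:punct:] class: curly quotes survive.
ascii_clean = text.translate(str.maketrans("", "", string.punctuation))

# Unicode-aware removal: drop every character whose Unicode
# category starts with "P" (all punctuation classes).
uni_clean = "".join(c for c in text if not unicodedata.category(c).startswith("P"))
```

After the Unicode-aware pass, the curly quotes and the dash are gone, so no quoted tokens leak into the document-term matrix.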

Text summarization in R language

社会主义新天地 submitted on 2021-02-07 04:19:29
Question: I have a long text file. Using the R language, I want to summarize the text into at least 10 to 20 lines, or into short sentences. How can I summarize text into about 10 lines with R?

Answer 1: You may try this (from the LSAfun package):

    genericSummary(D, k = 1)

where 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation.) For more information: http://search.r-project.org/library/LSAfun/html
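genericSummary is extractive: it scores the document's own sentences and returns the top k. A hypothetical standard-library-only version of the same idea, scoring each sentence by the average corpus frequency of its words:

```python
import re
from collections import Counter

def summarize(text, k=1):
    """Return the k highest-scoring sentences, kept in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))

    def score(s):
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

text = ("Topic models describe documents. Topic models need clean text. "
        "Bananas are yellow.")
summary = summarize(text, k=1)
```

For the 10-line summary in the question, k would be set to 10; LSAfun refines the scoring step with latent semantic similarity rather than raw frequency.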

问题 I have long text file using help of R language I want to summarize text in at least 10 to 20 line or in small sentences. How to summarize text in at least 10 line with R language ? 回答1: You may try this (from the LSAfun package): genericSummary(D,k=1) whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation). For more information: http://search.r-project.org/library/LSAfun/html