text-mining

PANDAS find exact given string/word from a column

只愿长相守 submitted on 2021-02-11 17:36:42
Question: So, I have a pandas column named Notes which contains a sentence or explanation of some event. I am trying to find certain given words in that column, and when I find such a word I add it to the next column as Type. The problem is that for some specific words, for example Liar and Lies, it also picks up words like familiar and families, because they contain liar and lies as substrings.

    Notes                                  Type
    2 families are living in the address   Lies
    He is a liar                           Liar
    We are not familiar with this          Liar

As you can see from
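The usual fix is a regex with word boundaries, so only whole words match. A minimal sketch, using hypothetical data mirroring the question's Notes column:

```python
import re
import pandas as pd

df = pd.DataFrame({"Notes": [
    "2 families are living in the address",
    "He is a liar",
    "We are not familiar with this",
]})

# \b anchors the match at word boundaries, so "liar" no longer matches
# inside "familiar", nor "lies" inside "families".
pattern = r"\b(liar|lies)\b"
df["Type"] = df["Notes"].str.extract(pattern, flags=re.IGNORECASE)[0].str.capitalize()
```

With the boundaries in place, the substring rows come back as NaN instead of a false Type, and only "He is a liar" is tagged Liar.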

Error in nchar(Terms(x), type = “chars”) : invalid multibyte string, element 204, when inspecting document term matrix

ε祈祈猫儿з submitted on 2021-02-11 13:46:34
Question: Here is the source code that I have used:

    MyData <- Corpus(DirSource("F:/Data/CSV/Data"), readerControl = list(reader = readPlain, language = "cn"))
    SegmentedData <- lapply(MyData, function(x) unlist(segmentCN(x)))
    temp <- Corpus(DataframeSource(SegmentedData), readerControl = list(reader = readPlain, language = "cn"))

    # Preprocessing data
    temp <- tm_map(temp, removePunctuation)
    temp <- tm_map(temp, removeNumbers)
    removeURL <- function(x) gsub("http[[:alnum:]]*", " ", x)
    temp <- tm_map(temp, removeURL)
    temp
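The error comes from R's nchar() hitting text that is not valid in the session's encoding; the usual remedy is to re-encode the corpus to UTF-8 and drop or replace the invalid bytes (in R, something like iconv(x, to = "UTF-8", sub = "byte") applied to the documents). The underlying problem, illustrated in Python for clarity:

```python
# Raw bytes containing one byte (0xb4) that is invalid in UTF-8,
# as a corpus file with mixed encodings might.
raw = b"term frequency \xb4 matrix"

# Strict decoding fails, exactly like nchar() on an invalid multibyte string.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacing (or dropping) the offending bytes yields a clean string
# that downstream term-matrix code can measure safely.
clean = raw.decode("utf-8", errors="replace")
```

Element 204 in the error message points at the specific term that carries such bytes, which helps locate the offending source file.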

Find similar texts based on paraphrase detection [closed]

眉间皱痕 submitted on 2021-02-08 10:32:21
Question: Closed. This question does not meet Stack Overflow guidelines and is not currently accepting answers. Closed 6 years ago. I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools which can do this? Preferably in Python.

Answer 1: I believe the tool you are looking for is Latent Semantic Analysis. Given that my post is going to
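Latent Semantic Analysis builds on a TF-IDF term-document matrix and reduces it with SVD. Before reaching for full LSA, a plain TF-IDF cosine-similarity baseline (standard library only) already ranks paraphrases of the same content above unrelated text; a minimal sketch with made-up sentences:

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF weight dict per document (whitespace tokenization)."""
    toks = [d.lower().split() for d in docs]
    n = len(toks)
    df = Counter(w for t in toks for w in set(t))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(t).items()} for t in toks]

def cosine(a, b):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat",
    "a cat was sitting on the mat",    # paraphrase of the first
    "stock prices fell sharply today", # unrelated
]
v = tfidf(docs)
```

Here cosine(v[0], v[1]) exceeds cosine(v[0], v[2]); LSA improves on this baseline by also matching synonyms that share no surface words.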

Structural Topic Modeling in R: group the topics deductively and estimate effect

核能气质少年 submitted on 2021-02-08 09:16:22
Question: The stm package in R allows the user to estimate the relationship between metadata and topics. I have a model M with 40 topics, and I want to explore how they change over time. In stm, it should be something like this (adapted from Molly Roberts et al., stm: R Package for Structural Topic Models):

    prep = estimateEffect(1:40 ~ s(day), M, meta = out$meta, uncertainty = "Global")

This command will return 40 pairs of relationships, each referring to one topic. However, upon reading the topics I
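estimateEffect is R-specific, but the deductive-grouping idea behind the question is generic: collapse each document's proportions over the topics assigned to a group, then estimate the group's trend over time. A hypothetical standard-library-only illustration with made-up topic proportions:

```python
# Hypothetical per-document topic proportions (one row per document/day).
# Columns 0-1 form deductive group "A", columns 2-3 group "B".
theta = [
    [0.10, 0.10, 0.40, 0.40],  # day 0
    [0.20, 0.15, 0.35, 0.30],  # day 1
    [0.30, 0.20, 0.30, 0.20],  # day 2
    [0.35, 0.25, 0.25, 0.15],  # day 3
]
days = [0, 1, 2, 3]
group_a = [0, 1]

# Collapse the grouped topics into one proportion per document.
a_share = [sum(row[k] for k in group_a) for row in theta]

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)

slope = ols_slope(days, a_share)  # positive slope: group A grows over time
```

This sketch ignores the measurement uncertainty that estimateEffect propagates; it only shows the "sum the group's proportions, then regress on time" shape of the analysis.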

Word substitution within tidy text format

江枫思渺然 submitted on 2021-02-07 20:22:08
Question: Hi, I'm working with the tidy text format and I am trying to substitute the strings "emails" and "emailing" with "email".

    set.seed(123)
    terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
    df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
    df
    str(df)
    df$sentence <- as.character(df$sentence)
    tidy_df <- df %>% unnest_tokens(word, sentence)
    tidy_df %>%
      count(word, sort = TRUE) %>%
      filter(n > 20) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n)
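In R this is typically a recode of the token column (e.g. dplyr::recode, or a mutate with a lookup) after unnest_tokens and before count. The same token-level substitution, sketched in Python on a hypothetical token list:

```python
from collections import Counter

# Map inflected forms to the canonical token; anything absent maps to itself.
canonical = {"emails": "email", "emailing": "email"}

tokens = ["emails", "are", "nice", "emailing", "is", "fun", "email"]
normalized = [canonical.get(t, t) for t in tokens]

counts = Counter(normalized)  # "email" now pools all three variants
```

An explicit lookup table like this is safer than a blanket substring replacement, which would also rewrite unrelated words that merely contain "email".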

Quotes and hyphens not removed by tm package functions while cleaning corpus

左心房为你撑大大i submitted on 2021-02-07 12:39:26
Question: I'm trying to clean the corpus and I've used the typical steps, as in the code below:

    docs <- Corpus(DirSource(path))
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, content_transformer(removeNumbers))
    docs <- tm_map(docs, content_transformer(removePunctuation))
    docs <- tm_map(docs, removeWords, stopwords('en'))
    docs <- tm_map(docs, stripWhitespace)
    docs <- tm_map(docs, stemDocument)
    dtm <- DocumentTermMatrix(docs)

Yet when I inspect the matrix, there are a few words that come with quotes, such
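The likely culprit is that tm's removePunctuation defaults to the ASCII [:punct:] class, so typographic (curly) quotes and en/em dashes survive; the usual fix is removePunctuation(x, ucp = TRUE), which switches to Unicode punctuation. The distinction, illustrated in Python:

```python
import string
import unicodedata

text = "\u201cbroken\u201d corpus \u2013 it's here"  # curly quotes and an en dash

# ASCII-only removal, like the default [:punct:] class: curly quotes survive.
ascii_clean = text.translate(str.maketrans("", "", string.punctuation))

# Unicode-aware removal: drop every character whose Unicode
# category starts with "P" (all punctuation classes).
uni_clean = "".join(c for c in text if not unicodedata.category(c).startswith("P"))
```

After the Unicode-aware pass, the curly quotes and the dash are gone, so no quoted tokens leak into the document-term matrix.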

Text summarization in R language

社会主义新天地 submitted on 2021-02-07 04:19:29
Question: I have a long text file. Using the R language, I want to summarize the text into at least 10 to 20 lines, or into short sentences. How can I summarize text into about 10 lines with R?

Answer 1: You may try this (from the LSAfun package):

    genericSummary(D, k = 1)

where 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation.) For more information: http://search.r-project.org/library/LSAfun/html
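genericSummary is extractive: it scores the document's own sentences and returns the top k. A hypothetical standard-library-only version of the same idea, scoring each sentence by the average corpus frequency of its words:

```python
import re
from collections import Counter

def summarize(text, k=1):
    """Return the k highest-scoring sentences, kept in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))

    def score(s):
        words = re.findall(r"\w+", s.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

text = ("Topic models describe documents. Topic models need clean text. "
        "Bananas are yellow.")
summary = summarize(text, k=1)
```

For the 10-line summary in the question, k would be set to 10; LSAfun refines the scoring step with latent semantic similarity rather than raw frequency.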

问题 I have long text file using help of R language I want to summarize text in at least 10 to 20 line or in small sentences. How to summarize text in at least 10 line with R language ? 回答1: You may try this (from the LSAfun package): genericSummary(D,k=1) whereby 'D' specifies your text document and 'k' the number of sentences to be used in the summary. (Further modifications are shown in the package documentation). For more information: http://search.r-project.org/library/LSAfun/html