Text mining with the tm package in R: remove words starting with "http" or any other specific word

微笑、不失礼 submitted on 2019-12-02 04:55:14

If you are looking to remove URLs from your string, you may use:

gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)

Where x would be:

x <- c("some text http://idontwantthis.com", 
         "same problem again http://pleaseremoveme.com")

It would be easier to give you a specific answer if you posted a sample of your data, but the following example gives you clean text with no URLs:

> clean_x <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
> clean_x
[1] "some text "          "same problem again "
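
One caveat with this pattern: the `(.*)` part is greedy, so when more text follows the URL the match can run on to the last "dot plus letters" in the string. The sketch below uses a made-up input `y` and a tighter pattern that assumes a URL ends at the first whitespace character; both the input and that assumption are mine, not part of the question.

```r
# Hypothetical input: the URL is followed by more text containing a dot.
y <- "see http://example.com and the notes.txt file"

# Original pattern: the greedy (.*) extends the match up to "notes.txt",
# so ordinary words after the URL are removed as well.
greedy <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", y)

# Tighter pattern (assumption: a URL stops at the first whitespace).
tight <- gsub("(f|ht)tps?://[^[:space:]]+", "", y)

print(greedy)
print(tight)
```

The tighter pattern also swallows punctuation glued to the end of a URL (for example a trailing full stop), which is usually acceptable when the text is tokenised afterwards.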

As a side point, it may be worth looking at existing methods for cleaning text before mining. For example, the clean function discussed here would let you do this automatically. Along similar lines, there are functions that strip tweet markup (#, @), punctuation and other undesirable entries from your text.
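
As a rough illustration of that kind of cleanup, here is a minimal base R sketch. It is not the clean function mentioned above; the sample tweets and the choice to drop whole @mentions and #hashtags are assumptions made for the example.

```r
# Hypothetical tweets, not taken from the question.
tweets <- c("Loving #rstats! Thanks @someone :)",
            "RT @another: text mining, again...")

# Drop @mentions and #hashtags (marker plus the word attached to it).
no_tags <- gsub("[@#][[:alnum:]_]+", "", tweets)

# Turn any remaining punctuation into spaces.
no_punct <- gsub("[[:punct:]]+", " ", no_tags)

# Collapse runs of whitespace and trim both ends.
clean_tweets <- trimws(gsub("[[:space:]]+", " ", no_punct))

print(clean_tweets)
```

If you would rather keep the hashtag word and drop only the "#" sign, removing punctuation on its own is enough.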

Apply the code below to a corpus to replace a string pattern with a space. The pattern can be a URL or any term you want to remove from the word cloud. For example, to remove terms starting with http or https:

# replace the matched pattern with a space

toSpace = content_transformer( function(x, pattern) gsub(pattern," ",x) )

tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "http[^[:space:]]*")

Or pass a pattern like the one below to remove URLs:

tweet_corpus_clean = tm_map( tweet_corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
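
Since toSpace only wraps gsub, a pattern can be tried on a plain character vector before it is run over a whole corpus. The sketch below does that with a made-up tweet; `to_space` and `sample_text` are names introduced here for illustration. It also shows that a pattern such as "https*" matches only the literal "http"/"https" prefix and leaves the rest of the link behind.

```r
# Same function body as toSpace, without the tm wrapper.
to_space <- function(x, pattern) gsub(pattern, " ", x)

# Hypothetical tweet text.
sample_text <- "new post https://t.co/abc123 out now"

# Matches just "https"; "://t.co/abc123" stays in the text.
prefix_only <- to_space(sample_text, "https*")

# Matches "http" and everything up to the next whitespace.
whole_term <- to_space(sample_text, "http[^[:space:]]*")

print(prefix_only)
print(whole_term)
```

Once a pattern behaves as expected here, it can be passed to tm_map as the third argument in the same way as above.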
