How to scrape web content and then count frequencies of words in R?


Here you go, start to finish. I've changed your web-scraping code so it picks up less non-text material, and the word counts are down at the bottom.

Here's your code for downloading the URLs...

library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog   <- getURL(url.link)
blog   <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)   ## the article URLs listed in the sitemap

I've changed your function to extract the text from each page...

traverse_each_page <- function(x) {
  # fetch one article and pull out the text of its main content div
  tmp <- htmlParse(getURI(x))
  xpathSApply(tmp, '//div[@id="mainContent"]', xmlValue)
}
pages <- sapply(titles[2:3], traverse_each_page)

Let's remove newlines and other non-text characters...

nont <- c("\n", "\t", "\r")
pages <- gsub(paste(nont,collapse="|"), " ", pages)

Regarding your second question, to inspect the contents of pages, just type its name at the console:

pages
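
If the pages are long, an optional quick check (not part of the original code) is to peek at the start of each one and see how much text it holds:

substr(pages, 1, 200)   # first 200 characters of each scraped page
nchar(pages)            # number of characters per page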

Now let's do your step 5: 'Count the frequencies of each word that appears for all the articles, case-insensitive.'

require(tm)
# convert the scraped pages into a corpus (one plain-text document per page)
mycorpus <- VCorpus(VectorSource(pages))
# prepare to remove stopwords, i.e. common words like 'the'
skipWords <- function(x) removeWords(x, stopwords("english"))
# prepare to remove other bits we usually don't care about
# (tolower is wrapped in content_transformer() so each step hands tm a proper text document)
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
# do it
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)
# make the document-term matrix, keeping words of 3 to 10 characters
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3, 10)))

Here's where you see the count of each word per document:

inspect(mydtm)
# you can also convert it to a data frame for more convenient viewing
my_df <- as.data.frame(as.matrix(mydtm))
my_df

Here's how you count the total frequencies of each word that appears for all the articles, case-insensitive...

apply(mydtm, 2, sum)

Does that answer your question? I guess you're probably only interested in the most frequent words (as @buruzaemon's answer details), or a certain subset of words, but that's another question...
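
For example, if you just want the top handful, a quick sketch building on mydtm from above (not something the original code did) is to sort those per-word totals:

word_totals <- sort(apply(mydtm, 2, sum), decreasing = TRUE)   # total per word, largest first
head(word_totals, 20)                                          # the 20 most frequent words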

buruzaemon

You should have a look at the R tm package; its introductory vignette is a good place to start. tm has everything you need to deal with corpora and build term-document matrices.

In principle, this would entail:

  1. Creating a Corpus from your data, possibly discarding punctuation, numbers and stopwords
  2. Creating a TermDocumentMatrix from your Corpus or corpora, depending on what you are doing
  3. Using as.matrix on your TermDocumentMatrix and doing rowSums on the raw matrix to obtain the word counts

Here is a brief code snippet as an example:

library(tm)

ctrl <- list(stopwords = TRUE,
             removePunctuation = TRUE,
             removeNumbers = TRUE)

# assuming your data is in some dataframe already...
corpus1 <- Corpus(DataframeSource(...))    
corpus2 <- Corpus(DataframeSource(...))

corp.all <- c(corpus1, corpus2)

tdm <- TermDocumentMatrix(corp.all, ctrl)
tdm.m <- as.matrix(tdm)

counts <- rowSums(tdm.m)

my.model <- data.frame(term      = names(counts),
                       frequency = as.numeric(counts),
                       stringsAsFactors = FALSE)

head(my.model[with(my.model, order(-frequency)),], 20)

I think you might want to discard stop words like "the", since they tend to be very frequent and thus not very informative. More detailed information on text mining in general can be found in this CrossValidated question thread.
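
If you also want to drop extra words beyond the standard English list, the stopwords entry in the control list also accepts a character vector; here's a small sketch (the extra words are made-up examples, not from the original answer):

custom_stops <- c(stopwords("english"), "also", "just", "really")   # standard list plus a few of your own

ctrl2 <- list(stopwords = custom_stops,
              removePunctuation = TRUE,
              removeNumbers = TRUE)

tdm2 <- TermDocumentMatrix(corp.all, ctrl2)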
