Question
This is my code:
library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
traverse_each_page <- function(x){
tmp <- htmlParse(x)
xpathApply(tmp, '//div[@id="mainContent"]')
}
pages <- lapply(titles[2:3], traverse_each_page)
Here is the pseudocode:
1. Take an XML document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link.
3. Parse the HTML content of each link.
4. Extract the text inside div id="mainContent".
5. Count the frequencies of each word that appears across all the articles, case-insensitive.
I have managed to complete steps 1-4. I need some help with no. 5.
Basically, if the word "the" appears twice in article 1 and five times in article 2, I want to know that "the" appears a total of seven times across the 2 articles.
Also, I do not know how to view the contents I have extracted into pages. I want to learn how to view the contents, which will make it easier for me to debug.
Answer 1:
Here you go, start to finish. I changed your web-scraping code so it picks up less non-text stuff, and the word counts are down at the bottom.
Here's your code for downloading the URLs...
library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
I've changed your function to extract the text from each page...
traverse_each_page <- function(x){
tmp <- htmlParse(getURI(x))
xpathSApply(tmp, '//div[@id="mainContent"]', xmlValue)
}
pages <- sapply(titles[2:3], traverse_each_page)
Let's remove newline and other non-text characters...
nont <- c("\n", "\t", "\r")
pages <- gsub(paste(nont,collapse="|"), " ", pages)
Regarding your second question, to inspect the contents in pages, just type it at the console:
pages
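If the articles are long, printing the whole object floods the console. A couple of quick base-R alternatives for eyeballing what you scraped (my suggestions, not part of the original answer):
str(pages)                       # type, length and a short preview of each element
nchar(pages)                     # how many characters were extracted per article
cat(substr(pages[1], 1, 300))    # the first 300 characters of the first article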
Now let's do your step 5: 'Count the frequencies of each word that appears across all the articles, case-insensitive.'
require(tm)
# convert list into corpus
mycorpus <- Corpus(VectorSource(pages))
# prepare to remove stopwords, ie. common words like 'the'
skipWords <- function(x) removeWords(x, stopwords("english"))
# prepare to remove other bits we usually don't care about
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
# do it
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)
# make document term matrix
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10)))
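Before inspecting it, it can help to check how big the matrix is and peek at the terms it kept (a quick check I've added, using tm's Terms() accessor):
dim(mydtm)           # documents x terms
head(Terms(mydtm))   # a few of the terms that survived the cleaning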
Here's where you see the count of each word per document...
inspect(mydtm)
# you can assign it to a data frame for more convenient viewing
my_df <- inspect(mydtm)
my_df
Here's how you count the total frequency of each word across all the articles, case-insensitive...
apply(mydtm, 2, sum)
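If you mainly care about the most frequent words overall, you can sort those totals; the sorting step is my addition on top of the line above:
word_totals <- apply(mydtm, 2, sum)             # total count of each word across the articles
head(sort(word_totals, decreasing = TRUE), 20)  # the 20 most frequent words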
Does that answer your question? I guess you're probably really only interested in the most frequent words (as @buruzaemon's answer details), or a certain subset of words, but that's another question...
Answer 2:
You should have a look at the R tm package; its vignette is a good place to start. tm has everything you need for working with corpora and building term-document matrices.
In principle, this would entail:
- Create a Corpus from your data, possibly discarding punctuation, numbers and stopwords
- Create a TermDocumentMatrix from your Corpus or corpora, depending on what you are doing
- Use as.matrix on your TermDocumentMatrix, and do rowSums on the raw matrix to obtain word counts
Here is a brief code snippet as an example:
library(tm)
ctrl <- list(stopwords=T,
removePunctuation=T,
removeNumbers=T)
# assuming your data is in some dataframe already...
corpus1 <- Corpus(DataframeSource(...))
corpus2 <- Corpus(DataframeSource(...))
corp.all <- c(corpus1, corpus2)
tdm <- TermDocumentMatrix(corp.all, ctrl)
tdm.m <- as.matrix(tdm)
counts <- rowSums(tdm.m)
my.model <- data.frame(cbind(names(counts), as.numeric(counts)),
stringsAsFactors=F)
names(my.model) <- c('term', 'frequency')
my.model$frequency <- as.numeric(my.model$frequency)
head(my.model[with(my.model, order(-frequency)),], 20)
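To get the total for one specific word across all the articles, you can look it up in my.model; bear in mind that with stopwords=T in ctrl, very common words such as "the" from the original question have already been dropped, so either pick a content word or leave the stopwords option out. A small sketch, with a placeholder term:
subset(my.model, term == "blog")   # replace "blog" with whatever word you are counting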
I think that you might want to discard stop words like "the", since they tend to be very frequent and thus not very important. More detailed information on text mining in general can be found in this CrossValidated question thread.
Source: https://stackoverflow.com/questions/19851655/how-to-scrape-web-content-and-then-count-frequencies-of-words-in-r