Question
This is my code:
library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
traverse_each_page <- function(x){
tmp <- htmlParse(x)
xpathApply(tmp, '//div[@id="mainContent"]')
}
pages <- lapply(titles[2:3], traverse_each_page)
Here is the pseudocode:
1. Take an XML document: http://www.jamesaltucher.com/sitemap.xml
2. Go to each link.
3. Parse the HTML content of each link.
4. Extract the text inside div id="mainContent".
5. Count the frequencies of each word that appears across all the articles, case-insensitive.
I have managed to complete steps 1-4. I need some help with no. 5.
Basically, if the word "the" appears twice in article 1 and five times in article 2, I want to know that "the" appears a total of seven times across the 2 articles.
Also, I do not know how to view the contents I have extracted into pages. I want to learn how to view the contents, which will make it easier for me to debug.
Answer 1:
Here you go, start to finish. I changed your web-scraping code so it picks up less non-text stuff, and the word counts are down at the bottom.
Here's your code for downloading the URLs...
library(XML)
library(RCurl)
url.link <- 'http://www.jamesaltucher.com/sitemap.xml'
blog <- getURL(url.link)
blog <- htmlParse(blog, encoding = "UTF-8")
titles <- xpathSApply(blog, "//loc", xmlValue)  ## titles
I've changed your function to extract the text from each page...
traverse_each_page <- function(x){
tmp <- htmlParse(getURI(x))
xpathSApply(tmp, '//div[@id="mainContent"]', xmlValue)
}
pages <- sapply(titles[2:3], traverse_each_page)
Let's remove newline and other non-text characters...
nont <- c("\n", "\t", "\r")
pages <- gsub(paste(nont,collapse="|"), " ", pages)
Regarding your second question, to inspect the contents in pages, just type it at the console:
pages
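If the articles are long, printing the whole object floods the console. A couple of quick base-R alternatives for eyeballing what you scraped (my suggestions, not part of the original answer):
str(pages)                       # type, length and a short preview of each element
nchar(pages)                     # how many characters were extracted per article
cat(substr(pages[1], 1, 300))    # the first 300 characters of the first article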
Now let's do your step 5: 'Count the frequencies of each word that appears across all the articles, case-insensitive.'
require(tm)
# convert list into corpus
mycorpus <- Corpus(VectorSource(pages))
# prepare to remove stopwords, ie. common words like 'the'
skipWords <- function(x) removeWords(x, stopwords("english"))
# prepare to remove other bits we usually don't care about
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
# do it
a <- tm_map(mycorpus, FUN = tm_reduce, tmFuns = funcs)
# make document term matrix
mydtm <- DocumentTermMatrix(a, control = list(wordLengths = c(3,10)))
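Before inspecting it, it can help to check how big the matrix is and peek at the terms it kept (a quick check I've added, using tm's Terms() accessor):
dim(mydtm)           # documents x terms
head(Terms(mydtm))   # a few of the terms that survived the cleaning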
Here's where you see the count of each word per document...
inspect(mydtm)
# you can assign it to a data frame for more convenient viewing
my_df <- inspect(mydtm)
my_df
Here's how you count the total frequency of each word across all the articles, case-insensitive...
apply(mydtm, 2, sum)
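If you mainly care about the most frequent words overall, you can sort those totals; the sorting step is my addition on top of the line above:
word_totals <- apply(mydtm, 2, sum)             # total count of each word across the articles
head(sort(word_totals, decreasing = TRUE), 20)  # the 20 most frequent words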
Does that answer your question? I guess you're probably really only interested in the most frequent words (as @buruzaemon's answer details), or a certain subset of words, but that's another question...
Answer 2:
You should have a look at the R tm package; its vignette is a good place to start. tm has everything you need for working with corpora and building term-document matrices.
In principle, this would entail:
- Create a Corpus from your data, possibly discarding punctuation, numbers and stopwords
- Create a TermDocumentMatrix from your Corpus or corpora, depending on what you are doing
- Use as.matrix on your TermDocumentMatrix, and do rowSums on the raw matrix to obtain word counts
Here is a brief code snippet as an example:
library(tm)
ctrl <- list(stopwords=T,
removePunctuation=T,
removeNumbers=T)
# assuming your data is in some dataframe already...
corpus1 <- Corpus(DataframeSource(...))
corpus2 <- Corpus(DataframeSource(...))
corp.all <- c(corpus1, corpus2)
tdm <- TermDocumentMatrix(corp.all, ctrl)
tdm.m <- as.matrix(tdm)
counts <- rowSums(tdm.m)
my.model <- data.frame(cbind(names(counts), as.numeric(counts)),
stringsAsFactors=F)
names(my.model) <- c('term', 'frequency')
my.model$frequency <- as.numeric(my.model$frequency)
head(my.model[with(my.model, order(-frequency)),], 20)
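To get the total for one specific word across all the articles, you can look it up in my.model; bear in mind that with stopwords=T in ctrl, very common words such as "the" from the original question have already been dropped, so either pick a content word or leave the stopwords option out. A small sketch, with a placeholder term:
subset(my.model, term == "blog")   # replace "blog" with whatever word you are counting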
I think that you might want to discard stop words like "the", since they tend to be very frequent and thus not very important. More detailed information on text mining in general can be found in this CrossValidated question thread.
Source: https://stackoverflow.com/questions/19851655/how-to-scrape-web-content-and-then-count-frequencies-of-words-in-r