Text mining PDF files / issues with word frequencies

Submitted by 三世轮回 on 2020-01-04 03:59:04

Question


I am trying to mine a PDF of an article with rich PDF encodings and graphs. I noticed that when I mine some PDF documents, the highest-frequency words come out as phi, taeoe, toe, sigma, gamma, etc. It works well with some PDF documents, but with others I get these random Greek letters. Is this a problem with character encoding? (By the way, all the documents are in English.) Any suggestions?

# Here is the link to pdf file for testing
# www.sciencedirect.com/science/article/pii/S0164121212000532
library(tm)
uri <- c("2012.pdf")
if (all(file.exists(Sys.which(c("pdfinfo", "pdftotext"))))) {
  pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                   language = "en",
                                                   id = "id1")
  content(pdf)[1:4]
}


docs <- Corpus(URISource(uri, mode = ""),
               readerControl = list(reader = readPDF(engine = "ghostscript")))
summary(docs)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 

library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  

dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))   
length(freq)  
ord <- order(freq)
dtms <- removeSparseTerms(dtm, 0.1)
freq[head(ord)] 
freq[tail(ord)]
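
One way to check whether the stray tokens come from the extraction step itself, rather than from the later tm transformations, is to look at the raw text returned by the extraction engine before building the corpus. A minimal sketch, assuming 2012.pdf is in the working directory and pdftotext is installed:

library(tm)

# Extract the raw text with the pdftotext engine.
uri <- "2012.pdf"
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                                 language = "en",
                                                 id = "id1")
raw_text <- content(pdf)

# Count lines containing the suspicious Greek-letter tokens; a non-zero count
# here would point at the extraction step rather than at the tm pipeline.
sum(grepl("\\b(phi|sigma|gamma|taeoe|toe)\\b", tolower(raw_text)))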

Answer 1:


I think ghostscript is creating all the trouble here. Assuming that pdfinfo and pdftotext are properly installed, the following code works without generating the weird words you mentioned:

library(tm)
uri <- c("2012.pdf")
pdf <- readPDF(control = list(text = "-layout"))(elem = list(uri = uri),
                                               language = "en",
                                               id = "id1")
docs <- Corpus(VectorSource(pdf$content))
docs <- tm_map(docs, removeNumbers)  
docs <- tm_map(docs, tolower) 
docs <- tm_map(docs, removeWords, stopwords("english")) 
docs <- tm_map(docs, removePunctuation) 
library(SnowballC)   
docs <- tm_map(docs, stemDocument)  
docs <- tm_map(docs, stripWhitespace) 
docs <- tm_map(docs, PlainTextDocument)  
dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs) 
freq <- colSums(as.matrix(dtm))

We can visualize the most frequently used words in your PDF file with a word cloud:

library(wordcloud)
wordcloud(docs, max.words=80, random.order=FALSE, scale= c(3, 0.5), colors=brewer.pal(8,"Dark2"))

Obviously this result is not perfect, mostly because word stemming hardly ever achieves a 100% reliable result (e.g., we still have "issues" and "issue" as separate words, or "method" and "methods"). I am not aware of any infallible stemming algorithm in R, even though SnowballC does a reasonably good job.
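
If the leftover stem variants matter for the analysis, one partial remedy is tm's stemCompletion(), which maps each stem back to its most prevalent full form in an unstemmed reference corpus. A minimal sketch, assuming the pdf object from the code above is still available (stem completion can be slow on large corpora and may return NA for stems it cannot match):

library(tm)

# Use the unstemmed extracted text as the completion dictionary.
dictionary <- Corpus(VectorSource(pdf$content))

# Complete each stemmed term from the DTM back to its most prevalent full form.
stems <- Terms(dtm)
completed <- stemCompletion(stems, dictionary = dictionary, type = "prevalent")
head(completed, 10)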



Source: https://stackoverflow.com/questions/32478685/text-mining-pdf-files-issues-with-word-frequencies
