Getting viewable text words via Nokogiri

Submitted by 妖精的绣舞 on 2019-12-18 02:54:25

Question


I'd like to open a web page with Nokogiri, extract all the words that a user sees when they visit the page in a browser, and analyze the word frequency.

What is the easiest way of getting all readable words out of an HTML document with Nokogiri? The ideal code snippet would take an HTML page (as a file, say) and return an array of the individual words from every type of element that contains readable text.

(No need to worry about JavaScript or CSS hiding elements and thus hiding words; all words intended for display are fine.)


Answer 1:


You want the Nokogiri::XML::Node#inner_text method:

require 'nokogiri'
require 'open-uri'
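# Kernel#open no longer accepts URLs on Ruby 3.0+; open-uri provides URI.open
html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))

# Alternatively, parse a local file
html = Nokogiri::HTML(File.read('myfile.html'))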

text  = html.at('body').inner_text

# Pretend that all words we care about contain only ASCII letters, digits, or underscores
words = text.scan(/\w+/)
p words.length, words.uniq.length, words.uniq.sort[0..8]
#=> 907
#=> 428
#=> ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]

# How about words that are only letters?
words = text.scan(/[a-z]+/i)
p words.length, words.uniq.length, words.uniq.sort[0..5]
#=> 872
#=> 406
#=> ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]

# Find the most frequent words
require 'pp'
def frequencies(words)
  Hash[
    words.group_by(&:downcase).map{ |word,instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
pp frequencies(words)
#=> {"nokogiri"=>34,
#=>  "a"=>27,
#=>  "html"=>18,
#=>  "function"=>17,
#=>  "s"=>13,
#=>  "var"=>13,
#=>  "b"=>12,
#=>  "c"=>11,
#=>  ...

# Hrm... let's drop the JavaScript code out of our words
html.css('script').remove
words = html.at('body').inner_text.scan(/\w+/)
pp frequencies(words)
#=> {"nokogiri"=>36,
#=>  "words"=>18,
#=>  "html"=>17,
#=>  "text"=>13,
#=>  "with"=>12,
#=>  "a"=>12,
#=>  "the"=>11,
#=>  "and"=>11,
#=>  ...
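
The frequency count above can also be sketched more compactly. This is only an illustration, not part of the original answer: the helper name word_frequencies is hypothetical, it assumes Ruby 2.7 or later for Enumerable#tally, and it swaps in the Unicode-aware [[:alpha:]] class so that accented letters stay inside a word (in Ruby, \w and [a-z] match ASCII only). Ties are broken alphabetically so the ordering is deterministic.

```ruby
# Hypothetical helper, not from the original answer.
# Assumes Ruby 2.7+ (Enumerable#tally).
def word_frequencies(text)
  text.scan(/[[:alpha:]]+/)   # runs of letters, Unicode-aware
      .map(&:downcase)        # count "Nokogiri" and "nokogiri" together
      .tally                  # => { "word" => count, ... }
      .sort_by { |word, count| [-count, word] } # most frequent first, ties A-Z
      .to_h
end

sample = "Nokogiri parses HTML. nokogiri also parses XML, and café too."
freq = word_frequencies(sample)
p freq.first(2)
#=> [["nokogiri", 2], ["parses", 2]]
```

In the answer's code, the text passed in would be html.at('body').inner_text after the script elements have been removed.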



Answer 2:


If you really want to do this with Nokogiri (otherwise you could just strip the tags with a regex), then you should:

  1. Parse the page: doc = Nokogiri::HTML(URI.open('url').read) # requires open-uri
  2. Strip all script and style elements with something like doc.search('script, style').each { |el| el.unlink }
  3. Read the remaining text: doc.text


Source: https://stackoverflow.com/questions/6129357/getting-viewable-text-words-via-nokogiri
