Getting viewable text words via Nokogiri

粉色の甜心 2020-12-15 13:32

I'd like to open a web page with Nokogiri and extract all the words that a user sees when they visit the page in a browser and analyze the word frequency.

What is the best way to do this?

3 Answers
  • 2020-12-15 13:37

    If you really want to do this with Nokogiri (otherwise you could just strip the tags with a regex), then you should (a combined sketch follows the list):

    1. Parse the page: doc = Nokogiri::HTML(open('url').read) # needs open-uri
    2. Strip all script and style tags with something like doc.search('script', 'style').each { |el| el.unlink }
    3. Get the remaining text with doc.text
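
    Putting those steps together, a minimal runnable sketch (assuming Ruby 2.5+ for URI.open and 2.7+ for tally; the URL is the question page used in the other answers):

    require 'nokogiri'
    require 'open-uri'

    # 1. Parse the page
    doc = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357').read)

    # 2. Remove content a browser never renders as visible text
    doc.search('script', 'style').each(&:unlink)

    # 3. Extract the remaining text and count word frequencies
    words = doc.text.scan(/\w+/).map(&:downcase)
    p words.tally.sort_by { |_, count| -count }.first(10)
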
  • 2020-12-15 13:51

    Update: since Ruby 2.7 there is a new Enumerable method, tally, for counting occurrences.

    Bug in the chosen answer: html.at('body').inner_text joins the text of all nodes without spaces between them. For example, a document containing:

    <html><body><p>this</p><p>text</p></body></html>

    will result in "thistext"
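
    A quick sketch to check this behaviour (using the fragment above):

    require 'nokogiri'

    doc = Nokogiri::HTML('<html><body><p>this</p><p>text</p></body></html>')
    doc.at('body').inner_text                          #=> "thistext"
    doc.xpath('.//text()').map(&:inner_text).join(' ') #=> "this text"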

    Better: collect the text nodes individually and join them with spaces:

    require 'nokogiri'
    require 'open-uri'

    html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))
    text = html.xpath('.//text() | text()').map(&:inner_text).join(' ')
    occurrences = text.scan(/\w+/).map(&:downcase).tally
    
  • 2020-12-15 13:53

    You want the Nokogiri::XML::Node#inner_text method:

    require 'nokogiri'
    require 'open-uri'
    html = Nokogiri::HTML(open 'http://stackoverflow.com/questions/6129357') # on Ruby 3+, use URI.open instead of open
    
    # Alternatively
    html = Nokogiri::HTML(IO.read 'myfile.html')
    
    text  = html.at('body').inner_text
    
    # Pretend that all words we care about contain only a-z, 0-9, or underscores
    words = text.scan(/\w+/)
    p words.length, words.uniq.length, words.uniq.sort[0..8]
    #=> 907
    #=> 428
    #=> ["0", "1", "100", "15px", "2", "20", "2011", "220px", "24158nokogiri"]
    
    # How about words that are only letters?
    words = text.scan(/[a-z]+/i)
    p words.length, words.uniq.length, words.uniq.sort[0..5]
    #=> 872
    #=> 406
    #=> ["Answer", "Ask", "Badges", "Browse", "DocumentFragment", "Email"]
    
    # Find the most frequent words
    require 'pp'
    def frequencies(words)
      Hash[
        words.group_by(&:downcase).map{ |word,instances|
          [word,instances.length]
        }.sort_by(&:last).reverse
      ]
    end
    pp frequencies(words)
    #=> {"nokogiri"=>34,
    #=>  "a"=>27,
    #=>  "html"=>18,
    #=>  "function"=>17,
    #=>  "s"=>13,
    #=>  "var"=>13,
    #=>  "b"=>12,
    #=>  "c"=>11,
    #=>  ...
    
    # Hrm...let's drop the javascript code out of our words
    html.css('script').remove
    words = html.at('body').inner_text.scan(/\w+/)
    pp frequencies(words)
    #=> {"nokogiri"=>36,
    #=>  "words"=>18,
    #=>  "html"=>17,
    #=>  "text"=>13,
    #=>  "with"=>12,
    #=>  "a"=>12,
    #=>  "the"=>11,
    #=>  "and"=>11,
    #=>  ...
    
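    If the goal is strictly the text a user sees, a possible extension (not part of the original answer) is to also drop style and noscript content before counting, reusing the frequencies helper above:

    html.css('script, style, noscript').remove
    words = html.at('body').inner_text.scan(/[a-z]+/i)
    pp frequencies(words)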