Getting viewable text words via Nokogiri

粉色の甜心 2020-12-15 13:32

I'd like to open a web page with Nokogiri, extract all the words that a user sees when they visit the page in a browser, and analyze the word frequency.

What is t

3 Answers
  •  佛祖请我去吃肉
    2020-12-15 13:51

    Update: since Ruby 2.7 there is a new Enumerable method, tally, that counts occurrences.

    Bug in the chosen answer: html.at('body').inner_text joins the text of all the nodes without inserting spaces between them. For example, a document containing:

    this

    text

    will result in "thistext"

    Better: collect the text nodes individually and join them with spaces, as in this answer:

    require 'nokogiri'
    require 'open-uri'

    html = Nokogiri::HTML(URI.open('http://stackoverflow.com/questions/6129357'))
    text = html.xpath('.//text() | text()').map(&:inner_text).join(' ')
    occurrences = text.scan(/\w+/).map(&:downcase).tally
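    The counting step can also be tried on a plain string, with no network access or Nokogiri involved. This is a minimal sketch: the method names word_frequencies, word_frequencies_legacy, and top_words, and the sample sentence, are made up for illustration. The second method shows one possible equivalent for Rubies older than 2.7, where tally is not available.

```ruby
# Split into \w+ tokens, lowercase them, and count with Enumerable#tally (Ruby 2.7+).
def word_frequencies(text)
  text.scan(/\w+/).map(&:downcase).tally
end

# Same result without tally, for Rubies older than 2.7.
def word_frequencies_legacy(text)
  text.scan(/\w+/).map(&:downcase).each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
end

# Most frequent words first; ties are broken alphabetically.
def top_words(counts, limit)
  counts.sort_by { |word, n| [-n, word] }.first(limit)
end

sample = 'The quick brown fox jumps over the lazy dog. The dog barks.'
counts = word_frequencies(sample)
p top_words(counts, 3)
```

    Note that \w+ splits on apostrophes and hyphens, so "don't" is counted as "don" and "t"; a different pattern may suit real pages better.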
    
