HTML to Plain Text with Ruby?

无人共我 2020-12-15 18:03

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write som

9 Answers
  • 2020-12-15 18:20

    Is simply stripping tags and excess line breaks acceptable?

    html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
    

    The first gsub strips tags, the second collapses consecutive line breaks into one, and the third removes line breaks at the start and end of the string.
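Since the question asks to keep line breaks, here is a minimal sketch of the same gsub chain with two extra steps in front that turn `<br>` and closing block tags into newlines before the tags are stripped. The helper name `html_to_text` and the list of block tags are assumptions for illustration, not part of the answer above. It does not decode entities such as `&amp;`, and regex-based stripping will misbehave on malformed markup.

```ruby
# Hypothetical helper (not from the answer above): turn <br> and closing
# block tags into newlines before stripping, so paragraph breaks survive.
def html_to_text(html)
  html
    .gsub(/<br\s*\/?>/i, "\n")             # <br>, <br/>, <br /> become newlines
    .gsub(/<\/(p|div|h[1-6]|li)>/i, "\n")  # closing block tags become newlines
    .gsub(/<\/?[^>]*>/, '')                # strip every remaining tag
    .gsub(/\n\n+/, "\n")                   # collapse runs of line breaks to one
    .gsub(/\A\n+|\n+\z/, '')               # trim line breaks at both ends
end

sample = "<div><p>Hello <b>world</b></p>\n\n<p>Second<br>line</p></div>"
puts html_to_text(sample)
```

For the sample string this prints three lines: "Hello world", "Second", and "line".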

  • 2020-12-15 18:22

    If it's in Rails, you may use this:

    html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
    
  • 2020-12-15 18:24

    If you are using Rails, you can use the full sanitizer:

    html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
    puts ActionView::Base.full_sanitizer.sanitize(html)

  • 2020-12-15 18:24

    Building slightly on Matchu's answer, this worked for my (very similar) requirements:

    html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish
    

    Hope it makes someone's life a bit easier :-)
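Note that `squish` comes from ActiveSupport, so the one-liner above needs Rails (or `require 'active_support/core_ext/string'`). A rough plain-Ruby stand-in might look like the sketch below; the names `squish_plain` and `html_to_one_line` are made up for illustration. Because squishing collapses every whitespace run, newlines included, the result is a single line, so this variant does not preserve line breaks.

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish, for plain Ruby:
# collapse every run of whitespace to one space, then trim both ends.
def squish_plain(str)
  str.gsub(/[[:space:]]+/, ' ').strip
end

# Replace each tag with a space so adjacent words do not run together,
# then squish the result down to a single line.
def html_to_one_line(html)
  squish_plain(html.gsub(/<\/?[^>]*>/, ' '))
end

puts html_to_one_line("<p>Hello</p>\n\n<p>big   <b>world</b></p>")
```

For the sample string this prints "Hello big world".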

  • 2020-12-15 18:28

    You could start with something like this:

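    require 'open-uri'
    require 'nokogiri'

    uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
    # URI.open replaces the bare open(uri), which no longer fetches URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(uri))
    # Drop nodes whose content is not visible page text
    doc.css('script, link').each { |node| node.remove }
    # Squeeze runs of spaces and runs of newlines down to one each
    puts doc.css('body').text.squeeze(" \n")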
    
  • 2020-12-15 18:36
    require 'open-uri'
    require 'nokogiri'
    
    url = 'http://en.wikipedia.org/wiki/Wolfram_language'
    doc = Nokogiri::HTML(URI.open(url))
    
    text = ''
    doc.css('p,h1').each do |e|
      text << e.content
    end
    
    puts text
    

    This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include links, add a to the CSS selector passed to doc.css (for example 'p,h1,a').
