HTML to Plain Text with Ruby?

无人共我 2020-12-15 18:03

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write som

9 answers
  • 2020-12-15 18:38

    Actually, this is much simpler:

    require 'nokogiri'

    # my_html is assumed to hold the HTML source as a String.
    puts Nokogiri::HTML(my_html).text
    

    This still leaves the line breaks unsolved, though: text simply concatenates the text nodes, so you're going to have to decide how you want to handle those yourself.

  • 2020-12-15 18:40

    You want hpricot_scrub:

    http://github.com/UnderpantsGnome/hpricot_scrub

    You can specify which tags to strip / keep in a config hash.

  • 2020-12-15 18:41

    I'm using the sanitize gem.

    # Sanitize.clean is the pre-3.0 name; Sanitize 3.0 renamed it to Sanitize.fragment.
    (" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")
    

    It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.
