Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write som
Actually, this is much simpler:
require 'rubygems'  # only needed on Ruby 1.8
require 'nokogiri'
# my_html is assumed to be a String holding the HTML source
puts Nokogiri::HTML(my_html).text
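You still have line break issues, though: .text just concatenates the text nodes, so <br> and paragraph boundaries are not turned into newlines. You're going to have to figure out how you want to handle those yourself.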
You want hpricot_scrub:
http://github.com/UnderpantsGnome/hpricot_scrub
You can specify which tags to strip / keep in a config hash.
I'm using the sanitize gem.
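# Doubles the newlines between blocks and turns a leading space on each line into a tab.
# Note: newer versions of the gem expose this as Sanitize.fragment instead of Sanitize.clean.
(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")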
It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.