Is there anything out there to convert html to plain text (maybe a nokogiri script)? Something that would keep the line breaks, but that\'s about it.
If I write som
require 'open-uri'
require 'nokogiri'
url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(open(url))
text = ''
doc.css('p,h1').each do |e|
text << e.content
end
puts text
This extracts just the desired text from a webpage (most of the time). If for example you wanted to also include links then add a to the css classes in the block.