Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write som
Is simply stripping tags and excess line breaks acceptable?
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
The first gsub strips tags, the second collapses runs of line breaks down to one, and the third removes line breaks at the start and end of the string.
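If the line breaks implied by the markup itself should survive (the question asks to keep them), the same regex idea can be extended a little. This is only a rough sketch using the standard library: the name html_to_text and the list of block-level tags are my own assumptions, and regexes will still trip over comments, scripts, or malformed markup.

```ruby
require 'cgi'

# Hypothetical helper (not from any library): regex-based HTML to text
# that turns <br> and closing block tags into line breaks.
def html_to_text(html)
  text = html.gsub(/<br\s*\/?>/i, "\n")                     # <br> becomes a newline
  text = text.gsub(/<\/(?:p|div|h[1-6]|li|tr)\s*>/i, "\n")  # assumed set of block tags
  text = text.gsub(/<\/?[^>]*>/, '')                        # strip every remaining tag
  text = CGI.unescapeHTML(text)                             # decode &amp;, &lt;, &gt; ...
  text.gsub(/[ \t]+/, ' ')                                  # collapse runs of spaces/tabs
      .gsub(/ ?\n ?/, "\n")                                 # trim spaces around newlines
      .gsub(/\n{2,}/, "\n")                                 # collapse blank lines
      .strip
end

puts html_to_text('<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>')
```

For anything beyond simple fragments, a real parser such as Nokogiri is the safer choice.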
If it's in Rails, you may use this:
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
If you are using Rails, you can use the built-in full sanitizer:
html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)
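Building slightly on Matchu's answer, this worked for my (very similar) requirements. Note that squish comes from ActiveSupport and collapses all whitespace, so the result is a single line:
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish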
Hope it makes someone's life a bit easier :-)
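Outside Rails, where squish is not available, roughly the same single-line result can be had with plain Ruby. A minimal sketch, assuming losing the line breaks is acceptable; strip_tags_to_line is a made-up name, and unlike squish this only handles ASCII whitespace.

```ruby
# Hypothetical helper: replace tags with spaces, then collapse all
# whitespace (including newlines) into single spaces.
def strip_tags_to_line(html)
  html.gsub(/<\/?[^>]*>/, ' ')  # each tag becomes a space so words do not run together
      .gsub(/\s+/, ' ')         # collapse spaces, tabs and newlines
      .strip                    # drop leading and trailing whitespace
end

puts strip_tags_to_line("<p>Hello</p>\n\n<p>big   world</p>")
```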
You could start with something like this:
require 'open-uri'
require 'nokogiri'
uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open(uri))
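# Remove elements whose contents are not visible page text
doc.css('script, style, link').each { |node| node.remove }
# Collapse repeated spaces and repeated newlines
puts doc.css('body').text.squeeze(" \n")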
require 'open-uri'
require 'nokogiri'
url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer opens URLs on Ruby 3.0+
text = ''
# Collect the text of each paragraph and h1 heading, one per line
doc.css('p,h1').each do |e|
  text << e.content << "\n"
end
puts text
This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include links, add a to the CSS selector in the block, i.e. doc.css('p,h1,a').