HTML to Plain Text with Ruby?

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.

If I write som

9 Answers

忘掉有多难

2020-12-15 18:20
Is simply stripping tags and excess line breaks acceptable?
```
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
```
First strips tags, second takes duplicate line breaks down to one, third removes line breaks at the start and end of the string.
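Since the question asks to keep line breaks, here is a minimal sketch that extends the same regex approach using only the standard library. The helper name `html_to_text` and the list of block-level tags in `BLOCK_CLOSERS` are assumptions made for illustration, not part of the answer above, and a regex is not a real HTML parser, so malformed markup or `<script>` content is not handled.

```ruby
require 'cgi'

# Hypothetical list of closing block tags that should end a line.
BLOCK_CLOSERS = %r{</(?:p|div|h[1-6]|li|tr)\s*>}i

# Hypothetical helper: HTML fragment in, plain text with line breaks out.
def html_to_text(html)
  text = html.gsub(/<br\s*\/?>/i, "\n")   # <br> becomes a newline
  text = text.gsub(BLOCK_CLOSERS, "\n")   # closing block tags end a line
  text = text.gsub(/<\/?[^>]*>/, '')      # strip every remaining tag
  text = CGI.unescapeHTML(text)           # decode entities such as &amp; and &lt;
  text = text.gsub(/[ \t]+/, ' ')         # collapse runs of spaces and tabs
  text = text.gsub(/ ?\n ?/, "\n")        # drop spaces around newlines
  text.gsub(/\n{2,}/, "\n").strip         # collapse blank lines, trim both ends
end

puts html_to_text('<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>')
```

For the sample fragment this prints three lines: "hello world", "Hola" and "que tal".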
别那么骄傲

2020-12-15 18:22
If it's in Rails, you may use this:
```
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
```
感动是毒

2020-12-15 18:24

If you are using Rails, you can use the full sanitizer:

    html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
    puts ActionView::Base.full_sanitizer.sanitize(html)

猫巷女王i

2020-12-15 18:24
Building slightly on Matchu's answer, this worked for my (very similar) requirements:
```
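# "\n" must be double-quoted; '\n' would insert a literal backslash and n.
# String#squish comes from ActiveSupport (Rails).
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish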
```
Hope it makes someone's life a bit easier :-)
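Outside Rails, `squish` is not available on `String`. The sketch below is a plain-Ruby stand-in, assuming that collapsing every whitespace run to a single space and trimming both ends is all that is needed; the method name `squish` here is a hypothetical local helper, not the ActiveSupport implementation. Because it collapses all whitespace, the result is a single line, so line breaks are not kept.

```ruby
# Hypothetical stand-in for ActiveSupport's String#squish.
def squish(str)
  # Collapse any run of whitespace (spaces, tabs, newlines) to one space,
  # then trim leading and trailing whitespace.
  str.gsub(/[[:space:]]+/, ' ').strip
end

html = "<p>Hello,\n\n  <b>world</b></p>\n"
# Replace tags with a space so adjacent words do not run together.
puts squish(html.gsub(/<\/?[^>]*>/, ' '))
```

For the sample string this prints "Hello, world".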

忘了有多久

2020-12-15 18:28

You could start with something like this:

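    require 'open-uri'
    require 'nokogiri'

    uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
    doc = Nokogiri::HTML(URI.open(uri))
    # Drop nodes whose content is not page text
    doc.css('script, link').each { |node| node.remove }
    # Squeeze runs of spaces and runs of newlines down to one each
    puts doc.css('body').text.squeeze(" \n")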


情深已故

2020-12-15 18:36
```
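require 'open-uri'
require 'nokogiri'

url = 'http://en.wikipedia.org/wiki/Wolfram_language'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))

# Collect the text of every paragraph and h1 heading, one per line
text = ''
doc.css('p,h1').each do |e|
  text << e.content << "\n"
end

puts text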
```
This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include links, add a to the CSS selector in the doc.css call (for example 'p,h1,a').