Nokogiri grab text with formatting and link tags, ,, , etc

问题

How can I recursively capture all the text with formatting tags using Nokogiri?

<div id="1">
  This is text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

For example, I would like to capture:

"This is text in the TD with <strong> strong </strong> tags" 

"This is a child node. with <b> bold </b> tags"

"another line of text to a <a href="link.html"> link </a>"

"This is text inside a div <em>inside<em> another div inside a paragraph tag"

I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.

ADDED DETAIL: Sanitize looks like an interesting gem, I'm reading it now. However, have some added info that might clarify what I need to do.

I need to traverse each node, get the text, process it and put it back. therefore I would grab the text from , "This is text in the TD with strong tags", modify it to something like, "This is the modified text in the TD with strong tags. Then goto the next tag from div 1 get the

text. "This is a child node. with bold tags" modify it "This is a modified child node. with bold tags." and put it back. Goto the next div#2 and grab the text, "another line of text to a link ", modify it, "another line of modified text to a link ", and put it back and goto the next node, Div#2 and grab text from the paragraph tag. "This is modified text inside a div inside another div inside a paragraph tag"

so after everything is processed the new html should be look like this...

<div id="1">
  This is modified text in the TD with <strong> strong </strong> tags
  <p>This is a modified child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of modified text to a <a href="link.html"> link </a>"
      <p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

My quasi-code,but I'm really stuck on the two parts, grabbing just the text with formatting (which sanitize helps with), but sanitize grabs all tags. I need to preserve formatting of just the text with formatting, including spaces, etc. However, not grab the unrelated tag children. And two, traversing down all the children related directly with full text tags.

#Quasi-code
doc = Nokogiri.HTML(html)
kids=doc.at('div#1')
text_kids=kids.descendant_elements
text.kids.each do |i|
   #grab full text(full sentence and paragraphs) with formating tags
   #currently, I have not way to grab just the text with formatting and not the other tags
   modified_text=processing_code(i.full_text_w_formating())
   i.full_text_w_formating=modified_text
end

def processing_code(string)
#code to process string (not relevant for this example)
  return modified_string
end


# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
  #This is flawed because it grabs every child and even 
  #splits it based on any tag.
  # I need to traverse down only the text related children.
  element_children.map{ |kid|
     [kid, kid.descendant_elements]
  }.flatten
  end
 end

回答1:

I'd use two tactics, Nokogiri to extract the content you want, then a blacklist/whitelist program to strip tags you don't want or keep the ones you want.

require 'nokogiri'
require 'sanitize'

html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'

doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html

will capture the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.

Passing html_fragment to the Sanitize gem:

doc = Sanitize.clean(
  html_fragment,
  :elements   => %w[ a b em strong ],
  :attributes => {
    'a'    => %w[ href ],
  },
)

The returned text looks like:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags 

      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em> 

</strong></strong>

Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.

来源：https://stackoverflow.com/questions/14224594/nokogiri-grab-text-with-formatting-and-link-tags-em-strong-a-etc

标签

ruby

recursion

nokogiri

Nokogiri grab text with formatting and link tags, <em>,<strong>, <a>, etc

问题

回答1: