nokogiri + mechanize css selector by text

问题

I am new to nokogiri and so far most familiar with CSS selectors, I am trying to parse information from a table, below is a sample of the table and the code I'm using, I'm stuck on the appropriate if statement, as it seems to return the whole contents of the table.

Table:

<div class="holder">
  <div class ="row">
   <div class="c1">
     <!-- Content I Don't need -->
   </div>
   <div class="c2">
    <span class="data">
     <!-- Content I Don't Need -->
    <span class="data">
   </div>
 </div>
 ...
 <div class="row">
  <div class="c1">
   SPECIFIC TEXT
  </div>
  <div class="c2">
   <span class="data">
    What I want
   </span>
  </div>
 </div>
</div>

My Script: (if SPECIFIC TEXT is found in the table it returns every "div.c2 span.data" variable - so I've either screwed up my knowledge of do loops or if statements)

data = []
page.agent.get(url)
page.search('div.row').each do |row_data|
 if (row_data.search('div.c1:contains("/SPECIFIC TEXT/")').text.strip
  temp = row_data.search('div.c2 span.data').text.strip
  data << temp
 end
end

回答1:

There's no need to stop and insert ruby logic when you can extract what you need in a single CSS selector.

data = page.search('div.row > div.c1:contains("SPECIFIC TEXT") + div.c2 span.data')

This will include only those that match the selector (e.g. follow the SPECIFIC TEXT).

Here's where your logic may have gone wrong:

This code

if (row_data.search('div.c1:contains("SPECIFIC TEXT")'...
  temp = row_data.search('div.c2 span.data')...

first searches the row for the specific text, then if it matches, returns ALL rows matching the second query, which has the same starting point. The key is the + in the CSS selector above which will return elements immediately following (e.g. the next sibling element). I'm making an assumption, of course, that the next element is always what you want.

回答2:

I'd do

require 'nokogiri'

html = <<_
<div class="holder">
  <div class ="row">
   <div class="c1">
     <!-- Content I Don't need -->
   </div>
   <div class="c2">
    <span class="data">
     <!-- Content I Don't Need -->
    <span class="data">
   </div>
 </div>
 <div class="row">
  <div class="c1">
   SPECIFIC TEXT
  </div>
  <div class="c2">
   <span class="data">
    What I want
   </span>
  </div>
 </div>
</div>
_

doc = Nokogiri::HTML(html)
css_string = 'div.row > div.c1[text()*="SPECIFIC TEXT"] + div.c2 span.data'
doc.at(css_string).text.strip
# => "What I want"

How those selectors would work here -

[name*="value"] - Selects elements that have the specified attribute with a value containing the a given substring.
Child Selector (“parent > child”) - Selects all direct child elements specified by "child" of elements specified by "parent".
Next Adjacent Selector (“prev + next”) - Selects all next elements matching "next" that are immediately preceded by a sibling "prev".
Class Selector (“.class”) - Selects all elements with the given class.
Descendant Selector (“ancestor descendant”) - Selects all elements that are descendants of a given ancestor.

来源：https://stackoverflow.com/questions/21856799/nokogiri-mechanize-css-selector-by-text

标签

ruby

parsing

css-selectors

nokogiri

mechanize