How can I get the first element's text using Nokogiri?

Question

I am trying to get the text for Last sold date from this HTML:

<td class="browse-cell-date">

    <span title="Last sold date">
        May 2002 
    </span>

    <button class="btn btn-previous-sales js-btn-previous-sales">
        Previous sales (1) <i class="icon icon-down-open-1"/>
    </button>

    <div class="previous-sales-panel is-hidden">
        <span style="display: block;">
            Aug 1997
            <span class="fright">£60,000</span>
        </span>
    </div>

</td>

I tried:

    date = val.search(".//td[@class='browse-cell-date']").children[1]

It gave me the span I wanted, but after adding .text to it, nothing was returned.

Answer 1:

I'd start with:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
    <td class="browse-cell-date">

        <span title="Last sold date">
            May 2002 
        </span>

        <button class="btn btn-previous-sales js-btn-previous-sales">
            Previous sales (1) <i class="icon icon-down-open-1"/>
        </button>

        <div class="previous-sales-panel is-hidden">
            <span style="display: block;">
                Aug 1997
                <span class="fright">£60,000</span>
            </span>
        </div>

    </td>
EOT

sold_date = doc.at('span[title="Last sold date"]') # => #<Nokogiri::XML::Element:0x3ffc7e84c35c name="span" attributes=[#<Nokogiri::XML::Attr:0x3ffc7e84c2f8 name="title" value="Last sold date">] children=[#<Nokogiri::XML::Text:0x3ffc7e82bc10 "\n            May 2002 \n        ">]>
sold_date.text # => "\n            May 2002 \n        "
sold_date.text.strip # => "May 2002"

doc.at('span[title="Last sold date"]').text.strip # => "May 2002"

will do it.

at is equivalent to search('some selector').first, so use it for convenience. Both at and search are smart enough to figure out whether the selector is CSS or XPath most of the time, so I use those. If Nokogiri is fooled, I'll fall back to one of the explicit variants: at_css or at_xpath (or css and xpath when I want every match).

Alternately you could use:

doc.at('td.browse-cell-date span').text.strip # => "May 2002"
doc.at('td.browse-cell-date > span').text.strip # => "May 2002"

Note: Using text with any of the search, xpath or css methods isn't a good idea. Those methods return a NodeSet, which doesn't do what you expect when you use its text method. Consider these examples:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
    <body>
        <p>foo</p>
        <p>bar</p>
    </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"

We regularly see questions where people have done this and then need to figure out how to split the concatenated text into something useful, which usually is very difficult.

99.99% of the time, you want to use map(&:text) to extract the text from each node in a NodeSet:

doc.search('p').map(&:text) # => ["foo", "bar"]

But in your case, simply use at, which returns a single Node, so text will do what you expect.

Answer 2:

Try this:

page.search(".//td").children[1].attr("title")

Source: https://stackoverflow.com/questions/39454863/how-can-i-get-the-first-elements-text-using-nokogiri

Tags

ruby

nokogiri