How can I get the first element's text using Nokogiri?

Submitted by ぐ巨炮叔叔 on 2019-12-13 17:17:07

Question


I am trying to get the text for Last sold date from this HTML:

<td class="browse-cell-date">

    <span title="Last sold date">
        May 2002 
    </span>

    <button class="btn btn-previous-sales js-btn-previous-sales">
        Previous sales (1) <i class="icon icon-down-open-1"/>
    </button>

    <div class="previous-sales-panel is-hidden">
        <span style="display: block;">
            Aug 1997
            <span class="fright">£60,000</span>
        </span>
    </div>

</td>

I tried:

    date = val.search(".//td[@class='browse-cell-date']").children[1]

It gave me the span I wanted, but after adding .text to it, nothing was returned.


Answer 1:


I'd start with:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
    <td class="browse-cell-date">

        <span title="Last sold date">
            May 2002 
        </span>

        <button class="btn btn-previous-sales js-btn-previous-sales">
            Previous sales (1) <i class="icon icon-down-open-1"/>
        </button>

        <div class="previous-sales-panel is-hidden">
            <span style="display: block;">
                Aug 1997
                <span class="fright">£60,000</span>
            </span>
        </div>

    </td>
EOT

sold_date = doc.at('span[title="Last sold date"]') # => #<Nokogiri::XML::Element:0x3ffc7e84c35c name="span" attributes=[#<Nokogiri::XML::Attr:0x3ffc7e84c2f8 name="title" value="Last sold date">] children=[#<Nokogiri::XML::Text:0x3ffc7e82bc10 "\n            May 2002 \n        ">]>
sold_date.text # => "\n            May 2002 \n        "
sold_date.text.strip # => "May 2002"

So

doc.at('span[title="Last sold date"]').text.strip # => "May 2002"

will do it.

at is equivalent to search('some selector').first, so use it for convenience. Both at and search can work out whether the selector is CSS or XPath most of the time, so I use those by default. If Nokogiri is fooled, I fall back to the explicit variants: at_css or at_xpath for a single node, css or xpath for a NodeSet.

Alternatively, you could use:

doc.at('td.browse-cell-date span').text.strip # => "May 2002"
doc.at('td.browse-cell-date > span').text.strip # => "May 2002"

Note: Calling text on the result of search, xpath or css usually isn't a good idea. Those methods return a NodeSet, and NodeSet's text method concatenates the text of every matched node into a single string, which is rarely what you expect. Consider this example:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
    <body>
        <p>foo</p>
        <p>bar</p>
    </body>
</html>
EOT

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"

We regularly see questions where people have done this and then need to figure out how to split the concatenated text back into something useful, which is usually very difficult.

99.99% of the time, you want to use map(&:text) to extract the text from a NodeSet:

doc.search('p').map(&:text) # => ["foo", "bar"]

But in your case, simply use at, which returns a single Node, and then text will do what you expect.




Answer 2:


Try this:

page.search(".//td").children[1].attr("title")


Source: https://stackoverflow.com/questions/39454863/how-can-i-get-the-first-elements-text-using-nokogiri
