:has CSS pseudo class in Nokogiri

Posted by 一个人想着一个人 on 2019-12-11 09:34:51

Question


I'm looking for the pseudoclass :has in Nokogiri. It should work just like jQuery's has selector.

For example:

<li><h1><a href="dfd">ex1</a></h1><span class="string">sdfsdf</span></li>
<li><h1><a href="dsfsdf">ex2</a></h1><span class="string"></span></li>
<li><h1><a href="sdfd">ex3</a></h1></li>

The CSS selector should return only the first link, the one whose li also contains a non-empty span.string.

In jQuery this selector works well:

$('li:has(span.string:not(:empty))>h1>a')

but not in Nokogiri:

Nokogiri::HTML(html_source).css('li:has(span.string:not(:empty))>h1>a')

:not and :empty work well, but :has does not.


  1. Is there any documentation for CSS selectors in Nokogiri?
  2. Maybe someone can write a custom :has pseudo class? Here is an example of how to write a :regexp selector.
  3. Optionally I can use XPath. How do I write XPath for li:has(span.string:not(:empty))>h1>a?

Answer 1:


The problem with Nokogiri's implementation of :has() (at the time of writing) is that it generates XPath that requires the contents to be a direct child, not any descendant:

puts Nokogiri::CSS.xpath_for( "a:has(b)" )
#=> "//a[b]"
#=> Should output "//a[.//b]" to be correct

To make this XPath match what jQuery does, the inner element must be allowed to be any descendant, not only a direct child. For example:

require 'nokogiri'
d = Nokogiri.XML('<r><a/><a><b><c/></b></a></r>')
d.at_css('a:has(b)')    #=> #<Nokogiri::XML::Element:0x14dd608 name="a" children=[#<Nokogiri::XML::Element:0x14dd3e0 name="b" children=[#<Nokogiri::XML::Element:0x14dd20c name="c">]>]>
d.at_css('a:has(c)')    #=> nil
d.at_xpath('//a[.//c]') #=> #<Nokogiri::XML::Element:0x14dd608 name="a" children=[#<Nokogiri::XML::Element:0x14dd3e0 name="b" children=[#<Nokogiri::XML::Element:0x14dd20c name="c">]>]>

For your specific case, here's the full "broken" XPath:

puts Nokogiri::CSS.xpath_for( "li:has(span.string:not(:empty)) > h1 > a" )
#=> //li[span[contains(concat(' ', @class, ' '), ' string ') and not(not(node()))]]/h1/a

And here it is fixed:

# Adding just the .//
//li[.//span[contains(concat(' ', @class, ' '), ' string ') and not(not(node()))]]/h1/a

# Simplified to assume only one CSS class is present on the span
//li[.//span[@class='string' and not(not(node()))]]/h1/a

# Assuming that `not(:empty)` really meant "Has some text in it"
//li[.//span[@class='string' and text()]]/h1/a

# ..or maybe you really wanted "Has some text anywhere underneath"
//li[.//span[@class='string' and .//text()]]/h1/a

# ..or maybe you really wanted "Has at least one element child"
//li[.//span[@class='string' and *]]/h1/a



Answer 2:


Nokogiri does not have a working :has selector. Here is the documentation on what it does support: http://ruby.bastardsbook.com/chapters/html-parsing/#h-2-2




Answer 3:


Ok, I found a solution that maybe will be useful for someone.

Custom pseudoclass :custom_has:

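class MyCustomSelectors
  # Keep only the nodes that have at least one descendant matching selector.
  # (empty? is plain Ruby; present? would require ActiveSupport.)
  def custom_has(node_set, selector)
    node_set.find_all { |node| !node.css(selector).empty? }
  end
end

# usage:
doc.css('li:custom_has(span.string:not(:empty))>h1>a', MyCustomSelectors.new)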

Why did I declare :custom_has and not just :has? Because :has is already declared. The Nokogiri repo contains tests for the :has selector, but they are not working. I reported this issue to the author.




Answer 4:


Nokogiri allows for chaining .css() and .xpath() calls on the same object. So any time you feel like using :has, just end the current .css() call and add .xpath(..) (the parent selector). You can even resume your selection with another .css() call starting where your xpath() left off!

Example:

Here's some HTML from Wikipedia:

<tr>
    <th scope="row" style="text-align:left;">
        Origin
    </th>
    <td>
        <a href="/wiki/Edinburgh" title="Edinburgh">Edinburgh</a>
        <a href="/wiki/Scotland" title="Scotland">Scotland</a>
    </td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">
        <a href="/wiki/Music_genre" title="Music genre">Genres</a>
    </th>
    <td>
        <a href="/wiki/Electronica" title="Electronica">Electronica</a>
        <a href="/wiki/Intelligent_dance_music" title="Intelligent dance music">IDM</a>
        <a href="/wiki/Ambient_music" title="Ambient music">ambient</a>
        <a href="/wiki/Downtempo" title="Downtempo">downtempo</a>
        <a href="/wiki/Trip_hop" title="Trip hop">trip hop</a>
    </td>
</tr>
<tr>
    <th scope="row" style="text-align:left;">
        <a href="/wiki/Record_label" title="Record label">Labels</a>
    </th>
    <td>
        <a href="/wiki/Warp_(record_label)" title="Warp (record label)">Warp</a>
        <a href="/wiki/Skam_Records" title="Skam Records">Skam</a>
        <a href="/wiki/Music70" title="Music70">Music70</a>
    </td>
</tr>

Say you want to select all of the <a> elements inside the first <td> that comes after the <th> containing the link with href="/wiki/Music_genre".

@artistPage.css("table th > a[href='/wiki/Music_genre']").xpath("..").css("+ td a")

This will return all of the <a> elements, one for each genre listing.

Now, for good measure, let's grab the inner text of all those <a> elements and put it in an array.

@genreLinks = @artistPage.css("table th > a[href='/wiki/Music_genre']").xpath("..").css("+ td a")
# Collect the text of every genre link into an array
@genres = @genreLinks.map(&:text)


Source: https://stackoverflow.com/questions/11760171/has-css-pseudo-class-in-nokogiri
