XPath text with children

醉话见心 2021-01-23 07:16

Given this HTML:

2 Answers
  •  庸人自扰
    2021-01-23 07:52

    XPath generally cannot select what is not there. These things do not exist in your HTML:

    [
        'This is a link',
        'This is another link.'
    ]
    

    They might exist conceptually on the higher abstraction level that is the browser's rendering of the source code, but strictly speaking even there they are separate, for example in color and functionality.

    On the DOM level there are only separate text nodes and that's all XPath can pick up for you.
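To make this concrete, here is a minimal sketch that lists the separate text fragments inside each list item. It uses Python's standard-library `xml.etree.ElementTree` instead of Scrapy, and the `SAMPLE` markup is an assumption (the question's HTML is not shown above), chosen so that each `li` mixes plain text with a link.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an assumption standing in for the question's HTML.
SAMPLE = (
    '<ul>'
    '<li>This is <a href="#">a link</a></li>'
    '<li>This is <a href="#">another link</a>.</li>'
    '</ul>'
)

root = ET.fromstring(SAMPLE)

# itertext() yields each text fragment under an element in document order.
fragments = [list(li.itertext()) for li in root.findall('li')]
print(fragments)
# [['This is ', 'a link'], ['This is ', 'another link', '.']]
```

No single fragment holds the full sentence; each one has to be assembled from its pieces.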

    Therefore you have three options.

    1. Select the text() nodes and join their individual values in Python code.
    2. Select the <li> elements and, for each of them, evaluate string(.) or normalize-space(.) with Scrapy. normalize-space() handles whitespace the way you would expect: it trims both ends and collapses internal runs into single spaces.
    3. Select the <li> elements and access their .text property, which internally finds all descendant text nodes and joins them for you.

    Personally I would go for the latter with //ul/li as my basic XPath expression as this would result in a cleaner solution.
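As a rough illustration of joining the text nodes per element, the sketch below uses only the standard library rather than Scrapy. The `SAMPLE` markup and the helper names `joined_text` and `normalized_text` are assumptions made for this example; `normalized_text` only approximates what XPath's normalize-space(.) does.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (an assumption), with stray whitespace added on purpose.
SAMPLE = """
<ul>
    <li>This is <a href="#">a link</a></li>
    <li>
        This is <a href="#">another link</a>.
    </li>
</ul>
"""


def joined_text(element):
    """Concatenate all descendant text, similar to XPath string(.)."""
    return ''.join(element.itertext())


def normalized_text(element):
    """Trim and collapse whitespace runs, approximating normalize-space(.)."""
    return ' '.join(joined_text(element).split())


items = ET.fromstring(SAMPLE.strip()).findall('li')
raw = [joined_text(li) for li in items]
clean = [normalized_text(li) for li in items]
print(clean)
# ['This is a link', 'This is another link.']
```

The `raw` values keep the source indentation and line breaks, which is why the normalizing variant tends to be the more convenient one for scraped pages.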


    As @paul points out in the comments, Scrapy offers a nice fluent interface to do multiple processing steps in one line of code. The following code implements variant #2:

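    import scrapy

    # The HTML from the question goes between the triple quotes.
    selector = scrapy.Selector(text='''''')

    # Select each <li>, then reduce it to a single whitespace-normalized string.
    selector.css('ul > li').xpath('normalize-space()').extract()
    # --> [u'This is a link', u'This is another link.']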
    
