Scraping text without JavaScript code using Scrapy

野性不改 2020-12-18 07:03

I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the

3 answers
  • 2020-12-18 07:38

    Try the utility functions from w3lib.html:

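    from w3lib.html import remove_tags, remove_tags_with_content

    # extract() returns a list of HTML strings, one per matched node
    fragments = hxs.select('//div[@id="content"]').extract()
    # drop <script> elements and their contents, then strip the remaining tags
    output = [remove_tags(remove_tags_with_content(html, ('script',))) for html in fragments]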
    
  • 2020-12-18 07:41

    You can try this XPath expression:

    hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
    

    i.e., the text-node children of //td[@id='contenuStory'] and of all its descendant elements, skipping text that sits directly inside a script element.

    To add a space between the text nodes, you can use something like:

    u' '.join(
        hxs.select(
            '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
    )
    
  • 2020-12-18 07:55

    You can append the predicate [not(ancestor-or-self::script)] to your XPath expression.

    This excludes script content, and you can extend it to exclude other elements too. For instance, [not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)] skips scripts, noscript blocks, and any CSS that is not part of the text.

    Example:

    //article//p//text()[not (ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]
    