Scraping text without JavaScript code using Scrapy

野性不改 2020-12-18 07:03

I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the

3 answers
  •  北海茫月
    2020-12-18 07:41

    You can try this XPath expression:

    hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
    

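    That is, it selects every text node that is a direct child of //td[@id='contenuStory'] or of any of its descendant elements, except for elements that are script nodes. Text inside a script element is a child of that script element, so it is skipped.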

    To add space between the text nodes you can use something like:

    u' '.join(
        hxs.select(
            '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
    )
    
