Scraping text without JavaScript code using Scrapy


野性不改 2020-12-18 07:03

I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the

3 answers

北海茫月 (original poster)

2020-12-18 07:41
You can try this XPath expression:
```
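# hxs is assumed to be an HtmlXPathSelector built from the response (older Scrapy API).
# In current Scrapy, response.xpath(...).getall() is the equivalent call.
hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()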
```
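That is, it selects every text node that is a direct child of //td[@id="contenuStory"] or of any of its descendant elements, skipping elements that are script nodes, so inline JavaScript is left out.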
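
To check what the expression keeps and drops without setting up a spider, here is a rough standard-library sketch that mimics its behavior. It is not how Scrapy evaluates XPath. The class `StoryTextExtractor`, the helper `extract_story_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collect text inside <td id="contenuStory">, skipping <script> bodies."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0       # > 0 while inside the target cell
        self.script_depth = 0   # > 0 while inside a <script> element
        self.fragments = []     # text nodes in document order

    def handle_starttag(self, tag, attrs):
        if self.td_depth:
            if tag == "td":
                self.td_depth += 1      # nested cell inside the target
            elif tag == "script":
                self.script_depth += 1
        elif tag == "td" and dict(attrs).get("id") == "contenuStory":
            self.td_depth = 1

    def handle_endtag(self, tag):
        if not self.td_depth:
            return
        if tag == "td":
            self.td_depth -= 1
        elif tag == "script" and self.script_depth:
            self.script_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the target cell and outside any script.
        if self.td_depth and not self.script_depth:
            self.fragments.append(data)


def extract_story_text(html):
    parser = StoryTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


# Hypothetical page fragment, not taken from the original question.
SAMPLE_HTML = (
    '<table><tr><td id="contenuStory">'
    '<p>First paragraph.</p>'
    '<script>var tracker = "ignored";</script>'
    '<p>Second <b>bold</b> paragraph.</p>'
    '</td><td>Sidebar</td></tr></table>'
)

print(extract_story_text(SAMPLE_HTML))
```

For this sample, the script body and the sidebar cell are dropped, and the remaining text comes back as separate fragments, one per text node, just as the XPath would return them.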

To add a space between the text nodes, you can use something like:
```
u' '.join(
    hxs.select(
        '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
)
```
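
Joining with a single space can still leave runs of whitespace, because many extracted text nodes are just newlines or indentation. A minimal sketch of one way to tidy that up, assuming the extracted fragments are plain strings (the helper name `join_text_nodes` and the sample list are hypothetical):

```python
def join_text_nodes(fragments):
    """Join extracted text nodes with spaces and collapse whitespace runs."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so whitespace-only nodes vanish after re-joining.
    return " ".join(" ".join(fragments).split())


# Hypothetical result of .extract() on a small page.
fragments = ["\n  First paragraph.", "\n", "Second ", "bold", " paragraph.\n"]
print(join_text_nodes(fragments))  # First paragraph. Second bold paragraph.
```

This flattens paragraph breaks as well, so it only suits cases where the text is wanted as one continuous string.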