I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the pages they crawl.
Try the utils functions from w3lib.html:

from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings, so join the fragments before cleaning;
# remove_tags_with_content drops <script> blocks together with their contents,
# then remove_tags strips the remaining markup.
raw = ''.join(hxs.select('//div[@id="content"]').extract())
output = remove_tags(remove_tags_with_content(raw, ('script',)))
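
For context, here is a minimal sketch of how this could sit inside a Scrapy callback; the spider name, start URL, and the "content" div id are placeholders I'm assuming, and it uses the current response.xpath()/getall() API rather than HtmlXPathSelector:

import scrapy
from w3lib.html import remove_tags, remove_tags_with_content

class TextSpider(scrapy.Spider):
    name = 'text_spider'                          # hypothetical name
    start_urls = ['http://example.com/article']   # hypothetical URL

    def parse(self, response):
        # Strip <script> blocks (tag plus contents), then the remaining tags,
        # leaving only the text of the content div.
        raw = ''.join(response.xpath('//div[@id="content"]').getall())
        yield {'text': remove_tags(remove_tags_with_content(raw, ('script',)))}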
You can try this XPath expression:
hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
i.e., the child text nodes of all descendants (and self) of //td[@id='contenuStory'] that are not script nodes.
To add a space between the text nodes, you can use something like:

u' '.join(
    hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
)
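
For reference, the same idea written against a current Scrapy response object instead of an HtmlXPathSelector; this is only a sketch and assumes the same contenuStory table cell:

parts = response.xpath(
    '//td[@id="contenuStory"]'
    '/descendant-or-self::*[not(self::script)]/text()'
).getall()
# Collapse runs of whitespace so the joined text reads cleanly.
text = ' '.join(' '.join(parts).split())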
You can append the predicate [not(ancestor-or-self::script)] after your XPath expression. This filters out script content, and you can extend it to exclude other non-text elements as well, e.g. [not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)], which skips any script, noscript, or style (CSS) nodes that are not part of the text.
Example:
//article//p//text()[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]
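
As a short sketch of how that predicate might be used inside a spider (the spider name and start URL are placeholders, and //article//p is taken from the example above):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article_text'                         # hypothetical name
    start_urls = ['http://example.com/article']   # hypothetical URL

    def parse(self, response):
        texts = response.xpath(
            '//article//p//text()'
            '[not(ancestor-or-self::script'
            ' or ancestor-or-self::noscript'
            ' or ancestor-or-self::style)]'
        ).getall()
        # Drop whitespace-only nodes and join the rest into one string.
        yield {'text': ' '.join(t.strip() for t in texts if t.strip())}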