I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the pages they crawl.
Try the utils functions from w3lib.html:

from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of strings, so join the fragments before cleaning;
# remove_tags_with_content drops <script> blocks together with their contents,
# then remove_tags strips the remaining markup.
raw = ''.join(hxs.select('//div[@id="content"]').extract())
output = remove_tags(remove_tags_with_content(raw, ('script',)))
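
For context, here is a minimal sketch of how this could sit inside a Scrapy callback; the spider name, start URL, and the "content" div id are placeholders I'm assuming, and it uses the current response.xpath()/getall() API rather than HtmlXPathSelector:

import scrapy
from w3lib.html import remove_tags, remove_tags_with_content

class TextSpider(scrapy.Spider):
    name = 'text_spider'                          # hypothetical name
    start_urls = ['http://example.com/article']   # hypothetical URL

    def parse(self, response):
        # Strip <script> blocks (tag plus contents), then the remaining tags,
        # leaving only the text of the content div.
        raw = ''.join(response.xpath('//div[@id="content"]').getall())
        yield {'text': remove_tags(remove_tags_with_content(raw, ('script',)))}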
You can try this XPath expression:
hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
i.e., the child text nodes of all descendants (and self) of //td[@id='contenuStory'] that are not script nodes.
To add a space between the text nodes, you can use something like:

u' '.join(
    hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
)
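
For reference, the same idea written against a current Scrapy response object instead of an HtmlXPathSelector; this is only a sketch and assumes the same contenuStory table cell:

parts = response.xpath(
    '//td[@id="contenuStory"]'
    '/descendant-or-self::*[not(self::script)]/text()'
).getall()
# Collapse runs of whitespace so the joined text reads cleanly.
text = ' '.join(' '.join(parts).split())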
You can append the predicate [not(ancestor-or-self::script)] after your XPath expression. This filters out script content, and you can extend it to exclude other non-text elements as well, e.g. [not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)], which skips any script, noscript, or style (CSS) nodes that are not part of the text.
Example:
//article//p//text()[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]
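
As a short sketch of how that predicate might be used inside a spider (the spider name and start URL are placeholders, and //article//p is taken from the example above):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article_text'                         # hypothetical name
    start_urls = ['http://example.com/article']   # hypothetical URL

    def parse(self, response):
        texts = response.xpath(
            '//article//p//text()'
            '[not(ancestor-or-self::script'
            ' or ancestor-or-self::noscript'
            ' or ancestor-or-self::style)]'
        ).getall()
        # Drop whitespace-only nodes and join the rest into one string.
        yield {'text': ' '.join(t.strip() for t in texts if t.strip())}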