I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the
You can try this XPath expression:
hxs.select('//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()').extract()
i.e., all text nodes that are direct children of //td[@id="contenuStory"] itself or of any of its descendant elements, except the text inside script elements.
To add a space between the text nodes, you can use something like:
u' '.join(
    hxs.select(
        '//td[@id="contenuStory"]/descendant-or-self::*[not(self::script)]/text()'
    ).extract()
)
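If you want to check what this filter keeps without running a spider, here is a rough standard-library sketch of the same idea. It does not use Scrapy or XPath at all: Python's html.parser stands in for the selector, and the class name StoryTextExtractor, the VOID_TAGS set, and the sample markup are all made up for illustration. It collects every text node inside the td with id "contenuStory" and skips text whose parent is a script element, which is what the expression above selects.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StoryTextExtractor(HTMLParser):
    """Hypothetical helper: gathers text inside <td id="contenuStory">,
    ignoring text whose parent element is <script>."""

    def __init__(self, container_id="contenuStory"):
        super().__init__()
        self.container_id = container_id
        self.stack = []   # tags opened since entering the container
        self.texts = []   # collected text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            # Already inside the container: track nesting.
            if tag not in VOID_TAGS:
                self.stack.append(tag)
        elif tag == "td" and dict(attrs).get("id") == self.container_id:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag,
            # which tolerates unclosed tags such as <p>.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text only inside the container and outside <script>.
        if self.stack and self.stack[-1] != "script":
            self.texts.append(data)


# Made-up sample page, only for demonstration.
SAMPLE_HTML = (
    '<table><tr><td id="contenuStory">'
    '<p>First paragraph.</p>'
    '<script>var tracker = 1;</script>'
    '<p>Second <b>bold</b> part.</p>'
    '</td><td>Sidebar</td></tr></table>'
)

parser = StoryTextExtractor()
parser.feed(SAMPLE_HTML)
parser.close()

texts = parser.texts
joined = " ".join(texts)
print(texts)
print(joined)
```

Joining with a space can leave doubled spaces where a text node already starts or ends with whitespace; if that matters, " ".join(joined.split()) collapses them.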