I'm currently setting up a bunch of spiders using Scrapy. These spiders are supposed to extract only text (articles, forum posts, paragraphs, etc.) from the
Try the utility functions from w3lib.html:
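from w3lib.html import remove_tags, remove_tags_with_content

# extract() returns a list of HTML strings, so clean each fragment separately
html_fragments = hxs.select('//div[@id="content"]').extract()
# Drop <script> elements together with their content, then strip the remaining tags
output = [remove_tags(remove_tags_with_content(html, ('script',))) for html in html_fragments]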