How can I get all the plain text from a website with Scrapy?
Question

I would like to get all the text visible on a website, after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Is there any solution for this?

Answer 1:

The easiest option would be to extract //body//text() and join everything found:

    ''.join(sel.select("//body//text()").extract()).strip()

where sel is a Selector instance.

Another option is to use nltk's clean_html():

    >>> import nltk
    >>>
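As a follow-up to the first option (joining the //body//text() nodes), here is a minimal sketch of how the extracted nodes might be cleaned up. The helper name join_text_nodes is hypothetical and not part of Scrapy. Unlike the ''.join(...) one-liner, it joins with single spaces and collapses whitespace runs, which is a deliberate swap rather than the answer's exact method. The commented usage assumes a recent Scrapy release where response.xpath(...).getall() is available, and it adds an XPath predicate to skip script and style contents, which //body//text() would otherwise include.

```python
def join_text_nodes(nodes):
    """Join extracted text nodes into one string.

    Hypothetical helper: collapses runs of whitespace inside each node
    and drops nodes that contain only whitespace.
    """
    pieces = (" ".join(node.split()) for node in nodes)
    return " ".join(piece for piece in pieces if piece)


# Possible use inside a spider callback (assumes a recent Scrapy version):
#
# def parse(self, response):
#     nodes = response.xpath(
#         "//body//text()[not(ancestor::script) and not(ancestor::style)]"
#     ).getall()
#     yield {"text": join_text_nodes(nodes)}
```

Keeping the cleanup in a plain function means it can be checked on a list of strings without running a crawl.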