Is it possible for Scrapy to get plain text from raw HTML data?


悲&欢浪女 2021-02-12 17:27

For example:

scrapy shell http://scrapy.org/
content = response.xpath('//*[@id="content"]').get()
print(content)

Then, I get the followin

3 Answers

轮回少年 (OP)

2021-02-12 18:15

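Another solution is to use lxml.html's tostring() with the parameter method="text". Scrapy already uses lxml internally, so no extra dependency is needed. Passing encoding="unicode" is usually what you want, because the result is then a text string rather than bytes (on Python 2, the builtin unicode type works as well).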

See http://lxml.de/api/lxml.html-module.html for details.
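
As a minimal sketch of what method="text" does on its own, outside Scrapy: the HTML snippet below is made up for illustration and is not taken from scrapy.org.

```python
import lxml.html

# Hypothetical HTML fragment, used only to illustrate the call
html = "<div id='content'><h1>Title</h1><p>Hello <b>world</b>!</p></div>"

root = lxml.html.fromstring(html)

# method="text" drops all markup and concatenates the text nodes;
# encoding="unicode" returns a str instead of bytes
text = lxml.html.tostring(root, method="text", encoding="unicode")
print(text)  # prints: TitleHello world!
```

Note that no whitespace is inserted between block elements, so adjacent headings and paragraphs run together unless the source HTML already contains whitespace between them.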

import lxml.etree
import lxml.html
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki_spider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # Optionally remove elements that browsers do not render as text:
        # comments, scripts, and the HEAD section. Append any other tag
        # names you want dropped. with_tail=False keeps the text that
        # follows a removed element.
        lxml.etree.strip_elements(
            root, lxml.etree.Comment, "script", "head", with_tail=False
        )

        # Complete text of the page as a Unicode string
        print(lxml.html.tostring(root, method="text", encoding="unicode"))

        # Or, as in alecxe's example spider, pinpoint a part of the
        # document using XPath:
        # for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
        #     print(lxml.html.tostring(p, method="text", encoding="unicode"))
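
The same idea can be wrapped in a small helper that works on any HTML string. This is only a sketch: the function name html_to_text, its xpath parameter, the whitespace normalization, and the sample document are all assumptions added for illustration, not part of Scrapy or lxml.

```python
import lxml.etree
import lxml.html


def html_to_text(html, xpath=None):
    """Return the visible text of `html`, optionally limited to `xpath` matches.

    Hypothetical helper for illustration only.
    """
    root = lxml.html.fromstring(html)

    # Drop comments and elements whose content is not rendered as text.
    # with_tail=False keeps any text that follows a removed element.
    lxml.etree.strip_elements(
        root, lxml.etree.Comment, "script", "style", "head", with_tail=False
    )

    nodes = root.xpath(xpath) if xpath else [root]
    parts = [
        lxml.html.tostring(node, method="text", encoding="unicode", with_tail=False)
        for node in nodes
    ]

    # Collapse runs of whitespace inside each part, then join the parts
    return " ".join(" ".join(part.split()) for part in parts)


# Made-up sample document
sample = (
    "<html><head><title>Ignored title</title></head><body>"
    "<div id='content'><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><!-- note --><p>Bye</p></div>"
    "</body></html>"
)

print(html_to_text(sample, "//div[@id='content']/p"))
```

Inside a spider, the helper could be called as html_to_text(response.body) or with an XPath expression to restrict the output to one part of the page.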
