Scrapy: Recursively Scrape Web Pages and Save Content as HTML Files


Question


I am using Scrapy to extract the information in a tag of each web page and then save those pages as HTML files. For example, http://www.austlii.edu.au/au/cases/cth/HCA/1945/ links to a number of pages about judicial cases. I want to follow each link and save only the content related to that particular case as an HTML page, e.g. go to http://www.austlii.edu.au/au/cases/cth/HCA/1945/1.html and save the information related to that case.

Is there a way to do this recursively in Scrapy and save the content as HTML pages?


Answer 1:


Yes, you can do it with Scrapy; link extractors will help:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AustliiSpider(CrawlSpider):
    name = "austlii"
    allowed_domains = ["austlii.edu.au"]
    start_urls = ["http://www.austlii.edu.au/au/cases/cth/HCA/1945/"]
    rules = (
        # Follow every link to an individual 1945 case page and hand it to parse_item
        Rule(LinkExtractor(allow=r"au/cases/cth/HCA/1945/\d+\.html"), follow=True, callback="parse_item"),
    )

    def parse_item(self, response):
        # do whatever with the HTML content (response.body holds the raw page bytes);
        # response.xpath()/response.css() replace the old HtmlXPathSelector
        pass
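
To actually write each case page to disk, as the question asks, parse_item can save the response body. Here is a minimal sketch; the cases/ output directory and the filename-from-URL scheme are my own assumptions, not part of the original answer:

import os

def parse_item(self, response):
    # Derive a filename from the URL, e.g. ".../1945/1.html" -> "1.html"
    filename = response.url.split("/")[-1]
    os.makedirs("cases", exist_ok=True)  # assumed output directory
    path = os.path.join("cases", filename)
    with open(path, "wb") as f:
        f.write(response.body)  # raw HTML bytes of the case page
    self.logger.info("Saved %s to %s", response.url, path)

If you only want the fragment specific to the case rather than the full page, select it first (e.g. with response.css() and a selector matched to the site's markup) and write that string instead. You can run the spider with scrapy runspider austlii_spider.py (filename assumed).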

Hope that helps.



Source: https://stackoverflow.com/questions/17479744/scrapy-recursively-scrape-webpages-and-save-content-as-html-file
