Returning Items in scrapy's start_requests()

Submitted by 五迷三道 on 2021-02-04 18:59:50

Question


I am writing a Scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which Scrapy forbids. How can I circumvent this?

I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
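As a rough sketch of that middleware idea: a downloader middleware may return a Response object from process_request(), which makes Scrapy skip the actual download and call the request callback with that response directly. The PRECLASSIFIED_URLS set below is a hypothetical stand-in for whatever logic decides that a URL needs no download:

```python
from scrapy.http import HtmlResponse

# Hypothetical: URLs that can be classified without fetching them.
PRECLASSIFIED_URLS = {"http://example.com/known-page"}


class ShortCircuitMiddleware:
    """Downloader middleware that skips the download for pre-classified URLs
    by returning an empty Response immediately."""

    def process_request(self, request, spider):
        if request.url in PRECLASSIFIED_URLS:
            # Returning a Response here stops Scrapy from downloading the
            # page; the spider callback still runs and can yield an Item.
            return HtmlResponse(url=request.url, body=b"", encoding="utf-8")
        return None  # every other request is downloaded normally
```

The middleware would still have to be enabled under DOWNLOADER_MIDDLEWARES in the project settings, and the callback would build the Item from request.url (or request.meta) rather than from the empty body.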


Answer 1:


I think using a spider middleware and overriding start_requests() would be a good start.

In your middleware, loop over all the URLs in start_urls and use conditional statements to deal with the different types of URLs.

  • For your special URLs, which do not require a request, you can
    • directly call your pipeline's process_item() (do not forget to import your pipeline and create a scrapy.Item from your URL for this), or
    • as you mentioned, pass the url as meta in a Request, and have a separate parse function which would only return the url
  • For all remaining URLs, you can launch a "normal" Request as you probably already have defined (see the sketch after this list)
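A minimal sketch of that conditional handling, shown here inside start_requests() itself; PageItem, CategoryPipeline and classify_without_download() are assumed names standing in for your own item, pipeline and classification logic:

```python
import scrapy


class PageItem(scrapy.Item):
    url = scrapy.Field()
    category = scrapy.Field()


class CategoryPipeline:
    """Stand-in for your real pipeline (hypothetical)."""
    def process_item(self, item, spider):
        spider.logger.info("classified without download: %r", dict(item))
        return item


def classify_without_download(url):
    """Hypothetical helper: return a category for URLs that need no fetch."""
    return "static" if url.endswith(".pdf") else None


class ClassifySpider(scrapy.Spider):
    name = "classify"
    start_urls = ["http://example.com/a.pdf", "http://example.com/b"]

    def start_requests(self):
        pipeline = CategoryPipeline()  # normally imported from your project
        for url in self.start_urls:
            category = classify_without_download(url)
            if category is not None:
                # Special URL: build the item and hand it straight to the
                # pipeline, since start_requests() may not yield Items.
                pipeline.process_item(PageItem(url=url, category=category), self)
            else:
                # Remaining URLs: issue a normal Request and classify in the
                # callback, carrying the url along in meta.
                yield scrapy.Request(url, callback=self.parse, meta={"url": url})

    def parse(self, response):
        yield PageItem(url=response.meta["url"], category="downloaded")
```

Note that calling process_item() by hand bypasses the configured pipeline chain, so any other pipelines would not see these items; the meta-based alternative keeps everything in the normal flow at the cost of one (possibly short-circuited) request per URL.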


Source: https://stackoverflow.com/questions/35300052/returning-items-in-scrapys-start-requests
