Question
I don't have a specific code issue; I'm just not sure how to approach the following problem logistically with the Scrapy framework:
The structure of the data I want to scrape is typically a table row for each item. Straightforward enough, right?
Ultimately I want to scrape the Title, Due Date, and Details for each row. Title and Due Date are immediately available on the page...
BUT the Details themselves aren't in the table; instead, each row links to the page containing the details (if that doesn't make sense, here's a table):
|----------------------------|----------|
| Title                      | Due Date |
|----------------------------|----------|
| Job Title (Clickable Link) | 1/1/2012 |
| Other Job (Link)           | 3/2/2012 |
|----------------------------|----------|
I'm afraid I still don't know how to logistically pass the item around with callbacks and requests, even after reading through the CrawlSpider section of the Scrapy documentation.
Answer 1:
Please read the Scrapy docs first to understand this answer.
The answer:
To scrape additional fields that live on other pages, extract the URL of the detail page in your parse method, create a Request for that URL with the already-extracted data attached via its meta parameter, and return (or yield) that Request from the parse method.
See also: how do i merge results from target page to current page in scrapy?
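Applied to the question's table, a minimal sketch might look like the following (the start URL, CSS selectors, and field names are assumptions; adjust them to the real page):

import scrapy

class JobSpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['http://www.example.com/jobs']  # hypothetical listing page

    def parse(self, response):
        # Each row holds the Title, the Due Date, and a link to the
        # details page (the selectors here are assumptions).
        for row in response.css('table tr'):
            item = {
                'title': row.css('td a::text').get(),
                'due_date': row.css('td:nth-child(2)::text').get(),
            }
            details_url = row.css('td a::attr(href)').get()
            if details_url:  # header rows have no link and are skipped
                # Hand the partially filled item to the next callback.
                yield response.follow(details_url,
                                      callback=self.parse_details,
                                      meta={'item': item})

    def parse_details(self, response):
        # Pick the item back up and fill in the missing field.
        item = response.meta['item']
        item['details'] = ' '.join(
            response.css('body ::text').getall()).strip()
        yield item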
Answer 2:
An example from the Scrapy documentation:
import scrapy

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Stash the partially filled item on the request so the next
    # callback can pick it up from response.meta.
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    # Retrieve the item and fill in the remaining field.
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
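As a side note, Scrapy 1.7 introduced cb_kwargs as the recommended way to pass data between callbacks; a rough equivalent of the example above (same assumptions: MyItem and the example URL) would be:

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    # Entries in cb_kwargs become keyword arguments of the callback.
    yield scrapy.Request(
        "http://www.example.com/some_page.html",
        callback=self.parse_page2,
        cb_kwargs={'item': item},
    )

def parse_page2(self, response, item):
    item['other_url'] = response.url
    yield item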
Answer 3:
You can also use Python's functools.partial to pass an item, or any other data, to the next Scrapy callback via additional arguments.
Something like:
import functools

from scrapy import Request

# Inside your Spider class:
def parse(self, response):
    # ...
    # Process the first response here, populate item and next_url.
    # ...
    # Bind the extra arguments now; Scrapy will call the resulting
    # callback with the response as its only argument.
    callback = functools.partial(self.parse_next, item, someotherarg)
    return Request(next_url, callback=callback)

def parse_next(self, item, someotherarg, response):
    # ...
    # Process the second response here.
    # ...
    return item
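Compared with meta, this keeps each callback's inputs explicit in its signature. One caveat worth knowing: partial-wrapped callbacks cannot be serialized, so this pattern does not work with persistent request queues (e.g. pausing and resuming a crawl with JOBDIR).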
Source: https://stackoverflow.com/questions/9334522/scrapy-follow-link-to-get-additional-item-data