Why does LinkExtractor not catch the links that were generated after AJAX requests?


Question


I'm crawling a page that generates data with infinite scrolling. I'm using CrawlSpider, and the rules are defined like this:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*some/xpaths'), callback='parse_first_items', follow=True),
    Rule(LinkExtractor(restrict_xpaths='//*some/other/xpaths'), callback='parse_second_items'),
)

In the parse_first_items callback, I build a Request that makes the AJAX request:

from scrapy import Request
from scrapy.loader import ItemLoader

def parse_first_items(self, response):
    l = ItemLoader(item=AmazonCnCustomerItem(), response=response)

    l.add_xpath('field1', '//p[@class="field1"]/text()')
    l.add_xpath('field2', '//p[@class="field2"]/text()')

    # get_xpath() returns a list of matches, so take the first URL
    req_url = l.get_xpath('//*/url/xpath/@href')[0]

    r = Request(url=req_url,
                headers={"Referer": "the/same/page/url",
                         "X-Requested-With": "XMLHttpRequest"},
                callback=self.parse_first_items)

    # Scrapy accepts any iterable of requests and items from a callback
    return r, l.load_item()

I get the desired data just fine, but the LinkExtractor in the second Rule does not catch the URLs from the data generated by the Request inside the parse_first_items function.

How can I make the LinkExtractor in the second Rule extract those links and pass them to the parse_second_items callback?
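For context, CrawlSpider applies its rules only to responses produced by requests that the rules themselves scheduled; a Request built by hand inside a callback bypasses the rule machinery, so the second LinkExtractor never sees the AJAX response. A minimal sketch of one workaround, assuming the AJAX Request's callback is pointed at a hypothetical parse_ajax_page method and reusing the same placeholder XPath as the second rule:

from scrapy import Request
from scrapy.linkextractors import LinkExtractor

def parse_ajax_page(self, response):
    # Rules are never applied to this response, so run the second
    # extractor by hand and schedule the links it finds with the
    # second rule's callback.
    extractor = LinkExtractor(restrict_xpaths='//*some/other/xpaths')
    for link in extractor.extract_links(response):
        yield Request(link.url, callback=self.parse_second_items)

Note this only helps if the AJAX response is HTML the extractor can parse; a JSON payload would need its URLs pulled out manually before building the Requests.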

Source: https://stackoverflow.com/questions/38777236/why-linkextractor-does-not-catch-the-links-that-was-generated-after-ajax-request
