Writing a crawler to parse a site in Scrapy using BaseSpider


Question


I am getting confused about how to design the architecture of the crawler.

I have a search page where I have:

  1. pagination: next page links to follow
  2. a list of products on one page
  3. individual links to be crawled to get the description

I have the following code:

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ol[@id="result-set"]/li')
    for site in sites[:2]:
        item = MyProduct()
        item['product'] = myfilter(site.select('h2/a').select("string()").extract())
        item['product_link'] = myfilter(site.select('dd[2]').select("string()").extract())
        if item['product_link']:
            request = Request(urljoin('http://www.example.com', item['product_link']),
                              callback=self.parseItemDescription)
            request.meta['item'] = item
            return request

    # follow the "Next" pagination link
    soup = BeautifulSoup(response.body)
    mylinks = soup.find_all("a", text="Next")
    nextlink = mylinks[0].get('href')
    yield Request(urljoin(response.url, nextlink), callback=self.parse_page)

The problem is that I have two ways of returning requests: a return statement with a value for the product request, and a yield for the pagination request, which Python does not allow in the same generator.

In a CrawlSpider I didn't need the last yield, so everything worked fine, but with BaseSpider I have to follow the pagination links manually.

What should I do?


Answer 1:


As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
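
For reference, here is a minimal sketch (my own, not part of the original answer) of how parse_page could be restructured in a BaseSpider so that every request is yielded rather than returned. It reuses the question's MyProduct, myfilter, and parseItemDescription names and assumes the same page structure; the spider name and start URL are hypothetical:

from urlparse import urljoin  # Python 2, matching the Scrapy version in the question

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class ProductSpider(BaseSpider):
    name = 'products'
    start_urls = ['http://www.example.com/search']  # hypothetical start URL

    def parse(self, response):
        # BaseSpider calls parse() for the start URLs; delegate to parse_page()
        return self.parse_page(response)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id="result-set"]/li')

        # Yield one request per product instead of returning the first one.
        # A generator may yield any number of requests, but it cannot also
        # return a value, which is what the original code tried to do.
        for site in sites[:2]:
            item = MyProduct()
            item['product'] = myfilter(site.select('h2/a').select("string()").extract())
            item['product_link'] = myfilter(site.select('dd[2]').select("string()").extract())
            if item['product_link']:
                request = Request(urljoin('http://www.example.com', item['product_link']),
                                  callback=self.parseItemDescription)
                request.meta['item'] = item
                yield request

        # Follow the "Next" pagination link with the same callback, so the
        # spider keeps paging through the result set.
        soup = BeautifulSoup(response.body)
        next_links = soup.find_all("a", text="Next")
        if next_links:
            yield Request(urljoin(response.url, next_links[0].get('href')),
                          callback=self.parse_page)

If you later move to CrawlSpider, its Rule and LinkExtractor objects can take over the "Next"-link following, which is essentially the functionality the answer suggests reading through.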



Source: https://stackoverflow.com/questions/13874992/writing-a-crawler-to-parse-a-site-in-scrapy-using-basespider
