Scrapy Last Page is not null and after page 146 last page is showing again

Submitted by 江枫思渺然 on 2020-08-09 08:14:43

Question


The website has 146 pages of words, but after page 146 the last page is served again.

     if next_page is not None:
         yield response.follow(next_page, callback=self.parse)

With this approach the spider does not stop at page 146; it keeps going because pages 147, 148, 149, ... are the same as page 146. I tried a for loop, but that did not work. I also tried reading the value of the next-page button and breaking out of the function using next_extract. The output of next_extract is ['kelimeler.php?s=1'], and the number increases with the page number, e.g. ['kelimeler.php?s=2']. That did not work either.

     next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
     next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

     print(next_page)
     print(next_extract)

     if next_extract is 'kelimeler.php?s=147':
         break
     if next_page is not None:
         yield response.follow(next_page, callback = self.parse)

What should I do to stop the scraping at page 146?

This is the whole parse function:

     def parse(self, response):

         items = TidtutorialItem()

         all_div_kelimeler = response.css('a.collapsed')

         for tid in all_div_kelimeler:

             kelime = tid.css('a.collapsed::text').extract()
             link = tid.css('a.collapsed::text').xpath("@href").extract()

             items['Kelime'] = kelime
             items['Link'] = link

             yield items

         next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
         next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

         print(next_page)
         print(next_extract)

         if next_page is not None:
         #if next_extract is not 'kelimeler.php?s=2':
         #for i in range (10):
             yield response.follow(next_page, callback = self.parse)

Answer 1:


I can't be very precise about the best approach without seeing the page, but I can give you some suggestions.

     next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
     next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

I'm not sure what you are trying to accomplish here: both selectors are essentially the same, except that the second one uses the .extract() method, which returns a LIST. Because it returns a list, the condition in the following lines will ALWAYS be False. A list is never the same object as a string, and is tests identity rather than equality in any case:

    if next_extract is 'kelimeler.php?s=147':
        break
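
A quick way to see the difference, using plain Python values as stand-ins for what the two selectors return. The literal values below are assumptions based on the output quoted in the question, not data taken from the real page:

```python
# Stand-in for the .extract() result in the question: a list of strings.
next_extract = ['kelimeler.php?s=147']
# Stand-in for the .get() result: a single string (or None when nothing matches).
next_page = 'kelimeler.php?s=147'

target = 'kelimeler.php?s=147'

# A list is never the same object as a string, so an identity test is False.
list_is_string = next_extract is target
# A list never compares equal to a string either.
list_equals_string = next_extract == target
# Comparing the first element of the list, or the .get() string, does work.
first_item_matches = next_extract[0] == target
string_matches = next_page == target

print(list_is_string, list_equals_string, first_item_matches, string_matches)
```

Only the last two comparisons can ever be True, which is why the check has to be made against a string with == rather than against the list with is.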

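Another important point is that break is only valid inside a loop. Here it is not inside one, so Python rejects the code with a SyntaxError ('break' outside loop) when the file is compiled, before the if condition is ever evaluated. The Python language reference entry for the break statement has the details.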
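
This can be checked without running a spider at all. The snippet below is only an illustration: it compiles a two-line source string (made up for this example) with the built-in compile() and captures the resulting error message:

```python
# A tiny piece of source code with a break that is not inside any loop.
source = "if True:\n    break\n"

try:
    compile(source, "<example>", "exec")
    message = None  # not reached: compilation fails
except SyntaxError as exc:
    # The error is raised at compile time, so the if is never evaluated.
    message = exc.msg

print(message)
```

Inside a generator function such as parse, a bare return is the way to stop early.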

Again, without seeing the page I can't say this for sure, but I believe this would accomplish what you are trying to do:

    if next_page == 'kelimeler.php?s=147':
        return

Notice next_page instead of next_extract. If you want to use the latter, remember it is a list, not a string.
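
If hard-coding the exact href feels brittle, the same idea can be written as a numeric check. This is a minimal sketch, not a tested fix for the real site: the helper names page_number and should_follow are hypothetical, and it assumes that every next-page href has the form 'kelimeler.php?s=N' shown in the question and that s=146 is the last link worth following.

```python
from urllib.parse import urlparse, parse_qs

LAST_PAGE = 146  # assumption: taken from the page count given in the question


def page_number(href):
    """Return the integer value of the 's' query parameter, or None."""
    if href is None:
        return None
    values = parse_qs(urlparse(href).query).get('s')
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None


def should_follow(href, last_page=LAST_PAGE):
    """True only when href points at a page numbered last_page or lower."""
    number = page_number(href)
    return number is not None and number <= last_page


print(should_follow('kelimeler.php?s=146'))
print(should_follow('kelimeler.php?s=147'))
```

In parse, the existing condition would then become if should_follow(next_page): followed by the same yield response.follow(next_page, callback=self.parse) line. If the site numbers its pages differently, LAST_PAGE needs adjusting.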



Source: https://stackoverflow.com/questions/63024356/scrapy-last-page-is-not-null-and-after-page-146-last-page-is-showing-again
