Scrapy - how to manage pagination without 'Next' button?

Submitted by 烈酒焚心 on 2021-02-11 18:03:36

Question


I'm scraping the content of articles from a site like this where there is no 'Next' button to follow. The ItemLoader is passed from parse_issue in the response.meta object, along with some additional data such as section_name. Here is the function:

    def parse_article(self, response):
        self.logger.info('Parse function called parse_article on {}'.format(response.url))
        acrobat = response.xpath('//div[@class="txt__lead"]/p[contains(text(), "Plik do pobrania w wersji (pdf) - wymagany Acrobat Reader")]')
        limiter = response.xpath('//p[@class="limiter"]')
        if not acrobat and not limiter:
            loader = ItemLoader(item=response.meta['periodical_item'].copy(), response=response)
            loader.add_value('section_name', response.meta['section_name'])
            loader.add_value('article_url', response.url)
            loader.add_xpath('article_authors', './/p[@class="l doc-author"]/b')
            loader.add_xpath('article_title', '//div[@class="cf txt "]//h1')
            loader.add_xpath('article_intro', '//div[@class="txt__lead"]//p')
            article_content = response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
            # check for pagination
            next_page_url = response.xpath('//span[@class="pgr_nrs"]/span[contains(text(), 1)]/following-sibling::a[1]/@href').get()
            if next_page_url:
                # I'm not sure what should be here... Something like this: (???)
                yield response.follow(next_page_url, callback=self.parse_article, meta={
                    'periodical_item': loader.load_item(),
                    'article_content': article_content
                })
            else:
                # article_content is already a list of extracted strings,
                # so add_value (not add_xpath) is the right call here
                loader.add_value('article_content', article_content)
                yield loader.load_item()

The problem is in the parse_article function: I don't know how to combine the content of paragraphs from all pages into one item. Does anybody know how to solve this?


Answer 1:


Your parse_article looks good. If the issue is just adding the article_content to the loader, you only need to fetch the accumulated content from response.meta.

I would update this line (note the default must be an empty list, since getall() returns a list):

article_content = response.meta.get('article_content', []) + response.xpath('.//div[@class=" txt__rich-area"]//p').getall()
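Stripped of the Scrapy machinery, the accumulation pattern in that line is plain list concatenation carried forward through meta. The sketch below simulates it with dicts and lists standing in for response.meta and getall(); the page data and helper name are illustrative only, not part of the original answer:

```python
# Minimal sketch of the cross-page accumulation pattern.
# A plain dict stands in for Scrapy's response.meta, and plain lists
# stand in for response.xpath(...).getall().

def accumulate_paragraphs(meta, new_paragraphs):
    """Combine paragraphs carried in meta with those found on the current page."""
    # Default to an empty list: getall() returns a list, so the
    # accumulator must also be a list (not the empty string '').
    return meta.get('article_content', []) + new_paragraphs

# Simulate three paginated responses belonging to one article.
pages = [
    ['First paragraph.', 'Second paragraph.'],
    ['Third paragraph.'],
    ['Final paragraph.'],
]

meta = {}
for paragraphs in pages:
    combined = accumulate_paragraphs(meta, paragraphs)
    # This dict is what response.follow(..., meta=...) would carry onward.
    meta = {'article_content': combined}

print(meta['article_content'])
```

On the last page (no next_page_url), the combined list is what gets added to the loader via add_value before yielding the item.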



Answer 2:


Just build the next page URLs yourself and iterate over the known number of pages.

I noticed that this article has 4 pages, though others may have more.

The pages are distinguished simply by appending /2 or /3 to the end of the URL, e.g.

https://www.gosc.pl/doc/791526.Zaloz-zbroje/
https://www.gosc.pl/doc/791526.Zaloz-zbroje/2
https://www.gosc.pl/doc/791526.Zaloz-zbroje/3

I don't use Scrapy, but when I need multiple pages I normally just iterate.

When you first scrape the page, find the maximum number of pages for that article. On that site, for example, it says 1/4, so you know you will need 4 pages in total.

url = "https://www.gosc.pl/doc/791526.Zaloz-zbroje/"
data_store = ""
for i in range(1, 5):  # pages 1..4
    actual_url = "{}{}".format(url, i)
    scrape_stuff = content_you_want  # placeholder: fetch and extract the page content here
    data_store += scrape_stuff

# format the collected data
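Rather than hard-coding range(1, 5), the "1/4" indicator mentioned above can be parsed to derive the loop bound. A sketch, assuming the indicator is available as a "current/total" string (the regex and the sample value are assumptions, not verified against the live site's markup); note the first page uses the bare URL, per the URL list above:

```python
import re

def total_pages(indicator):
    """Extract the total page count from a 'current/total' indicator like '1/4'."""
    match = re.search(r'(\d+)\s*/\s*(\d+)', indicator)
    # Fall back to a single page if the indicator is missing or malformed.
    return int(match.group(2)) if match else 1

url = "https://www.gosc.pl/doc/791526.Zaloz-zbroje/"
pages = total_pages('1/4')  # assumed to be scraped from the first page

# Page 1 is the bare article URL; pages 2..N append the page number.
page_urls = [url] + ["{}{}".format(url, i) for i in range(2, pages + 1)]
print(page_urls)
```

Each URL in page_urls can then be fetched and its content appended to data_store as in the loop above.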


Source: https://stackoverflow.com/questions/59446203/scrapy-how-to-manage-pagination-without-next-button
