Trouble clicking on different links using puppeteer

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-06 16:49:16

Problem:

Execution context was destroyed, most likely because of a navigation.

The error means you tried to click a link, or do something else, on a page that no longer exists, most likely because you navigated away from it.

Logic:

Think of the puppeteer script as a real human browsing the real page.

First, we load the url (https://stackoverflow.com/questions/tagged/web-scraping).

Next, we want to go through all the questions on that page. What would we normally do? Either of the following:

  • Open a link in a new tab, focus on that tab, finish our work, and come back to the original tab. Continue with the next link.
  • Click on a link, do our work, go back to the previous page, and continue with the next link.

Both approaches involve moving away from the current page and coming back to it.

If you don't follow this flow, you will get the error message above.
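
To make the failure concrete, here is a minimal sketch of the pattern that typically triggers this error. The function name `clickAllNaively` is hypothetical, and `page` is assumed to be a Puppeteer Page; the point is only that the element handles are collected once and then reused after the page has navigated.

```javascript
// Hypothetical helper illustrating the anti-pattern (do not use as-is).
// `page` is assumed to be a Puppeteer Page.
async function clickAllNaively(page, selector) {
  // These handles belong to the document that is loaded right now.
  const handles = await page.$$(selector);

  for (const handle of handles) {
    // The first iteration works: we click and wait for the navigation.
    await Promise.all([page.waitForNavigation(), handle.click()]);
    // From here on the original document is gone, so the remaining handles
    // are stale and the next click is expected to throw.
  }
}
```

Both solutions below avoid this by never touching an old handle after a navigation: they either collect plain URL strings up front or query the elements again each time.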

Solution

There are at least four ways to resolve this. I will cover the simplest one and a more complex one.

Way: Link Extraction

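First we extract all the links on the current page. Note that $$eval passes an array of elements to the callback, so we map over it.

const links = await page.$$eval(".hyperlink", elements => elements.map(element => element.href));

This gives us a list of URLs. We can open a new tab for each link.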

for(let link of links){
  const newTab = await browser.newPage();
  await newTab.goto(link);
  // do the stuff
  await newTab.close();
}
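
The loop above can be wrapped so that a tab is always closed, even if one page fails to load. This is a sketch rather than the only way to do it: `visitInNewTabs` and the `handlePage` callback are names made up for illustration, and `browser` is assumed to be a Puppeteer Browser.

```javascript
// Hypothetical helper: visit each URL in its own tab, one at a time.
// `handlePage` is a caller-supplied async function that does the per-page work.
async function visitInNewTabs(browser, links, handlePage) {
  const results = [];
  for (const link of links) {
    const tab = await browser.newPage();
    try {
      await tab.goto(link);
      results.push(await handlePage(tab, link));
    } finally {
      // Close the tab whether or not the work above succeeded.
      await tab.close();
    }
  }
  return results;
}

// Example use (assumes `browser` and `links` already exist):
// const titles = await visitInNewTabs(browser, links, tab => tab.title());
```

Because the original tab never navigates, nothing on it is invalidated while the new tabs come and go.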

This goes through the links one by one. We could speed it up by running several tabs in parallel, for example with Bluebird's Promise.map or a queue library, but you get the idea.
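
If you would rather not pull in a library, a small concurrency-limited map is enough for this purpose. The following is a minimal sketch in plain JavaScript; `mapWithConcurrency` is a hypothetical helper name, not part of Puppeteer or of any library mentioned above.

```javascript
// Hypothetical helper: run `worker` over `items`, at most `limit` at a time.
// Results are returned in the same order as `items`.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner pulls the next unclaimed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, runner));
  return results;
}

// Example use (assumes `browser` and `links` already exist):
// const titles = await mapWithConcurrency(links, 3, async link => {
//   const tab = await browser.newPage();
//   try {
//     await tab.goto(link);
//     return await tab.title();
//   } finally {
//     await tab.close();
//   }
// });
```

Keeping the limit small avoids opening dozens of tabs at once, which can be slow and may look abusive to the target site.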

Way: Coming back to main page

We need to store some state so we know which link we visited last. If we visited the third question and came back to the tag page, we need to visit the fourth question next, and so on.

Check the following code.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  await page.goto(
    `https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pagesize=15`
  );

  const visitLink = async (index = 0) => {
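    // waitForSelector replaces the older page.waitFor, which newer Puppeteer versions no longer provide
    await page.waitForSelector("div.summary > h3 > a");

    // extract the links to click, we need this every time
    // because the context will be destroyed once we navigate
    const links = await page.$$("div.summary > h3 > a");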
    // assuming there are 15 questions on one page,
    // we will stop on 16th question, since that does not exist
    if (links[index]) {
      console.log("Clicking ", index);

      await Promise.all([
        // start waiting for the navigation before the click triggers it;
        // there are no inner awaits, so all three promises run concurrently
        page.waitForNavigation(),

        // click the link at the current index
        page.evaluate(element => {
          element.click();
        }, links[index]),

        // also wait for the post body to appear on the question page
        page.waitForSelector(".post-text")
      ]);

      const currentPage = await page.title();
      console.log(index, currentPage);

      // go back and visit next link
      await page.goBack({ waitUntil: "networkidle0" });
      return visitLink(index + 1);
    }
    console.log("No links left to click");
  };

  await visitLink();

  await browser.close();
})();
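
The recursion above can also be written as a plain loop, with the index as the only state carried across navigations. The sketch below is a variant under stated assumptions, not the original author's code: `visitByIndex` and `onVisit` are hypothetical names, and `page` is assumed to be a Puppeteer Page that is already on the listing page.

```javascript
// Hypothetical helper: click the link at each index in turn, call `onVisit`
// on the page that opens, then go back. Only the index survives a navigation.
async function visitByIndex(page, selector, onVisit) {
  for (let index = 0; ; index++) {
    await page.waitForSelector(selector);
    // Re-query after every navigation; old handles are no longer valid.
    const links = await page.$$(selector);
    if (!links[index]) return index; // number of links visited

    await Promise.all([
      page.waitForNavigation(), // start waiting before the click fires
      page.evaluate(el => el.click(), links[index]),
    ]);
    await onVisit(page, index);
    await page.goBack({ waitUntil: "networkidle0" });
  }
}

// Example use (assumes `page` is on the tag listing):
// await visitByIndex(page, "div.summary > h3 > a", async (p, i) => {
//   console.log(i, await p.title());
// });
```

A loop avoids growing the call stack on long listings, but the behaviour is otherwise the same as the recursive version.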

EDIT: There are multiple questions similar to this one. I will be referencing them in case you want to learn more.

Instead of clicking all the links cyclically, I find it better to parse all the links and then navigate to each of them reusing the same browser. Give it a shot:

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch({headless:false});
    const [page] = await browser.pages();
    const base = "https://stackoverflow.com";
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
    const links = [];
    await page.waitForSelector(".summary .question-hyperlink");
    const sections = await page.$$(".summary .question-hyperlink");

    // collect every href first, so no element handle is needed after navigating
    for (const section of sections) {
        const clink = await page.evaluate(el=>el.getAttribute("href"), section);
        links.push(`${base}${clink}`);
    }

    // visit each question in the same tab
    for (const link of links) {
        await page.goto(link);
        await page.waitForSelector('h1 > a');
    }
    await browser.close();
})();
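
Concatenating `base` and the attribute value assumes every `href` is root-relative. As a slightly more defensive sketch, the standard `URL` constructor resolves both relative and absolute values against a base. `toAbsoluteUrls` is a hypothetical helper name, and the paths in the example are made up.

```javascript
// Hypothetical helper: resolve raw href attribute values against a base URL.
// Absolute hrefs are kept as they are; relative ones are resolved against `base`.
function toAbsoluteUrls(hrefs, base) {
  return hrefs.map(href => new URL(href, base).href);
}

// Example with made-up paths:
const resolved = toAbsoluteUrls(
  ["/questions/1/example", "https://example.com/other"],
  "https://stackoverflow.com"
);
console.log(resolved);
// → [ 'https://stackoverflow.com/questions/1/example', 'https://example.com/other' ]
```

The raw attribute values can be collected in the browser exactly as in the script above and then passed through this helper in Node before navigating.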