Puppeteer is unable to get the complete source code

Posted by 做~自己de王妃 on 2019-12-11 00:08:07

Question


I'm creating a simple scraping application with Node.js and Puppeteer. The page I'm trying to scrape is this. Below is the code I'm using right now.

const url = `https://www.betrebels.gr/el/sports?catids=122,40,87,28,45,2&champids=423,274616,1496978,1484069,1484383,465990,465991,91,71,287,488038,488076,488075,1483480,201,2,367,38,1481454,18,226,440,441,442,443,444,445,446,447,448,449,451,452,453,456,457,458,459,460,278261&datefilter=TodayTomorrow&page=prelive`
await page.goto(url, {waitUntil: 'networkidle2'});
let content: string = await page.content();
await page.screenshot({path: 'page.png',fullPage: true});
await fs.writeFile("temp.html", content);
//...Analyze the html and other stuff.

The screenshot I'm getting shows exactly what I'm expecting.

On the other hand, the page content is minimal and doesn't contain the data shown in the image.

Am I doing something wrong? Am I not waiting properly for the JavaScript to finish?


Answer 1:


The page is using frames. page.content() returns only the main document of the page, without the content of the frames. To also get the content of a frame, you first need to find the frame element (e.g. via page.$) and then get its frame via elementHandle.contentFrame. You can then call frame.content() to get the content of the frame.

Simple Example

const frameElementHandle = await page.$('#selector iframe');
const frame = await frameElementHandle.contentFrame();
const frameContent = await frame.content();

Depending on the structure of the page, you may need to do this for multiple frames to get all the contents, or even for a frame inside another frame (which seems to be the case for the given page).
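
For the frame-inside-a-frame case, the lookup from the simple example can be chained. The following is only a minimal sketch: the helper name resolveNestedFrame and any selectors passed to it are hypothetical and not taken from the page in question. It relies only on $ and contentFrame(), which are available on pages, frames and element handles as used above.

```javascript
// Hypothetical helper: walks a chain of iframe selectors, outermost first,
// and returns the innermost frame, or null if any step cannot be resolved.
async function resolveNestedFrame(page, selectors) {
  let current = page; // a Page or a Frame; both expose $()
  for (const selector of selectors) {
    const handle = await current.$(selector);
    if (!handle) return null; // no element matched this selector
    const frame = await handle.contentFrame();
    if (!frame) return null; // the element did not resolve to a frame
    current = frame;
  }
  return current;
}
```

Calling it with a list such as ['#selector iframe', 'iframe'] would return the inner frame (or null), on which frame.content() can then be called.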

Example to read all frame contents

Below is an example that recursively reads the contents of all frames on the page.

const contents = [];
async function extractFrameContents(pageOrFrame) {
  // find every iframe element in the current page or frame
  const frames = await pageOrFrame.$$('iframe');
  for (const frameElement of frames) {
    const frame = await frameElement.contentFrame();
    // contentFrame() can resolve to null; skip such elements
    if (!frame) continue;
    const frameContent = await frame.content();

    // do something with the content, example:
    contents.push(frameContent);

    // recursively repeat for frames nested inside this one
    await extractFrameContents(frame);
  }
}
await extractFrameContents(page);
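
As a possible alternative to walking the iframe elements by hand, page.frames() returns every frame currently attached to the page, nested ones included. The sketch below is not part of the original answer; the function name collectAllFrameContents and the shape of the returned objects are assumptions made for illustration.

```javascript
// Collects the HTML of every frame attached to the page (the main frame and
// any nested frames) using page.frames() instead of manual recursion.
async function collectAllFrameContents(page) {
  const results = [];
  for (const frame of page.frames()) {
    try {
      results.push({ url: frame.url(), html: await frame.content() });
    } catch (err) {
      // A frame can detach while the page is still changing; skip it.
    }
  }
  return results;
}
```

Each entry pairs a frame's URL with its HTML, which makes it easier to tell which frame actually holds the data being scraped.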


Source: https://stackoverflow.com/questions/55994249/puppeteer-is-unable-to-get-the-complete-source-code
