How to get all links from the DOM?

Submitted by 时光怂恿深爱的人放手 on 2020-12-29 06:51:22

Question


According to https://github.com/GoogleChrome/puppeteer/issues/628, I should be able to get all links from <a href="xyz"> elements with this single line:

const hrefs = await page.$$eval('a', a => a.href);

But when I try a simple:

console.log(hrefs)

I only get:

http://example.de/index.html

... as output, which suggests that only 1 link was found. But the page definitely has 12 links in its source code / DOM. Why does it fail to find them all?

Minimal example:

'use strict';
const puppeteer = require('puppeteer');

crawlPage();

function crawlPage() {
    (async () => {

        const args = [
            "--disable-setuid-sandbox",
            "--no-sandbox",
            "--blink-settings=imagesEnabled=false",
        ];
        const options = {
            args,
            headless: true,
            ignoreHTTPSErrors: true,
        };

        const browser = await puppeteer.launch(options);
        const page = await browser.newPage();
        await page.goto("http://example.de", {
            waitUntil: 'networkidle2',
            timeout: 30000
        });

        const hrefs = await page.$eval('a', a => a.href);
        console.log(hrefs);

        await page.close();
        await browser.close();

    })().catch((error) => {
        console.error(error);
    });

}

Answer 1:


In your example code you're using page.$eval, not page.$$eval. Since the former uses document.querySelector instead of document.querySelectorAll, the behaviour you describe is the expected one.

Also, you should change your pageFunction in the $$eval arguments, because it receives an array of elements rather than a single element:

const hrefs = await page.$$eval('a', as => as.map(a => a.href));
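
Putting both fixes together, the crawl step could be wrapped in a small helper along these lines. This is only a sketch: the name collectLinks is hypothetical (it is not part of Puppeteer), and it assumes the page argument is a Puppeteer Page, or at least something exposing compatible goto() and $$eval() methods. The goto options are copied from the question.

```javascript
'use strict';

// Hypothetical helper (not a Puppeteer API): navigate to `url` and
// return the href of every <a> element found on the page.
async function collectLinks(page, url) {
  await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

  // $$eval (two dollar signs) is based on querySelectorAll, so the
  // page function receives an array of all matching elements.
  return page.$$eval('a', anchors => anchors.map(a => a.href));
}

// Possible usage inside an async function (requires the puppeteer package):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   console.log(await collectLinks(page, 'http://example.de'));
//   await browser.close();
```

Taking the page as a parameter keeps the helper independent of how the browser was launched, so the launch options from the question can stay where they are.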



Answer 2:


The page.$$eval() method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to the page function.

Since a in your example represents an array, you will either need to specify which element of the array you want to obtain the href from, or you will need to map all of the href attributes to an array.

page.$$eval()

const hrefs = await page.$$eval('a', links => links.map(a => a.href));
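
The difference between the two page functions can be reproduced without a browser. The sketch below uses plain objects as stand-ins for the anchor elements that $$eval would pass in; the second URL is made up for illustration, and the variable names are not taken from Puppeteer.

```javascript
'use strict';

// Plain objects standing in for the <a> elements handed to the page
// function. The second URL is invented for this example.
const anchors = [
  { href: 'http://example.de/index.html' },
  { href: 'http://example.de/about.html' },
];

// Page function from the question: treats the argument as one element.
const fromQuestion = a => a.href;

// Corrected page function: maps over the array of elements.
const corrected = links => links.map(a => a.href);

// An array has no `href` property, so this lookup yields undefined.
const brokenResult = fromQuestion(anchors);

// Each element is visited in turn, giving one href per link.
const fixedResult = corrected(anchors);

console.log(brokenResult); // undefined
console.log(fixedResult);
```

With $$eval the original one-liner would therefore not return the list of links either; only the mapped version does.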

Alternatively, you can use page.evaluate(), or a combination of page.$$(), elementHandle.getProperty(), and jsHandle.jsonValue(), to obtain an array of all links from the page.

page.evaluate()

const hrefs = await page.evaluate(() => {
  return Array.from(document.getElementsByTagName('a'), a => a.href);
});

page.$$() / elementHandle.getProperty() / jsHandle.jsonValue()

const hrefs = await Promise.all((await page.$$('a')).map(async a => {
  return await (await a.getProperty('href')).jsonValue();
}));


Source: https://stackoverflow.com/questions/49492017/how-to-get-all-links-from-the-dom
