How to scrape JavaScript-rendered websites using JavaScript?

笑着哭i submitted on 2019-12-11 16:16:01

Question


I'm trying to scrape the element matched by $('a[href^="mailto:"]') on this website: https://celsius.network/

When I go to the browser console and run that, I get a link so I know it's there.

The issue is that my request (using the Axios library) returns the HTML as it is before any JavaScript has run. I've set the User-Agent, but that doesn't seem to help.

const axiosClient = () =>
  axios.create({
    headers: {
      "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4"
    },
    timeout: 10000
  });


axiosClient()
  .get("https://celsius.network")
  .then(({ data }) => {
    console.log("DATAAAAAAAA: ", data);
  })

This is returning the original HTML, with the body:

<body>
  <div id="app"> </div>
  ....

instead of the one that's fully loaded after all the javascript has manipulated the DOM.
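
One quick way to confirm that the link really is absent from the raw response, rather than missed by a selector, is to scan the returned string for mailto: links. The helper below is a hypothetical sketch (the name findMailtoHrefs and the "rendered" sample string are made up for illustration); it uses a simple regular expression, which is adequate for a sanity check but is not a full HTML parser.

```javascript
// Hypothetical helper: list every mailto: href found in a raw HTML string.
// A regex is enough for a sanity check; it is not a substitute for an HTML parser.
function findMailtoHrefs(html) {
  const pattern = /href\s*=\s*["'](mailto:[^"']+)["']/gi;
  const hrefs = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    hrefs.push(match[1]);
  }
  return hrefs;
}

// The shell that axios receives has an empty #app container, so nothing is found.
const shell = '<body><div id="app"> </div></body>';
console.log(findMailtoHrefs(shell)); // []

// Illustrative sample of what the DOM might look like after rendering.
const rendered =
  '<body><div id="app"><a href="mailto:presale@celsius.network">Email</a></div></body>';
console.log(findMailtoHrefs(rendered)); // [ 'mailto:presale@celsius.network' ]
```

Running this against the `data` string from the axios call should return an empty array if the link is only added by client-side scripts.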

P.S. I am doing this through Firebase Functions, so I think there are limits to what I can install.

UPDATE

const findEmail = url =>
  new Promise((resolve, reject) => {
     // here!
  });

Answer 1:


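A plain HTTP request isn't enough to reproduce what you see when visiting the page in a browser, because it only fetches the initial HTML and never runs the page's scripts. There are several options out there, but Puppeteer, which drives a headless Chrome instance, may be a good candidate for the job.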

Most things that you can do manually in the browser can be done using Puppeteer!

Check out the following...

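const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser so the page's JavaScript actually runs.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://celsius.network/');
    // Wait for the client-side render to insert the link before reading it.
    await page.waitForSelector('a[href^="mailto:"]');
    const textContent = await page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent);

    console.log(textContent); // presale@celsius.network
  } finally {
    // Close the browser even if navigation or evaluation fails.
    await browser.close();
  }
})();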
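
The snippet above reads the link's text, which equals the address only if the page happens to display it. A sturdier option may be to read the link's href and parse that instead. The helper below is a sketch under that assumption; the name emailsFromMailto is made up for illustration, and it relies only on the standard URL class that Node.js provides globally.

```javascript
// Hypothetical helper: extract the address(es) from a mailto: href.
function emailsFromMailto(href) {
  // new URL() throws on relative or malformed input, so pass absolute hrefs only.
  const url = new URL(href);
  if (url.protocol !== 'mailto:') {
    return [];
  }
  // For mailto: URLs, pathname holds the comma-separated recipients;
  // extras such as ?subject=... end up in url.search and are ignored here.
  return url.pathname
    .split(',')
    .map((address) => decodeURIComponent(address).trim())
    .filter((address) => address.length > 0);
}

console.log(emailsFromMailto('mailto:presale@celsius.network?subject=Hello'));
```

Inside Puppeteer, the href could be obtained with page.$eval('a[href^="mailto:"]', (a) => a.href) and then passed to this helper.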

I'm not totally clear on your constraints. You mention that "there are limits to what I can install", but if you have axios, I'd assume you can install this npm package as well?


Per your update, Puppeteer can also be used through its promise API. The following should do it for you...

const findEmail = url =>
  new Promise((resolve, reject) => {
    puppeteer.launch().then((browser) =>
      browser.newPage()
        .then((page) =>
          // Navigate to the URL passed in, then read the link text in the page context.
          page.goto(url).then(() =>
            page.evaluate(() => document.querySelector('a[href^="mailto:"]').textContent)))
        .then(
          // Close the browser on success and on failure before settling the promise.
          (email) => browser.close().then(() => resolve(email)),
          (error) => browser.close().then(() => reject(error)))
    ).catch(reject);
  });

findEmail('https://celsius.network/').then((email) => {
  console.log(email); // presale@celsius.network
});
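
The promise-based findEmail above could also be written with async/await; an async function still returns a promise, so callers can keep using .then(). The sketch below is one possible shape, not the answer's original code: the name findEmailAsync, the `launcher` parameter (added so the function can be exercised with a stub), and the 10-second timeout are all assumptions made for illustration.

```javascript
// Sketch: async/await variant of findEmail. An async function still returns a promise.
// The `launcher` parameter is an illustrative addition; it defaults to puppeteer
// and only needs to expose a launch() method.
async function findEmailAsync(url, launcher = require('puppeteer')) {
  const browser = await launcher.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Wait until client-side rendering has inserted the link (timeout is an assumption).
    await page.waitForSelector('a[href^="mailto:"]', { timeout: 10000 });
    return await page.$eval('a[href^="mailto:"]', (a) => a.textContent.trim());
  } finally {
    // Always release the browser, even if navigation or the selector wait fails.
    await browser.close();
  }
}
```

It would be called the same way as before, for example findEmailAsync('https://celsius.network/').then((email) => console.log(email)).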


Source: https://stackoverflow.com/questions/47191817/how-to-scrape-javascript-rendered-websites-using-javascript
