Puppeteer - How can I get the current page (application/pdf) as a buffer or file?

泄露秘密 提交于 2019-12-11 06:01:47

问题


Using Puppeteer (https://github.com/GoogleChrome/puppeteer), I have a page that's a application/pdf. With headless: false, the page is loaded though the Chromium PDF viewer, but I want to use headless. How can I download the original .pdf file or use as a blob with another library, such as (pdf-parse https://www.npmjs.com/package/pdf-parse)?


回答1:


Since Puppeteer does not currently support navigation to a PDF document in headless mode via page.goto() due to the upstream issue, you can use page.setRequestInterception() to enable request interception, and then you can listen for the 'request' event and detect whether the resource is a PDF before using the request client to obtain the PDF buffer.

After obtaining the PDF buffer, you can use request.abort() to abort the original Puppeteer request, or if the request is not for a PDF, you can use request.continue() to continue the request normally.

Here's a full working example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.setRequestInterception(true);

  page.on('request', request => {
    if (request.url().endsWith('.pdf')) {
      request_client({
        uri: request.url(),
        encoding: null,
        headers: {
          'Content-type': 'applcation/pdf',
        },
      }).then(response => {
        console.log(response); // PDF Buffer
        request.abort();
      });
    } else {
      request.continue();
    }
  });

  await page.goto('https://example.com/hello-world.pdf').catch(error => {});

  await browser.close();
})();



回答2:


Grant Miller's solution didn't work for me because I was logged in the website. But if the pdf is public this solution works out well.

The solution for my case was to add the cookies

await page.setRequestInterception(true);

page.on('request', async request => {
    if (request.url().indexOf('exibirFat.do')>0) { //This condition is true only in pdf page (in my case of course)
      const options = {
        encoding: null,
        method: request._method,
        uri: request._url,
        body: request._postData,
        headers: request._headers
      }
      /* add the cookies */
      const cookies = await page.cookies();
      options.headers.Cookie = cookies.map(ck => ck.name + '=' + ck.value).join(';');
      /* resend the request */
      const response = await request_client(options);
      //console.log(response); // PDF Buffer
      buffer = response;
      let filename = 'file.pdf';
      fs.writeFileSync(filename, buffer); //Save file
   } else {
      request.continue();
   }
});



来源:https://stackoverflow.com/questions/53487375/puppeteer-how-can-i-get-the-current-page-application-pdf-as-a-buffer-or-file

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!