puppeteer being redirected when browser is not

天大地大妈咪最大 posted on 2021-02-11 14:20:55

Question


I am attempting to test the page https://publicindex.sccourts.org/anderson/publicindex/. When I navigate to it with a standard browser, navigation ends at the requested page (https://publicindex.sccourts.org/anderson/publicindex/), which displays an "accept" button.

However, when testing with puppeteer in headless mode, the request is redirected to https://publicindex.sccourts.org.

I have a rough idea of what is occurring, but I cannot seem to prevent the redirection to https://publicindex.sccourts.org when the page is requested using puppeteer. Here is what I believe is occurring with the user-controlled browser:

  1. The request for the page is sent (assuming a first visit).

  2. The response is pure JS.

  3. The JS code specifies to:

    copy the initial page request headers,

    add a specific header and re-request the same page (XHR),

    copy a URL from one of the response headers and replace the location

    (or)

    check the page history,

    add the URL from the response to the page history,

    open a new window,

    write the XHR response to the new page,

    close the new window,

    add an event listener for a function in the returned XHR response,

    fire the event.

With puppeteer I have tried tracing the JS, recording a HAR, monitoring cookies, watching the request chain, intercepting page requests and adjusting headers, watching the history, etc. I'm stumped.
Here is the most basic version of the puppeteer script:

function run () {
    let url = 'https://publicindex.sccourts.org/anderson/publicindex/';
    const puppeteer = require('puppeteer');
    const PuppeteerHar = require('puppeteer-har');
    puppeteer.launch({headless: true}).then(async browser => {
        const page = await browser.newPage();
        await page.setJavaScriptEnabled(true);
        await page.setViewport({width: 1920, height: 1280});
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
        const har = new PuppeteerHar(page);
        await har.start({path: 'results.har'});
        const response = await page.goto(url);
        await page.waitForNavigation();
        await har.stop();
        let bodyHTML = await page.content();
        console.log(bodyHTML);
    });
};
run();

Why can I not get puppeteer to simply replicate the process that the JS executes when I navigate to the page in Chrome, and end navigation on the "accept" page?

Here is a version with more verbose logging:

function run () {
    let url = 'https://publicindex.sccourts.org/anderson/publicindex/';
    const puppeteer = require('puppeteer');
    const PuppeteerHar = require('puppeteer-har');
    puppeteer.launch().then(async browser => {

        const page = await browser.newPage();

        await page.setJavaScriptEnabled(true);
        await page.setViewport({width:1920,height:1280});
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
        await page.setRequestInterception(true);
        page.on('frameattached', frame =>{ console.log('frame attached ');});
        page.on('framedetached', frame =>{ console.log('frame detached ');});
        page.on('framenavigated', frame =>{ console.log('frame navigated '); });
        page.on('requestfailed', req =>{ console.log('request failed ');});
        page.on('requestfinished', req =>{ console.log('request finished  '); console.log(req.url())});

        let count = 0;
        page.on('request', interceptedRequest => {
            console.log('requesting ' + count + ' times');
            console.log('request for  ' + interceptedRequest.url());
            console.log(interceptedRequest);
            // Abort everything once more than two requests have been counted.
            if (count > 2) {
                interceptedRequest.abort();
                return;
            }
            if (interceptedRequest.url() == url) {
                count++;
                if (count == 1) {
                    // count is 1 only if this is the first request seen:
                    // add the extra headers before letting it through.
                    const headers = interceptedRequest.headers();
                    headers['authority'] = 'publicindex.sccourts.org';
                    headers['sec-fetch-dest'] = 'empty';
                    headers['sec-fetch-mode'] = 'cors';
                    headers['sec-fetch-site'] = 'same-origin';
                    headers['upgrade-insecure-requests'] = '1';
                    interceptedRequest.continue({headers});
                    return;
                } else {
                    interceptedRequest.continue();
                    return;
                }
            }
            count++;
            interceptedRequest.continue();
        });
        const har = new PuppeteerHar(page);
        await har.start({ path: 'results.har' });
        await page.tracing.start({path: 'trace.json'});
        await Promise.all([page.coverage.startJSCoverage({reportAnonymousScripts  : true})]);
        const response = await page.goto(url);
        const session = await page.target().createCDPSession();
        await session.send('Page.enable');
        await session.send('Page.setWebLifecycleState', {state: 'active'});
        const jsCoverage = await Promise.all([page.coverage.stopJSCoverage()]);
        console.log(jsCoverage);
        // Print the URL of each request in the redirect chain, one per line.
        const chain = response.request().redirectChain();
        console.log(chain.map(req => req.url()).join('\n') + "\n\n");
        await page.waitForNavigation();
        await har.stop();
        let bodyHTML = await page.content();
        console.log(bodyHTML);

    });
};

run();

Answer 1:


I don't have a full resolution, but I know where the redirection is happening.

I tested your script locally with the version below:

const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har');

function run () {
    let url = 'https://publicindex.sccourts.org/anderson/publicindex/';
    puppeteer.launch({headless: false, devtools: true }).then(async browser => {
        const page = await browser.newPage();
        await page.setRequestInterception(true);
        page.on('request', request => {
            console.log('GOT NEW REQUEST', request.url());
            request.continue();
        });

        page.on('response', response => {
            console.log('GOT NEW RESPONSE', response.status(), response.headers());
        });
        await page.setJavaScriptEnabled(true);
        await page.setViewport({width: 1920, height: 1280});
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36');
        const har = new PuppeteerHar(page);
        await har.start({path: 'results.har'});
        const response = await page.goto(url);
        await page.waitForNavigation();
        await har.stop();
        let bodyHTML = await page.content();
    });
};
run();

I edited three parts:

  • Disabled headless mode and opened DevTools automatically
  • Intercepted all network requests (which I then audited)
  • Hoisted the require imports, because nesting them hurts my eyes; I always see them called at the top level

It turns out that the page https://publicindex.sccourts.org/anderson/publicindex/ makes a request to https://publicindex.sccourts.org/

However, this request returns a 302 redirect to https://www.sccourts.org/caseSearch/, so the browser follows it.
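A minimal sketch of how that 302 could be surfaced directly from the script instead of read off the DevTools network tab. The helper names describeRedirect and logRedirects are hypothetical (they are not part of puppeteer); the only puppeteer calls assumed are page.on('response', ...), response.status(), response.url() and response.headers().

```javascript
// HTTP status codes that tell the browser to navigate somewhere else.
const REDIRECT_STATUSES = new Set([301, 302, 303, 307, 308]);

// Hypothetical helper: returns a one-line summary for a redirect response,
// or null when the response is not a redirect.
function describeRedirect(response) {
    const status = response.status();
    if (!REDIRECT_STATUSES.has(status)) {
        return null;
    }
    // Puppeteer exposes header names in lower case.
    const location = response.headers()['location'] || '(no Location header)';
    return status + ' ' + response.url() + ' -> ' + location;
}

// Hypothetical helper: attaches a listener that logs every redirect response.
// `page` is expected to be a puppeteer Page (anything with .on() works).
function logRedirects(page, log = console.log) {
    page.on('response', response => {
        const line = describeRedirect(response);
        if (line !== null) {
            log(line);
        }
    });
}
```

Calling logRedirects(page) before page.goto(url) should print one line per redirect, which makes it easier to see which request triggers the jump away from the "accept" page.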

I would investigate whether this odd request is legitimate, and why it redirects in Chrome under puppeteer.
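One possible starting point for that investigation, sketched under the assumption that the difference lies in the headers sent with the request to https://publicindex.sccourts.org/. The helper recordRequestHeaders is hypothetical; it relies only on page.on('request', ...), request.url(), request.method() and request.headers(). Puppeteer's view of the headers may omit some that the network stack adds later, so treat the output as a rough comparison rather than a full capture.

```javascript
// Hypothetical helper: collects method and headers for every request
// whose URL matches targetUrl exactly. Returns the array that is filled
// in as requests arrive.
function recordRequestHeaders(page, targetUrl) {
    const seen = [];
    page.on('request', request => {
        if (request.url() === targetUrl) {
            // Copy the headers so later mutations do not affect the record.
            seen.push({ method: request.method(), headers: { ...request.headers() } });
        }
    });
    return seen;
}
```

This listener only observes. If request interception is enabled, as in the script above, another handler still has to call request.continue(). The recorded headers can then be compared by hand with those shown for the same request in a regular Chrome DevTools session.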

This post might help; there could be something related to Chromium being seen as insecure.

I also tried passing args: ['--disable-web-security', '--allow-running-insecure-content'] in the options object given to launch(), but without results.
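For reference, a sketch of what such a launch call could look like. The function buildLaunchOptions is a hypothetical helper introduced only for illustration; headless, devtools and args are the launch options used in this answer, and the two flags are the ones mentioned above. As noted, they did not change the redirect.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// extraArgs lets further Chromium flags be tried without editing the defaults.
function buildLaunchOptions(extraArgs = []) {
    return {
        headless: false,
        devtools: true,
        args: [
            '--disable-web-security',
            '--allow-running-insecure-content',
            ...extraArgs,
        ],
    };
}

// Usage (requires puppeteer to be installed):
// const puppeteer = require('puppeteer');
// const browser = await puppeteer.launch(buildLaunchOptions());
```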

Please let us know how it goes! Har has been fun to discover!



Source: https://stackoverflow.com/questions/63203053/puppeteer-being-redirected-when-browser-is-not
