How to perform unauthenticated Instagram web scraping in response to recent private API changes?

Submitted by 我的梦境 on 2019-11-27 06:08:54

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

The same User-Agent must then be sent with every subsequent request, along with the csrftoken cookie.
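As a rough sketch of what persisting these values could look like: the helper names getCookieValue and buildPersistedHeaders below are made up for illustration, and the code assumes the Set-Cookie headers are available as an array of strings (which is how Node exposes them, e.g. via initResponse.headers['set-cookie'] in superagent, as far as I know).

```javascript
// Hypothetical helper: pull one cookie value out of a list of Set-Cookie header strings.
function getCookieValue(setCookieHeaders, name) {
    for (const header of setCookieHeaders || []) {
        // The first "name=value" pair of each Set-Cookie header is the cookie itself;
        // everything after the first ';' is attributes (Path, Secure, ...).
        const firstPair = header.split(';')[0];
        const separator = firstPair.indexOf('=');
        if (separator === -1) continue;
        if (firstPair.slice(0, separator).trim() === name) {
            return firstPair.slice(separator + 1).trim();
        }
    }
    return null; // cookie not present
}

// Hypothetical helper: the headers to repeat on every later request.
function buildPersistedHeaders(userAgent, csrfToken) {
    return {
        'User-Agent': userAgent,
        'Cookie': `csrftoken=${csrfToken}`
    };
}
```

You would then pass the result of buildPersistedHeaders(userAgent, csrfToken) to .set() on each following superagent request, so the user agent and token never drift between calls.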

X-Instagram-GIS header generation

As your answer shows, you must generate the X-Instagram-GIS header from two values: the rhx_gis value, which is found in the response to your initial request, and the query variables of your next request. The two are joined with a colon and MD5-hashed, as shown in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

So, in order to call an Instagram query, you need to generate the x-instagram-gis header.

To generate this header, you need to calculate an MD5 hash of the following string: "{rhx_gis}:{path}". The rhx_gis value is stored in the source code of the Instagram page, in the window._sharedData global JS variable.
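A minimal sketch of reading rhx_gis out of the page source. The function name extractRhxGis is hypothetical, and the code assumes the page embeds the data as a single-line `window._sharedData = {...};</script>` assignment; if the markup differs, the regular expression needs adjusting.

```javascript
// Hypothetical helper: find the window._sharedData assignment in the HTML
// and return its rhx_gis property, or null if it cannot be found.
function extractRhxGis(html) {
    // Assumption: the JSON object sits on one line and is followed by ";</script>".
    const match = /window\._sharedData\s*=\s*(\{.+?\});<\/script>/.exec(html);
    if (!match) return null;
    try {
        const sharedData = JSON.parse(match[1]);
        return typeof sharedData.rhx_gis === 'string' ? sharedData.rhx_gis : null;
    } catch (err) {
        // The captured text was not valid JSON.
        return null;
    }
}
```

With superagent you would call it as extractRhxGis(initResponse.text) and bail out early on null rather than hashing the string "null".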

Example:
If you send a GET request for user info such as https://www.instagram.com/{username}/?__a=1
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")
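The example above might be sketched in Node as follows. The helper names signProfilePath and buildProfileRequest are made up for illustration; the signed string and the URL follow the example, and nothing here performs a network request.

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5 of "{rhx_gis}:/{username}/", hex-encoded.
function signProfilePath(rhxGis, username) {
    const path = `/${username}/`;
    return crypto.createHash('md5').update(`${rhxGis}:${path}`, 'utf8').digest('hex');
}

// Hypothetical helper: URL and headers for the ?__a=1 user info request.
function buildProfileRequest(rhxGis, username, userAgent) {
    return {
        url: `https://www.instagram.com/${username}/?__a=1`,
        headers: {
            'User-Agent': userAgent, // same user agent as the initial request
            'X-Instagram-GIS': signProfilePath(rhxGis, username)
        }
    };
}
```

Note that only the path is signed here; the ?__a=1 query string is not part of the hashed string in the example.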

This is tested and works 100%, so feel free to ask if something goes wrong.

Uhm... I don't have Node installed on my machine, so I cannot verify this for sure, but it looks to me like you are missing a crucial parameter in the query string, namely the after field:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 4,
    after: "YOUR_END_CURSOR"
});

Your MD5 hash depends on those queryVariables, so without the after field it doesn't match the expected one. Try that: I expect it to work.

EDIT:

Reading your code carefully, it unfortunately doesn't make much sense. I infer that you are trying to fetch the full stream of pictures from a user's feed.

In that case, you should not call the Instagram home page as you are doing now (superagent.get('https://www.instagram.com/')), but rather the user's stream (superagent.get('https://www.instagram.com/your_user')).

Beware: this request must use the very same user agent you're going to use below (and it doesn't look like it does...).

Then, you need to extract the query ID and the end_cursor. The query ID is not fixed: it changes every few hours, sometimes minutes, so hardcoding it is foolish; for this POC, however, you can keep it hardcoded. For the end cursor I'd go for something like this:

const endCursor = (RegExp('end_cursor":"([^"]*)"', 'g')).exec(initResponse.text)[1];
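The one-liner above throws if the page contains no end_cursor string, because exec() returns null in that case. A slightly more defensive sketch, assuming the same "end_cursor":"..." shape in the page source (the function name extractEndCursor is hypothetical):

```javascript
// Hypothetical helper: return the first end_cursor value in the page source,
// or null when there is none (no match, or an empty cursor string).
function extractEndCursor(html) {
    const match = /"end_cursor":"([^"]*)"/.exec(html);
    return match && match[1] ? match[1] : null;
}
```

A null result can then be treated as "no further page" instead of crashing the script.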

Now you have everything you need to make the second request:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9,
    after: endCursor
});

// Same two-argument signature as generateRequestSignature defined above.
const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'Accept': '*/*',
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close',
        'X-Instagram-GIS': signature,
        'Cookie': `rur=${rurCookie};csrftoken=${csrfTokenCookie};mid=${midCookie};ig_pr=1`
    }).send();
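One detail worth making explicit: since the hash is computed over the query variables, the string you hash should presumably be the very same string you send as the variables parameter. A small sketch that serializes once and reuses the result for both (the function name buildSignedQuery is hypothetical, and it only builds the URL and signature without sending anything):

```javascript
const crypto = require('crypto');

// Hypothetical helper: build the GraphQL query URL and its matching signature.
function buildSignedQuery(rhxGis, queryHash, variables) {
    // Serialize once, then reuse the same string for the hash and the URL.
    const serialized = JSON.stringify(variables);
    const signature = crypto.createHash('md5')
        .update(`${rhxGis}:${serialized}`, 'utf8')
        .digest('hex');
    const params = new URLSearchParams({ query_hash: queryHash, variables: serialized });
    return {
        url: `https://www.instagram.com/graphql/query/?${params.toString()}`,
        signature
    };
}
```

Calling JSON.stringify twice on differently ordered objects would produce different strings and therefore different hashes, which is why the sketch keeps a single serialized copy.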

query_hash is not constant and keeps changing over time.

For example, the ProfilePage included these scripts:

https://www.instagram.com/static/bundles/base/ConsumerCommons.js/9e645e0f38c3.js
https://www.instagram.com/static/bundles/base/Consumer.js/1c9217689868.js

The hash is located in one of the above scripts, e.g. for edge_followed_by:

const res = await fetch(scriptUrl, { credentials: 'include' });
const rawBody = await res.text();
const body = rawBody.slice(0, rawBody.lastIndexOf('edge_followed_by'));
const hashes = body.match(/"\w{32}"/g);
// Each match still includes its surrounding double quotes.
// hashes[hashes.length - 2] is the hash for edge_followed_by
// hashes[hashes.length - 1] is the hash for edge_follow
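The same idea as a small reusable function, as a sketch only: the name extractQueryHashes is made up, and the assumption that the relevant hashes are the last 32-character quoted tokens before the marker comes straight from the snippet above and may break whenever the bundle layout changes.

```javascript
// Hypothetical helper: collect every quoted 32-character token that appears
// before the last occurrence of `marker`, with the quotes stripped.
function extractQueryHashes(scriptBody, marker) {
    const cutoff = scriptBody.lastIndexOf(marker);
    if (cutoff === -1) return []; // marker not in this script
    const matches = scriptBody.slice(0, cutoff).match(/"\w{32}"/g) || [];
    return matches.map((quoted) => quoted.slice(1, -1));
}
```

Given the result as hashes, hashes[hashes.length - 2] and hashes[hashes.length - 1] correspond to the two values described in the comments above; check that the array has at least two entries before indexing into it.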