Amazon Seller Central Login Scrape PhantomJS + CasperJS

会有一股神秘感。 Submitted on 2019-12-24 12:44:30

Question


I want to start off by saying that we only scrape our own account, because my company needs data from our own dashboard that we can't get from the MWS APIs. I am very familiar with those APIs.

I've had login/scraping scripts for years. But recently Amazon started offering up captchas. My old way of scraping was from PHP making cURL requests to mimic the browser.

My new approach is using PhantomJS and CasperJS to achieve the same effect. Everything was working fine for a day, but I'm getting captcha again.

Now, I happen to know from internal sources that Amazon isn't doing any scrape detection. They do however do hacking / DDOS attack detection. So I think something about this casperJS code is getting flagged as an attack.

I don't think I'm calling the script too often. And I've changed my IP address that the requests are coming from.

Here is the CasperJS code:

var fs = require('fs');
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
    }
});

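// Reuse cookies saved by a previous run, if any (fs.read throws on a missing file)
var cookieFilename = "cookies/_cookies.txt";
if (fs.exists(cookieFilename)) {
    var data = fs.read(cookieFilename);
    if (data) {
        phantom.cookies = JSON.parse(data);
    }
}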

//First step is to open Amazon
casper.start("https://sellercentral.amazon.com/gp/homepage.html", function() {
    console.log("Amazon website opened");
});

casper.wait(1000, function() {
    if(this.exists("form[name=signinWidget]")) {
        console.log("need to login");
        //Now we have to populate username and password, and submit the form
        casper.wait(1000, function(){
            console.log("Login using username and password");
            this.evaluate(function(){
                document.getElementById("username").value="*****";
                document.getElementById("password").value="*****";
                document.querySelector("form[name=signinWidget]").submit();
            });
        });
        // write the cookies ('w' is the PhantomJS fs.write mode for overwriting)
        casper.wait(1000, function() {
            var cookies = JSON.stringify(phantom.cookies);
            fs.write(cookieFilename, cookies, 'w');
        });
    } else {
        console.log("already logged in");
    }
});


//Wait to be redirected to the Home page, and then dump the page content
casper.wait(1000, function(){
    console.log("is login found?");
    console.log(this.exists("form[name=signinWidget]"));
    this.echo(this.getPageContent());
});

casper.run();

The result of that last line is just a login page with captcha. What gives? This should be a normal browser. When I use the same login on my computer, I get no issues at all.

I've also tried several different user agent strings. Sometimes changing those works temporarily.

Also, when I run all this locally, it works fine. But on the Linux server it gets the captcha. Note that I've changed the IP on the remote Linux server many times. It still gets the captcha.


Answer 1:


As often happens with scraping and automation, the cause of an error is not necessarily an incorrectly written script; it can also be the context, i.e. the underlying infrastructure.

In this case we determined (in the comments) that the script was challenged with a captcha only when run from a particular server, whose IP address seems to have been put on an untrusted list.
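
To compare runs from different machines or IP addresses, it can help to classify what the script actually received instead of reading the dumped HTML by eye. Below is a minimal sketch in plain JavaScript. The function name classifyPage is hypothetical, the login marker is the form[name=signinWidget] selector from the question, and the captcha check (a case-insensitive search for the word "captcha" in the markup) is only an assumption about what the challenge page contains.

```javascript
// Classify a page's HTML so runs from different servers/IPs can be compared.
// Assumption: a challenge page contains the word "captcha" somewhere in its
// markup. The login-form marker is taken from the script in the question.
function classifyPage(html) {
    var text = String(html || '');
    if (/captcha/i.test(text)) {
        return 'captcha'; // challenged, possibly on top of the login form
    }
    if (/<form[^>]*name=["']?signinWidget/i.test(text)) {
        return 'login';   // plain login form, no challenge detected
    }
    return 'other';       // neither marker found (e.g. already logged in)
}
```

Inside a CasperJS step this could be called as console.log(classifyPage(this.getPageContent())); logging the result next to the server name makes it easier to see whether only one machine is being challenged.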



Source: https://stackoverflow.com/questions/34078639/amazon-seller-central-login-scrape-phantomjs-casperjs
