Scrape Amazon all deals with PHP cURL?

Posted by 一世执手 on 2019-12-01 09:46:00

Question


I want to scrape the Amazon all-deals page:

http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1

So I am using PHP cURL:

$request = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 80);
$file_source = curl_exec($ch);
print_r($file_source);
exit;

The scrape completes, but the content div in the response page is empty, because all of the content is loaded through dynamic Ajax requests on Amazon. How can I scrape all of the deal products using PHP and cURL?

My response image link

Updated code

 $request = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';

        $header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        /*$header[] = "Accept-Language: en-US,en;q=0.5";*/
    /*  $header[] = "Accept-Encoding: gzip, deflate";*/
        $header[] = 'Cookie: x-wl-uid=1vlKm5hBxhHPg37UgkrAPYZZaV0wv+T5knGezWJq0AIEWI30hJYp0XouddMIZeemj1LKAi9fDQq7aoFN+mbvlVYPTBQVLFdzs0aeTGWtiCY0Ay63L0ezPfZRKXQHC
/Wum4ywRviFW9es=; session-id-time=2082787201l; session-id=192-9168386-7231424; ubid-main=187-6710460-8617661
; session-token="+SFC4vDx/BvcD8D1Mdgeo2jtnTD0qPHF5j2nWNwbFGcRyW7/o4LBOmBHJosU5W0SgoAd6lhi0NZWg/6o5WE6o45k
+VCT5a5dgj0tltSEkBT80oWT0CDk+jCDEEhIcxnCe6aqkUn6soFiMJHIsMWujo4qyA6A70PC1xKGKdIFMUm3H0DGSdIMqITs4Mjb1
/1vY6GxnPeh5ncasxl+tUN2dHVwwJbj1ZrmyJdDxSDd8/o="; __utma=194891197.2101747155.1434117141.1434356635.1434362529
.4; __utmz=194891197.1434362529.4.4.utmccn=(referral)|utmcsr=stackoverflow.com|utmcct=/questions/11589556
/retrieving-an-amazon-stores-list-of-products-using-php|utmcmd=referral; x-main="Xi0312Ip8BrjoFoj6Zp9OLxDcU6kCvlm4DExlT5yNgHa2b3htenxvUsF2TZR3
?Fn"; s_pers=%20s_vnum%3D1866356399079%2526vn%253D2%7C1866356399079%3B%20s_invisit%3Dtrue%7C1434364356330
%3B%20s_nr%3D1434362556331-Repeat%7C1442138556331%3B; csm-hit=b-1RHERWP84F8S70KRQ903|1434453087266; preferred-geo
=national; UserPref=O9NYa0FpfOIAcRMnkQf7WL3LyhrjCsMBKgKfVxT4zK8uOTF5KjzPAwmz0DuVnfXhdkinEE1BEMgPn09eHwavl
+Hwl1BOSvjp1ewiG1iCXa0R77FsPOGbpq06MWB0MC7Wwff4gehUEAle5IfyFQqKGh1XvJ4YiMFsR2mwmyzzVJTo0WPGZzvvpCVLFmx22cRVwEi4sX8y
+IfEKu76B4p1GHPdZVo1HIwLooo8CT7lboNUi4Hhn6mhtyGCNEDLvWD8NII48Vd9EkcBjUpiSeNroRjYO9yNkj8SI3xJVI0befNipOfxAzPSnuQqeBpqm99bWArk9ZZl
+EM5QKzoPNJSF0FqVnnYavt4G6F/PHedaJVl8pU0A6N9lBjK6YZRFflyaoEYPtUW+nqK0xqO+YusAMAlhHBuW33KMdtt3i6oufQ4yTDqIgAiQ1ZTXcsb2tcu
; s_dslv=1434370132739; lc-main=en_US; aws-target-visitor-id=1434357190046-572838.22_02; aws-target-data
=%7B%22support%22%3A%221%22%7D; s_fid=7BB6DD9CE8128EC3-2A07290402DD6AF6; s_vn=1465893191447%26vn%3D1
; s_nr=1434370132733-New; s_vnum=1866370132735%26vn%3D1; skin=noskin; b2b-main=0';
        $header[] = "Connection: keep-alive";
        $reffer = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';
        $ch = curl_init();
        curl_setopt($ch,CURLOPT_URL,$request);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0');
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header); 
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_REFERER, $reffer);
        curl_setopt($ch, CURLOPT_TIMEOUT, 80);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        
        $file_source = curl_exec($ch);

        print_r($file_source);

Answer 1:


Based on my quick research, you could query the XHRs that Amazon makes to request the deals.

Such dynamic websites get their data through Ajax JSON calls. You can find out where the data is dynamically downloaded from (using developer tools or a web sniffer) and then query those URLs for the data.

See the screenshot. If you query them with PHP cURL, though, you should imitate the HTTP headers of that particular request (including cookies).
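As a rough illustration of imitating the request headers, here is a minimal sketch. The helper names `buildHeaderList` and `fetchXhr` are hypothetical, and the header values are placeholders that I am assuming resemble what a browser sends; copy the real ones from the request shown in your developer tools.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): converts header
 * name/value pairs into the "Name: value" list CURLOPT_HTTPHEADER expects.
 */
function buildHeaderList(array $headers)
{
    $list = [];
    foreach ($headers as $name => $value) {
        $list[] = $name . ': ' . $value;
    }
    return $list;
}

/**
 * Hypothetical helper: requests a URL found in the Network tab while
 * sending the same headers the browser sent.
 * Returns the response body as a string, or false on failure.
 */
function fetchXhr($url, array $headers, $referer)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_HTTPHEADER     => buildHeaderList($headers),
        CURLOPT_REFERER        => $referer,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 80,
    ]);
    return curl_exec($ch);
}

// Placeholder values (assumptions): replace them with the headers of the
// actual XHR as displayed by the browser's developer tools.
$browserHeaders = [
    'Accept'           => 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language'  => 'en-US,en;q=0.5',
    'X-Requested-With' => 'XMLHttpRequest',
    'User-Agent'       => 'Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0',
];
```

Calling `fetchXhr($xhrUrl, $browserHeaders, $pageUrl)` with the XHR URL and the page URL as referer would then send a request that looks closer to the one the browser makes; whether the server accepts it depends on details this sketch does not cover.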

Update

Based on your new curl request...

  1. The Amazon page (its JS logic) makes an XHR to its server for each product item. The XHRs look like this: http://www.amazon.com/xa/dealcontent/v2/GetDealMetadata?nocache=1434445645152, not http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1, which is only the referer.

  2. A request for a product item is a POST, not a GET.

  3. You probably took the cookies from your browser and inserted them into the PHP curl header. That is wrong: those cookies belong to your browser session and are unrelated to the session of your PHP server, which is what will request the XHRs. Use a cookie jar for this instead; see the post.

  4. The POST payload is an object and must be formed with the known structure. Form data:
     {"requestMetadata":{"marketplaceID":"ATVPDKIKX0DER","sessionID":"175-4567874-0146849","clientID":"goldbox"},"widgetContext":{"pageType":"GoldBox","subPageType":"AllDeals","deviceType":"pc","refRID":"1VFVJBKEYZT3DGWSANXQ","widgetID":"1969939662","slotName":"center-6"},"page":1,"dealsPerPage":8,"itemResponseSize":"NONE","queryProfile":{"featuredOnly":false,"dealTypes":["LIGHTNING_DEAL","BEST_DEAL"],"includedCategories":["283155","599858","154606011"],"excludedExtendedFilters":{"MARKETING_ID":["restrictedcontent"]}}}

See the developer tools picture:

  5. As Michael - sqlbot mentioned, what you are trying to do violates Amazon's Terms of Use. For the sake of the scraping technique, though, I have still updated my answer.
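
The points above (XHR endpoint, POST, cookie jar, structured payload) could be combined roughly as follows. This is only a sketch under assumptions: the function names are hypothetical, the payload values are copied from the captured request and may be specific to that page load, the session ID is assumed to be supplied by the caller, and treating `nocache` as a millisecond timestamp is my guess from its appearance.

```php
<?php
/**
 * Hypothetical helper: builds the JSON body using the structure captured
 * in the developer tools. refRID and widgetID are copied from that capture
 * and may differ per page load (assumption).
 */
function buildDealRequestBody($sessionId, $page = 1, $dealsPerPage = 8)
{
    return json_encode([
        'requestMetadata' => [
            'marketplaceID' => 'ATVPDKIKX0DER',
            'sessionID'     => $sessionId,
            'clientID'      => 'goldbox',
        ],
        'widgetContext' => [
            'pageType'    => 'GoldBox',
            'subPageType' => 'AllDeals',
            'deviceType'  => 'pc',
            'refRID'      => '1VFVJBKEYZT3DGWSANXQ',
            'widgetID'    => '1969939662',
            'slotName'    => 'center-6',
        ],
        'page'             => $page,
        'dealsPerPage'     => $dealsPerPage,
        'itemResponseSize' => 'NONE',
        'queryProfile'     => [
            'featuredOnly'            => false,
            'dealTypes'               => ['LIGHTNING_DEAL', 'BEST_DEAL'],
            'includedCategories'      => ['283155', '599858', '154606011'],
            'excludedExtendedFilters' => ['MARKETING_ID' => ['restrictedcontent']],
        ],
    ]);
}

/**
 * Hypothetical helper: loads the deals page first so the server's cookies
 * are stored in the cookie jar, then POSTs the payload to the XHR endpoint
 * with those cookies. Returns the response body, or false on failure.
 */
function fetchDealMetadata($cookieJar, $body)
{
    $pageUrl = 'http://www.amazon.com/gp/goldbox/all-deals/ref=sv_gb_1';
    $xhrUrl  = 'http://www.amazon.com/xa/dealcontent/v2/GetDealMetadata?nocache='
             . (int) round(microtime(true) * 1000);
    $common = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 80,
        CURLOPT_COOKIEJAR      => $cookieJar,  // cookies are written here
        CURLOPT_COOKIEFILE     => $cookieJar,  // and read back from here
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0',
    ];

    // Step 1: GET the page so this PHP session receives its own cookies.
    $ch = curl_init($pageUrl);
    curl_setopt_array($ch, $common);
    curl_exec($ch);
    unset($ch); // releasing the handle flushes the cookies to the jar file

    // Step 2: POST the payload. For a string body curl sends
    // "Content-Type: application/x-www-form-urlencoded" by default; compare
    // with the captured request and override via CURLOPT_HTTPHEADER if needed.
    $ch = curl_init($xhrUrl);
    curl_setopt_array($ch, $common + [
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => $body,
        CURLOPT_REFERER    => $pageUrl,
    ]);
    return curl_exec($ch);
}
```

A call might look like `fetchDealMetadata(tempnam(sys_get_temp_dir(), 'cj'), buildDealRequestBody($sessionId))`, where `$sessionId` has to come from your own session; the sketch does not show how to obtain it.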


Source: https://stackoverflow.com/questions/30861112/scrap-amazon-all-deals-php-curl
