How to download a csv file using PhantomJS

大城市里の小女人 submitted on 2019-12-17 09:51:53

Question


When I browse website A in a normal browser (Chrome) and click a link on it, Chrome immediately downloads a report as a CSV file.

When I check the server's response headers, I see the following:

Cache-Control:private,max-age=31536000
Connection:Keep-Alive
Content-Disposition:attachment; filename="report.csv"
Content-Encoding:gzip
Content-Language:de-DE
Content-Type:text/csv; charset=UTF-8
Date:Wed, 22 Jul 2015 12:44:30 GMT
Expires:Thu, 21 Jul 2016 12:44:30 GMT
Keep-Alive:timeout=15, max=75
Pragma:cache
Server:Apache
Transfer-Encoding:chunked
Vary:Accept-Encoding

Now I want to download and parse this file using PhantomJS. I set an onResourceReceived listener on the page to see whether Phantom receives/downloads the file.

clientRequests.phantomPage.onResourceReceived = function(response) {
    console.log('Response (#' + response.id + ', stage "' + response.stage + '"): ' + JSON.stringify(response));
};

When I make a Phantom request to download the file (that is, page.open('URL OF THE FILE')), I can see in the Phantom log that the file is downloaded. Here are the logs:

"contentType": "text/csv; charset=UTF-8",
    "headers": {
        "name": "Date",
        "value": "Wed, 22 Jul 2015 12:57:41 GMT"
    },
    "name": "Content-Disposition",
    "value": "attachment; filename=\"report.csv\"",
    "status":200,"statusText":"OK"

I received the file and its content, but how do I access the file data? When I print the current PhantomJS page object, I get the HTML of page A, which I don't want. I want the CSV file, which I need to parse using JavaScript.


Answer 1:


I found a solution for PhantomJS. Reading through this discussion I found a jsfiddle which downloads a URL via jQuery's ajax method and encodes the file as base64.

The file I wanted to download was plain text (CSV), so I removed the encoding functions. My target page also already included jQuery, so I didn't need to inject it.

My code assumes you have already opened the page you want to download the file from using PhantomJS, and that the page has jQuery in it. In my case I first had to log in to the site in order to get the download link.

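var fs = require('fs');

// `this` must be the PhantomJS page that is already open
// (replace it with your own page variable if needed).
var page = this;

// The function below runs inside the target page, which must have jQuery loaded.
var result = page.evaluate(function() {
    var out;
    $.ajax({
        'async' : false,   // synchronous, so `out` is set before evaluate() returns
        'url' : 'fullurltodownload.csv',
        'success' : function(data, status, xhr) {
            out = data;
        }
    });
    return out;
});

// Back in the PhantomJS script: save the CSV text to disk.
fs.write('mydownloadedfile.csv', result, 'w');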
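Since the question also asks about parsing the CSV in JavaScript, here is a minimal sketch of a parser that could be applied to the `result` string. It is written in ES5 so it should also run inside PhantomJS. The function name `parseCsv` is my own, and it assumes a comma delimiter, double-quote quoting, and `""` as an escaped quote; a real report may need a different delimiter.

```javascript
// Minimal CSV parser sketch (ES5).
// Assumptions: comma delimiter, double-quote quoting, "" as an escaped quote.
function parseCsv(text) {
    var rows = [], row = [], field = '', inQuotes = false;
    for (var i = 0; i < text.length; i++) {
        var c = text.charAt(i);
        if (inQuotes) {
            if (c === '"') {
                if (text.charAt(i + 1) === '"') {
                    field += '"';   // escaped quote inside a quoted field
                    i++;
                } else {
                    inQuotes = false;   // closing quote
                }
            } else {
                field += c;
            }
        } else if (c === '"') {
            inQuotes = true;
        } else if (c === ',') {
            row.push(field);
            field = '';
        } else if (c === '\n' || c === '\r') {
            if (c === '\r' && text.charAt(i + 1) === '\n') {
                i++;   // treat CRLF as a single line break
            }
            row.push(field);
            field = '';
            rows.push(row);
            row = [];
        } else {
            field += c;
        }
    }
    // Flush the last record when the text does not end with a line break.
    if (field !== '' || row.length > 0) {
        row.push(field);
        rows.push(row);
    }
    return rows;
}
```

With this, `parseCsv(result)` gives an array of rows, each row being an array of strings. Blank lines come back as rows with a single empty field, so filter them out if the report contains any.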



Answer 2:


After days and days of investigation, I found that there are a few solutions:

  • In your evaluate function you can make an AJAX call to download and encode your file, then return the content to the Phantom script.
  • You can use one of the custom Phantom libraries available on GitHub.

If you need to download a file using PhantomJS, then run away from PhantomJS and use CasperJS. CasperJS is based on PhantomJS, but it has a much better and more intuitive syntax and program flow.

Here is a good post explaining "Why CasperJS is better than PhantomJS". It has a section about file download.

How to download a CSV file using CasperJS (this works even when the server sends the header Content-Disposition: attachment; filename="file.csv")

A sample CSV file is available for download here: http://captaincoffee.com.au/dump/items.csv

To download this file using CasperJS, run the following code:

var casper = require('casper').create();

// Open the containing page first.
casper.start("http://captaincoffee.com.au/dump/", function() {
    this.echo(this.getTitle());
});
casper.then(function(){
    var url = 'http://captaincoffee.com.au/dump/csv.csv';
    // Fetch the file with a GET request and print its contents as base64.
    require('utils').dump(this.base64encode(url, 'get'));
});

casper.run();

The code above downloads the CSV file at http://captaincoffee.com.au/dump/csv.csv and prints its contents as a base64 string. This way you don't even have to save the data to a file; you already have it as a base64 string.
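To get from that base64 string back to CSV text, something like the following sketch could be used. The helper name `base64ToText` is my own, not part of CasperJS, and it assumes the file is UTF-8 text (as the charset in the question's headers suggests). It avoids `Buffer` and `atob` so that the same function should work in a PhantomJS/CasperJS script and in Node.

```javascript
// Decode a base64 string into text using plain ES5 only.
// Assumption: the decoded bytes are UTF-8 encoded text.
function base64ToText(b64) {
    var alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    var clean = b64.replace(/[^A-Za-z0-9+\/]/g, '');   // drop padding and whitespace
    var bytes = '';
    var bits = 0;       // accumulator holding not-yet-emitted bits
    var bitCount = 0;   // number of valid bits in the accumulator
    for (var i = 0; i < clean.length; i++) {
        bits = (bits << 6) | alphabet.indexOf(clean.charAt(i));
        bitCount += 6;
        if (bitCount >= 8) {
            bitCount -= 8;
            bytes += String.fromCharCode((bits >> bitCount) & 0xff);
            bits &= (1 << bitCount) - 1;   // keep only the leftover bits
        }
    }
    // `bytes` is a binary string; reinterpret its bytes as UTF-8.
    return decodeURIComponent(escape(bytes));
}
```

Inside the `casper.then` step above, this would be called as `base64ToText(this.base64encode(url, 'get'))`.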

If you explicitly want to save the file to the file system, you can use the download function that CasperJS provides.
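A minimal sketch of that, assuming the same example URL as above. The helper `queueDownload` and the target file name `csv.csv` are my own choices for illustration; only `casper.download(url, target)` comes from CasperJS itself.

```javascript
// Queue a CasperJS step that saves `url` to the local file `target`.
// `casper` is an instance created with require('casper').create().
function queueDownload(casper, url, target) {
    casper.then(function () {
        // Inside a step, `this` is the casper instance.
        this.download(url, target);
        this.echo('Saved ' + url + ' to ' + target);
    });
}

// Only run the scenario under CasperJS/PhantomJS, where `phantom` is defined.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create();
    casper.start('http://captaincoffee.com.au/dump/');
    // 'csv.csv' is a hypothetical target path; pick whatever suits you.
    queueDownload(casper, 'http://captaincoffee.com.au/dump/csv.csv', 'csv.csv');
    casper.run();
}
```

Run it with the casperjs command; the file should then appear in the current working directory.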




Answer 3:


The previous two answers assume you know the URL of the final CSV file in advance. That won't be the case if the link goes to an HTML page that does a JavaScript-computed redirect to the file and you don't want to evaluate that JavaScript outside of PhantomJS. Your options then are:

  1. Put PhantomJS behind an upstream proxy, and use that proxy to intercept the download URL (and its expected Cookie and Referer headers). You would have to be careful to positively identify the real download URL and not some random data 'blob' if the page makes binary XMLHttpRequests as well.
  2. Instead of PhantomJS, use Headless Chrome, which can automatically save downloaded files (or Firefox with PyVirtualDisplay, which can also be set to do this, or wait for Headless Firefox), and monitor the downloads directory. You would have to work out for yourself when the download has completed, or use an upstream proxy to monitor it for completion. However, Headless Chrome/Firefox cannot currently be set to ignore SSL certificates, so if the site goes "secure" it is much more difficult to monitor the requests of Headless Chrome/Firefox than those of PhantomJS, at least until Chromium issue 721739 is fixed. You could watch a CONNECT request, but if the connection is kept alive you will have no way of knowing for sure that a transfer has finished.
  3. Put PhantomJS behind an upstream proxy that changes all unknown content types to text/plain and deletes Content-Disposition headers, so you can read the file from PhantomJS in the normal way. That should work for a CSV file but won't work for binaries with 0-bytes in them.
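The header rewriting in the third option could look roughly like the sketch below. It is proxy-agnostic: it only shows the transformation on a plain object of response headers, and how you hook it into a particular proxy is up to you. The function names and the `KNOWN_TYPES` list are assumptions of mine, not something PhantomJS defines.

```javascript
// Content types that are passed through unchanged. This list is an
// assumption for illustration; adjust it to what your pages actually use.
var KNOWN_TYPES = [
    'text/html', 'application/xhtml+xml', 'text/plain', 'text/css',
    'text/javascript', 'application/javascript', 'application/json', 'image/'
];

// True when the Content-Type value starts with one of the known types.
function isKnownType(value) {
    var type = String(value).toLowerCase();
    for (var i = 0; i < KNOWN_TYPES.length; i++) {
        if (type.indexOf(KNOWN_TYPES[i]) === 0) {
            return true;
        }
    }
    return false;
}

// Return a copy of `headers` with Content-Disposition removed and any
// unknown Content-Type replaced by text/plain (parameters such as the
// charset are kept).
function rewriteResponseHeaders(headers) {
    var out = {};
    for (var key in headers) {
        if (!headers.hasOwnProperty(key)) {
            continue;
        }
        var lower = key.toLowerCase();
        if (lower === 'content-disposition') {
            continue;   // drop the "attachment" hint
        }
        if (lower === 'content-type' && !isKnownType(headers[key])) {
            var value = String(headers[key]);
            var semi = value.indexOf(';');
            out[key] = 'text/plain' + (semi === -1 ? '' : value.slice(semi));
        } else {
            out[key] = headers[key];
        }
    }
    return out;
}
```

Applied to the headers from the question, `text/csv; charset=UTF-8` would become `text/plain; charset=UTF-8` and the Content-Disposition header would disappear.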

The first of these options (PhantomJS + upstream proxy) is made easier if the upstream proxy can monitor the Accept header that PhantomJS sends to the remote site. At least in PhantomJS version 2.1.1, main requests have Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, stylesheet requests have Accept: text/css,*/*;q=0.1, and all other requests (images, scripts, XMLHttpRequest) default to Accept: */* although this can be overridden by sites that use XMLHttpRequest.setRequestHeader(). Therefore if the upstream proxy sees a request with an Accept header containing text/html, and passing on this request to the server results in a CSV file or other non-HTML document, then there's a good chance this is the one to save.
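As a rough sketch, the Accept-header heuristic described above can be written as a small predicate that a proxy script might call for each request/response pair. The function names are hypothetical, and headers are assumed to arrive as plain objects mapping header names to values; it is only a heuristic, so some false positives are to be expected.

```javascript
// Look up a header value by case-insensitive name in a plain object.
function getHeader(headers, name) {
    var wanted = name.toLowerCase();
    for (var key in headers) {
        if (headers.hasOwnProperty(key) && key.toLowerCase() === wanted) {
            return String(headers[key]);
        }
    }
    return '';
}

// True when the request looked like a main-page navigation (its Accept
// header mentions text/html) but the server answered with a non-HTML type.
function isLikelyDownload(requestHeaders, responseHeaders) {
    var accept = getHeader(requestHeaders, 'Accept').toLowerCase();
    var type = getHeader(responseHeaders, 'Content-Type').toLowerCase();
    var askedForHtml = accept.indexOf('text/html') !== -1;
    var gotHtml = type.indexOf('text/html') === 0 ||
                  type.indexOf('application/xhtml+xml') === 0;
    return askedForHtml && type !== '' && !gotHtml;
}
```

For the headers shown in the question, a main request answered with `text/csv; charset=UTF-8` would be flagged, while an XMLHttpRequest sent with `Accept: */*` would not.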



Source: https://stackoverflow.com/questions/31564215/how-to-download-a-csv-file-using-phantomjs
