Python Selenium PhantomJS - Extract download link of file that is being downloaded

Question

As the title suggests, I am trying to get the direct link of a file that is being downloaded, using PhantomJS through Selenium in Python 3.7.

The site I am working on is emuparadise.me. I download a ROM file by making a request to this link, after adding a cookie to avoid an "Invalid Referer" error. Once the request is made, browser.current_url shows about:blank, and I know the file has started downloading because I can see the network usage of PhantomJS. After browsing the internet for over 3 hours, I haven't found any way of retrieving the URL of the file being downloaded.

One idea was to create a thread that tracks changes to browser.current_url, but the browser seems to lock up while it makes the request.

Here is my current code:

from selenium import webdriver


browser = webdriver.PhantomJS()
browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
browser.get("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true")

Note that I don't care about downloading the file at all, nor do I know or need to know where it's being saved. I've found the actual link for that specific example file using Firefox, in case you need it for testing. I would also much rather use PhantomJS than the Firefox or Chrome web drivers for such a simple-looking task. Any help would be highly appreciated.

Answer 1:

The PHP page is serving the file. You can't get the path or the real filename on the client side. (Added: now that I re-read your question, I guess you did get the link client side! You learn something new every day. Still, Selenium does not have access to anything beyond the DOM.)

Answer 2:

I finally came up with a solution. Since I knew the download URL had to be somewhere in the headers of my request, I looked for a way to view them in PhantomJS. It turned out to be pretty easy: all I did was change the log level from INFO (the default) to DEBUG, and the headers appeared in the log file under the events page.onResourceRequested and page.onResourceReceived. After making the request, I parse the log file, look for the latter event, and pull the URL out of it. Here's the complete code:

from selenium import webdriver
from json import loads


def get_direct_url_for_game(url):
    # DEBUG log level makes the resource events show up in ghostdriver.log
    browser = webdriver.PhantomJS(service_args=["--webdriver-loglevel=DEBUG"])
    try:
        browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
        browser.get(url)

        with open('ghostdriver.log') as logs:
            for line in logs:
                # The parsing relies on four " - "-separated fields per line,
                # with the event name third and a JSON payload last
                parts = line.split(" - ", 3)
                if len(parts) != 4:
                    continue
                event, event_data = parts[2], parts[3]
                if event == "page.onResourceReceived":
                    event_data = loads(event_data)
                    if event_data.get('contentType') == "application/octet-stream":
                        return event_data['url']
        return None
    finally:
        # Always shut PhantomJS down, whether or not a URL was found
        browser.quit()


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
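The log-scanning step can also be pulled out of the browser code so it can be checked without PhantomJS. What follows is only a minimal sketch, assuming each relevant log line has four " - "-separated fields (event name third, JSON payload last), which is what the parsing above relies on. The helper name extract_download_url and the sample lines are made up for illustration and are not real GhostDriver output.

```python
import json


def extract_download_url(log_lines, content_type="application/octet-stream"):
    """Return the URL of the first received resource with the given content type."""
    for line in log_lines:
        # maxsplit=3 keeps the JSON payload intact even if it contains " - "
        parts = line.rstrip("\n").split(" - ", 3)
        if len(parts) != 4:
            continue  # not in the assumed four-field shape
        event, payload = parts[2], parts[3]
        if event != "page.onResourceReceived":
            continue
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # payload was not valid JSON, skip the line
        if isinstance(data, dict) and data.get("contentType") == content_type:
            return data.get("url")
    return None


# Hypothetical sample lines, shaped only to match the assumption above
sample_log = [
    'DEBUG - ts1 - page.onResourceRequested - {"url": "https://example.com/get?id=1"}',
    'DEBUG - ts2 - page.onResourceReceived - {"url": "https://example.com/page", "contentType": "text/html"}',
    'DEBUG - ts3 - page.onResourceReceived - {"url": "https://cdn.example.com/a - b.bin", "contentType": "application/octet-stream"}',
    'a line without the expected fields',
]

print(extract_download_url(sample_log))
```

With the real log, this could be called as extract_download_url(open('ghostdriver.log')), since a file object yields its lines one at a time.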

EDIT:

I later found a much simpler and more elegant way of achieving exactly the same thing, using the head function from requests. It sends a HEAD request, which asks only for the HTTP headers of the URL (hence the name), and we still pass in the same cookie. We allow redirects, since following them is exactly what we want, and the final URL is then available as the url attribute of the response.

Here's a look at it:

from requests import head


def get_direct_url_for_game(url):
    # HEAD fetches only the headers; following redirects leads to the file's real URL
    response = head(url, allow_redirects=True, cookies={'refexception': '1'})
    return response.url


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
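If you also want to see which hops the request went through, the response's history attribute in requests holds the intermediate redirect responses. The variant below is just a sketch under a few assumptions of my own: the function name resolve_final_url, the timeout value, and the http_head parameter (there only so the function can be tried with a stub instead of a live request) are not part of the original solution.

```python
def resolve_final_url(url, cookies=None, timeout=10, http_head=None):
    """Follow redirects with a HEAD request; return (final_url, redirect_chain)."""
    if http_head is None:
        # Imported lazily so the helper can be exercised with a stub
        import requests
        http_head = requests.head

    response = http_head(url, allow_redirects=True, cookies=cookies, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of returning an error page URL

    # response.history lists the intermediate redirect responses, oldest first
    chain = [hop.url for hop in response.history] + [response.url]
    return response.url, chain
```

For the example in the question it would be called as resolve_final_url("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true", cookies={'refexception': '1'}).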

Source: https://stackoverflow.com/questions/56281658/python-selenium-phantomjs-extract-download-link-of-file-that-is-being-download

Tags

python

python-3.x

selenium

selenium-webdriver

phantomjs