Question
As the title suggests, I am trying to get the direct link of a file that is being downloaded, using PhantomJS through Selenium in Python 3.7.
The site I am working on is emuparadise.me. I download a ROM file with a request to this link after adding a cookie to avoid the "Invalid Referer" error. Once the request is made, browser.current_url shows about:blank, and I know that the file has started downloading from checking PhantomJS's network usage. After browsing the internet for over 3 hours, I haven't found any way of retrieving the URL of the downloading file.
One idea was to create a thread that tracks changes to browser.current_url, but browser seems to lock up while the request is being made.
Here is my current code:
from selenium import webdriver
browser = webdriver.PhantomJS()
browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
browser.get("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true")
Note that I don't care about downloading the file itself, and I neither know nor need to know where it is being saved. I found the actual link for this specific example file using Firefox, in case you need it for testing. I would also much prefer PhantomJS over the Firefox or Chrome web drivers for such a simple-looking task. Any help would be highly appreciated.
Answer 1:
The PHP page is serving the file. You can't get the path or real filename on the client side. (Added: now that I re-read your question, I guess you did get the link client side! You learn something new every day. But Selenium does not have access beyond the DOM.)
Answer 2:
I finally arrived at a solution. Since I knew the download URL had to be somewhere in the headers of my request, I searched for a way to view them in PhantomJS. It turned out to be pretty easy. All I did was change the log level from INFO (the default) to DEBUG, and the headers appeared in the log file under the events page.onResourceRequested and page.onResourceReceived. After making the request, I parse the log file looking for the latter event and scrape out the URL. Here's the complete code:
from json import loads

from selenium import webdriver


def get_direct_url_for_game(url):
    # DEBUG logging makes the resource events show up in ghostdriver.log.
    browser = webdriver.PhantomJS(service_args=["--webdriver-loglevel=DEBUG"])
    # Cookie that avoids the "Invalid Referer" error.
    browser.add_cookie({'name': 'refexception', 'value': 1, 'domain': '.emuparadise.me', 'path': '/'})
    browser.get(url)
    direct_download_url = None
    with open('ghostdriver.log') as logs:
        for line in logs:
            # Split at most three times so " - " inside the JSON payload is left alone.
            parts = line.split(" - ", 3)
            if len(parts) != 4:
                continue  # not an event line
            _, _, event, event_data = parts
            if event == "page.onResourceReceived":
                event_data = loads(event_data)
                if event_data.get('contentType') == "application/octet-stream":
                    direct_download_url = event_data['url']
    browser.quit()
    return direct_download_url


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
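The log-parsing step can also be tried on its own, without launching PhantomJS. The sketch below is only an illustration: the helper name find_download_url and the sample lines are made up, and the line layout ("prefix - prefix - event - JSON") is assumed from the way the code above splits each line, not taken from a real ghostdriver.log.

```python
import json


def find_download_url(log_lines, content_type="application/octet-stream"):
    """Return the URL of the last received resource with the given content type.

    Assumes each event line looks like "<a> - <b> - <event> - <json>",
    which is the shape implied by the split in the answer's code.
    """
    found = None
    for line in log_lines:
        # Split at most three times so " - " inside the JSON payload survives.
        parts = line.rstrip("\n").split(" - ", 3)
        if len(parts) != 4 or parts[2] != "page.onResourceReceived":
            continue
        try:
            data = json.loads(parts[3])
        except ValueError:
            continue  # payload was not valid JSON, skip the line
        if isinstance(data, dict) and data.get("contentType") == content_type:
            found = data.get("url")
    return found


# Hypothetical sample lines, invented for illustration only.
sample_lines = [
    '[DEBUG - 2019-05-23T00:00:00.000Z] Session [abc] - page.onResourceRequested - {"url": "https://example.com/get"}',
    '[DEBUG - 2019-05-23T00:00:01.000Z] Session [abc] - page.onResourceReceived - '
    '{"contentType": "application/octet-stream", "url": "https://example.com/files/game.zip"}',
    'some unrelated line',
]
print(find_download_url(sample_lines))
```

Keeping the parser separate from the browser code makes it possible to check it against a saved log file, for example with find_download_url(open('ghostdriver.log')).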
EDIT:
I actually found a much simpler and more elegant way of achieving the exact same thing, using the head function from requests. It makes a request for only the HTTP headers of the URL, hence the name, and we still pass in the same cookie. We allow redirects, since following them is exactly what we want, and the final URL is then available in the url attribute of the response.
Here's a look at it:
from requests import head


def get_direct_url_for_game(url):
    # HEAD fetches only the headers; following redirects leads to the file's real URL.
    response = head(url, allow_redirects=True, cookies={'refexception': '1'})
    return response.url


print(get_direct_url_for_game("https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true"))
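To see roughly what allowing redirects does, here is a minimal standard-library sketch that issues HEAD requests and follows each Location header by hand. It is not how requests is implemented; the function name resolve_redirects and the max_hops limit are assumptions made for illustration, and it presumes the server answers HEAD with the same redirect it would send for GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_redirects(url, cookies=None, max_hops=10):
    """Follow redirects with HEAD requests and return the final URL."""
    cookie_header = "; ".join(f"{k}={v}" for k, v in (cookies or {}).items())
    headers = {"Cookie": cookie_header} if cookie_header else {}
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path, headers=headers)
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url
    raise RuntimeError("too many redirects")


# Example call (performs real network requests, so it is left commented out):
# resolve_redirects(
#     "https://www.emuparadise.me/roms/get-download.php?gid=154652&test=true",
#     cookies={"refexception": "1"},
# )
```

In practice the requests version above is shorter and handles more edge cases; this sketch only makes the redirect-following step visible.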
Source: https://stackoverflow.com/questions/56281658/python-selenium-phantomjs-extract-download-link-of-file-that-is-being-download