Python crawler does not work properly

跟風遠走 submitted on 2019-12-25 09:27:32

Question


I wrote a Python crawler to download MIDI files from freemidi.org. Looking at the request headers in Chrome, I found that to download the file from https://freemidi.org/getter-20225 (referred to as "getter-20225" below), the "Referer" header had to be https://freemidi.org/download-20225 (referred to as "download-20225" below). I set the headers in Python like this:

headers = {
    'Referer': 'https://freemidi.org/download-20225',
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

These matched the request headers I had seen in Chrome. I then tried to download the file with this line of code:

midi = requests.get(url, headers=headers).content

However, it did not work. Instead of the MIDI file, it downloaded the HTML page for "download-20225". I later found that accessing "getter-20225" directly also took me to "download-20225". I suspect this means my headers were wrong, so the site sent me to the other page instead of starting the download.

I'm quite new to writing Python crawlers, so could someone help me find what went wrong with the program?


Answer 1:


It looks like the problem here is that the page with the MIDI file (e.g. "getter-20225") redirects you back to the song page (e.g. "download-20225") after serving the song. By default, requests follows redirects and returns only the content of the final page in the chain, which is why you end up with the HTML of "download-20225".

You can set the allow_redirects parameter to False to have requests return the content from the "getter" page (i.e. the midi file):

midi = requests.get(url, headers=headers, allow_redirects=False)
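Before trusting the body, it can help to look at the status code and headers of the unfollowed response. The following is only a minimal sketch, not something verified against freemidi.org: `fetch_without_redirects` and `summarize` are hypothetical helper names, the shortened default User-Agent and the 30-second timeout are arbitrary choices, and `session` is assumed to be a `requests.Session`.

```python
def fetch_without_redirects(session, url, referer, user_agent="Mozilla/5.0"):
    """GET `url` with a Referer header and do not follow redirects.

    `session` is assumed to be a requests.Session (or any object whose
    .get method accepts the same keyword arguments).
    """
    headers = {"Referer": referer, "User-Agent": user_agent}
    return session.get(url, headers=headers, allow_redirects=False, timeout=30)


def summarize(resp):
    """Collect the fields that show whether the server tried to redirect."""
    return {
        "status": resp.status_code,                     # 3xx means a redirect was requested
        "location": resp.headers.get("Location"),       # redirect target, if any
        "content_type": resp.headers.get("Content-Type"),
        "size": len(resp.content),                      # body size in bytes
    }


# Usage (performs a real network request, so it is left commented out):
# import requests
# with requests.Session() as session:
#     resp = fetch_without_redirects(
#         session,
#         "https://freemidi.org/getter-20225",
#         "https://freemidi.org/download-20225",
#     )
#     print(summarize(resp))
```

If the summary shows a 3xx status with a `Location` header but an empty body, the file is not being sent with the redirect, and the headers or cookies probably need another look.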

Note that if you want to write the MIDI file to disk, you will need to open your target file in binary mode, since the response's content attribute holds raw bytes rather than text.

with open('example.mid', 'wb') as ex:
    ex.write(midi.content)
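Since the original symptom was an HTML page saved in place of the song, it may be worth checking the payload before writing it. A Standard MIDI File begins with the four bytes `MThd`, so a rough check could look like the sketch below. `looks_like_midi` and `save_midi` are hypothetical helper names, and the check only looks at the first four bytes; it does not validate the rest of the file.

```python
from pathlib import Path

MIDI_MAGIC = b"MThd"  # chunk ID that opens a Standard MIDI File


def looks_like_midi(data: bytes) -> bool:
    """Return True if the payload starts with the MIDI header chunk ID."""
    return data.startswith(MIDI_MAGIC)


def save_midi(data: bytes, path: str) -> Path:
    """Write `data` to `path` in binary form, refusing non-MIDI payloads."""
    if not looks_like_midi(data):
        # Show the first few bytes so an HTML page is easy to recognise.
        raise ValueError(
            f"payload does not start with {MIDI_MAGIC!r}; got {data[:16]!r}"
        )
    target = Path(path)
    target.write_bytes(data)  # write_bytes opens the file in binary mode
    return target
```

With the response from above, this would be called as `save_midi(midi.content, 'example.mid')`; a `ValueError` then signals that the server returned something other than the song.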


Source: https://stackoverflow.com/questions/44392748/python-crawler-does-not-work-properly
