问题
We use a custom scraper that have to take a separate website for a language (this is an architecture limitation). Like site1.co.uk, site1.es, site1.de etc.
But we need to parse a website with many languages, separated by url - like site2.com/en, site2.com/de, site2.com/es and so on.
I thought about MITMProxy: I could redirect all requests this way:
en.site2.com/* --> site2.com/en
de.site2.com/* --> site2.com/de
...
I have written a small script which simply takes URLs and rewrites them:
class MyMaster(flow.FlowMaster):
def handle_request(self, r):
url = r.get_url()
# replace URLs
if 'blabla' in url:
r.set_url(url.replace('something', 'another'))
But the target host generates 301 redirect with the response from the webserver - 'the page has been moved here' and the link to the site2.com/en
It worked when I played with URL rewriting, i.e. site2.com/en --> site2.com/de. But for different hosts (subdomain and the root domain, to be precise), it does not work.
I tried to replace the Host header in the handle_request method from above:
for key in r.headers.keys():
if key.lower() == 'host':
r.headers[key] = ['site2.com']
also I tried to replace the Referrer - all of that didn't help.
How can I finally spoof that request from the subdomain to the main domain? If it generates a HTTP(s) client warning it's ok since we need that for the scraper (and the warnings there can be turned off), not the real browser.
Thanks!
回答1:
You need to replace the content of the response and craft the header with just a few fields. Open a new connection to the redirected url and craft your response :
def handle_request(self, flow):
newUrl = <new-url>
retryCount = 3
newResponse = None
while True:
try:
newResponse = requests.get(newUrl) # import requests
except:
if retryCount == 0:
print 'Cannot reach new url ' + newUrl
traceback.print_exc() # import traceback
return
retryCount -= 1
continue
break
responseHeaders = Headers() # from netlib.http import Headers
if 'Date' in newResponse.headers:
responseHeaders['Date'] = str(newResponse.headers['Date'])
if 'Connection' in newResponse.headers:
responseHeaders['Connection'] = str(newResponse.headers['Connection'])
if 'Content-Type' in newResponse.headers:
responseHeaders['Content-Type'] = str(newResponse.headers['Content-Type'])
if 'Content-Length' in newResponse.headers:
responseHeaders['Content-Length'] = str(newResponse.headers['Content-Length'])
if 'Content-Encoding' in newResponse.headers:
responseHeaders['Content-Encoding'] = str(inetResponse.headers['Content-Encoding'])
response = HTTPResponse( # from libmproxy.models import HTTPResponse
http_version='HTTP/1.1',
status_code=200,
reason='OK',
headers=responseHeaders,
content=newResponse.content)
flow.reply(response)
来源:https://stackoverflow.com/questions/24886372/mitmproxy-smart-url-replacement