问题
I'm working on an automated webscrapper for a Restaurant website, but I'm having an issue. The said website uses cloudlfare's anti-bot security, which I would like to bypass, not the Under-Attack-Mode but a captcha test that only triggers when it detects a non-American IP or a bot. I'm trying to bypass it as cloudflare's security doesn't trigger when I clear cookies, disable javascript or when I use an American proxy.
Knowing this, I tried using python's requests library as such:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)
But this ends up triggering Cloudflare, no matter the proxy I use.
HOWEVER when using urllib.request with the same headers as such:
import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))
When ran with the same American IP, this time it does not trigger Cloudflare's security, even though it uses the same headers and IP used with the requests library.
So I'm trying to figure out what exactly is triggering cloudflare in the requests library that isn't in the urllib library.
While the typical answer would be "Just use urllib then", I'd like to figure out what exactly is different with requests, and how I could fix it, first off to understand how requests works and cloudflare detects bots, but also so that I may apply any fix I can find to other httplibs (notably asynchronous ones)
EDIT N°2: Progress so far:
Thanks to @TuanGeek we can now bypass the cloudflare block using requests as long as we connect directly to the host IP rather than the domain name (for some reason, the DNS redirection with requests triggers cloudflare, but urllib doesn't):
import requests
from collections import OrderedDict
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
'Host': "grimaldis.myguestaccount.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text
To note: trying to access via http (rather than https with the verify variable set to False) will trigger cloudflare's block
Now this is great, but unfortunately my final goal of making this work asynchronously with the httplib HTTPX still isn't met, as using the following code, the cloudflare block is still triggered even though we're connecting directly through the Host IP, with proper headers, and with verify set to False:
import trio
import httpx
import socket
from collections import OrderedDict
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
'Host': "grimaldis.myguestaccount.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
async def asks_worker():
async with httpx.AsyncClient(headers=headers, verify=False) as s:
r = await s.get(f'https://{address}/guest/accountlogin')
print(r.text)
async def run_task():
async with trio.open_nursery() as nursery:
nursery.start_soon(asks_worker)
trio.run(run_task)
EDIT N°1: For additional details, here's the raw http request from urllib and from requests
REQUESTS:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR
URLLIB:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
回答1:
This really peeked my interests. The requests
solution that I was able to get working.
Solution
Finally narrow down the problem. When you use requests it uses urllib3 connection pool. There seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution:
import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
'Accept-Encoding': 'gzip, deflate, br',
'Host': "grimaldis.myguestaccount.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
Technical Background
So I ran both method through Burp Suite to compare the requests. Below are the raw dumps of the requests
using requests
GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1
using urllib
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1
The difference is the ordering of the headers. The difference in the dnt
capitalization is not actually the problem.
So I was able to make a successful request with the following raw request:
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
So the Host
header has be sent above User-Agent
. So if you want to continue to to use requests. Consider using a OrderedDict to ensure the ordering of the headers.
回答2:
After some debugging, and thanks to the answers of @TuanGeek, we've found out the issue with the requests library seems to come from a DNS issue on requests' part when dealing with cloudflare, a simple fix to this issue is connecting directly to the host IP as such:
import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
'Accept-Encoding': 'gzip, deflate, br',
'Host': "grimaldis.myguestaccount.com",
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
Now, this fix didn't work when working with the httplib HTTPX, However I've found where the issue stems from.
The issue comes from the h11 library (used by HTTPX to handle HTTP/1.1 requests), while urllib would automatically fix the letter case of headers, h11 took a different approach by lowercasing every header. While in theory this shouldn't cause any issues, as servers should handle headers in a case-insensitive manner (and in a lot of cases they do), the reality is that HTTP is Hard™️ and services such as Cloudflare don't respect RFC2616 and requires headers to be properly capitalized.
Discussions about capitalization have been going for a while over at h11:
https://github.com/python-hyper/h11/issues/31
And have "recently" started to pop up over on HTTPX's repo as well:
https://github.com/encode/httpx/issues/538
https://github.com/encode/httpx/issues/728
Now the unsatisfactory answer to the issue between Cloudflare and HTTPX is that until something is done over on h11's side (or until Cloudflare miraculously starts respecting RFC2616), not much can be changed to how HTTPX and Cloudflare handle header capitalization.
Either use a different HTTPLIB such as aiohttp or requests-futures, try forking and patching the header capitalization with h11 yourself, or wait and hope for the issue to be dealt with properly by the h11 team.
来源:https://stackoverflow.com/questions/62684468/pythons-requests-triggers-cloudflares-security-while-urllib-does-not