Question
I'm using requests to compile a custom URL and one parameter includes a pound sign. Can anyone explain how to pass the parameter without encoding the pound sign?
This returns the correct CSV file:
import io
import pandas as pd
import requests
results_url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=#results'
results = requests.get(results_url, timeout=30).content
results_df = pd.read_csv(io.StringIO(results.decode('utf-8')))
This does NOT:
from requests import get

URL = 'https://baseballsavant.mlb.com/statcast_search/csv?'

def _get_statcast(params):
    # params: dict of query parameters for the search
    _get = get(URL, params=params, timeout=30)
    _get.raise_for_status()
    return _get.content
The issue seems to be that when passing '#results' through requests, anything after '#' gets ignored, which causes the wrong CSV to be downloaded. If anyone has thoughts on other ways of going about this, I would appreciate it.
EDIT2: Also asked this on the python forum https://python-forum.io/Thread-Handling-pound-sign-within-custom-URL?pid=75946#pid75946
Answer 1:
Basically, anything after a literal pound-sign in the URL is not sent to the server. This applies to browsers and requests.
The format of your URL suggests that the type=#results part is actually a query parameter. requests will automatically encode the query parameters, while the browser won't. Below are various queries and what the server receives in each case:
URL parameter in the browser
When using the pound-sign in the browser, anything after the pound-sign is not sent to the server:
https://httpbin.org/anything/type=#results
Returns:
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7",
"Cache-Control": "max-age=0",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "*redacted*"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything/type="
}
- The URL received by the server is https://httpbin.org/anything/type=.
- The page being requested is called type=, which does not seem to be correct.
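The split between what is sent and what stays on the client can be reproduced offline. This is a minimal sketch, assuming only the standard library's urllib.parse; the variable names are illustrative and not part of the original answer:

```python
from urllib.parse import urlsplit, urlunsplit

url = "https://httpbin.org/anything/type=#results"
parts = urlsplit(url)

# Everything after the first '#' is the fragment, which the client keeps to itself.
fragment = parts.fragment

# Rebuild the URL with an empty fragment: this matches the "url" echoed by httpbin above.
sent_url = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

print(fragment)
print(sent_url)
```

Here the fragment comes out as results and the rebuilt URL ends in type=, consistent with the server response shown above.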
Query parameter in the browser
The <key>=<value> format suggests it might be a query parameter which you are passing. Still, anything after the pound-sign is not sent to the server:
https://httpbin.org/anything?type=#results
Returns:
{
"args": {
"type": ""
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "*redacted*"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything?type="
}
- The URL received by the server is https://httpbin.org/anything?type=.
- The page being requested is called anything.
- An argument type without a value is received.
Encoded query parameter in the browser
https://httpbin.org/anything?type=%23results
Returns:
{
"args": {
"type": "#results"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "*redacted*"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything?type=%23results"
}
- The URL received by the server is https://httpbin.org/anything?type=%23results.
- The page being requested is called anything.
- An argument type with a value of #results is received.
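The encoded form does not have to be typed by hand. A small sketch, assuming the standard library's urllib.parse (quote, urlsplit and parse_qs), of how the pound sign becomes %23 and how a server-side parser would decode it again:

```python
from urllib.parse import parse_qs, quote, urlsplit

value = "#results"

# safe="" forces every reserved character, including '#', to be percent-encoded.
encoded = quote(value, safe="")
url = f"https://httpbin.org/anything?type={encoded}"

parts = urlsplit(url)
# parse_qs decodes percent-escapes, roughly what a server does with the query string.
args = parse_qs(parts.query)

print(url)
print(args)
```

Because the '#' is encoded, the URL has no fragment, and the whole value travels inside the query string.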
Python requests with URL parameter
requests will also not send anything after the pound-sign to the server:
import requests
r = requests.get('https://httpbin.org/anything/type=#results')
print(r.url)
print(r.json())
Returns:
https://httpbin.org/anything/type=#results
{
"args": {},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.21.0"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything/type="
}
- The URL received by the server is https://httpbin.org/anything/type=.
- The page being requested is called type=.
- No arguments are received.
Python requests with query parameter
requests automatically encodes query parameters:
import requests
r = requests.get('https://httpbin.org/anything', params={'type': '#results'})
print(r.url)
print(r.json())
Returns:
https://httpbin.org/anything?type=%23results
{
"args": {
"type": "#results"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.21.0"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything?type=%23results"
}
- The URL received by the server is https://httpbin.org/anything?type=%23results.
- The page being requested is called anything.
- An argument type with a value of #results is received.
Python requests with doubly-encoded query parameter
If you manually encode the query parameter and then pass it to requests, it will encode the already encoded query parameter again:
import requests
r = requests.get('https://httpbin.org/anything', params={'type': '%23results'})
print(r.url)
print(r.json())
Returns:
https://httpbin.org/anything?type=%2523results
{
"args": {
"type": "%23results"
},
"data": "",
"files": {},
"form": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.21.0"
},
"json": null,
"method": "GET",
"origin": "*redacted*",
"url": "https://httpbin.org/anything?type=%2523results"
}
- The URL received by the server is https://httpbin.org/anything?type=%2523results.
- The page being requested is called anything.
- An argument type with a value of %23results is received.
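The double encoding can be reproduced without a network call. This sketch uses urllib.parse.urlencode from the standard library as a stand-in for what happens to a params dict; that requests encodes dict params the same way is an assumption here, although it matches the output shown above:

```python
from urllib.parse import unquote, urlencode

# Raw value: '#' is encoded once.
once = urlencode({"type": "#results"})

# Pre-encoded value: the '%' itself is encoded again as '%25'.
twice = urlencode({"type": "%23results"})

# If a value arrives already encoded, decoding it first avoids the double encoding.
fixed = urlencode({"type": unquote("%23results")})

print(once)
print(twice)
print(fixed)
```

The practical takeaway is to pass the raw value ('#results') in the params dict and leave the encoding to the library.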
Answer 2:
I've only gotten through one trial but hopefully have a solution. Instead of passing '#results' through params, I started a session with the base URL plus all other params, joined that with '#results', and then ran it through a second get.
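import requests

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
results_url = '&type=#results&'

def _get_statcast_results(params):
    s = requests.Session()
    # First request: let requests encode all the other parameters.
    _get = s.get(statcast_url, params=params, timeout=30, allow_redirects=True)
    # Append the literal '&type=#results&' to the resolved URL and request it again.
    new_url = _get.url + results_url
    data = s.get(new_url, timeout=30)
    return data.content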
Still need to run through some more trials but I think this should work. Thanks to everyone who chimed in. Even though I didn't get a direct answer the responses still helped a ton.
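It may help to check what the concatenated URL looks like once it is split into its parts. The sketch below uses urllib.parse.urlsplit; first_url is a shortened, hypothetical stand-in for the value of _get.url and not the real resolved URL:

```python
from urllib.parse import urlsplit

# Hypothetical, shortened stand-in for _get.url after the first request.
first_url = "https://baseballsavant.mlb.com/statcast_search/csv?all=true&group_by=name"
new_url = first_url + "&type=#results&"

parts = urlsplit(new_url)

# The query ends with an empty 'type='; 'results&' lands in the fragment.
print(parts.query)
print(parts.fragment)
```

Under these assumptions the second request would carry type= with an empty value, and results& would stay in the fragment, so it is worth comparing the downloaded CSV against the known-good one.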
Answer 3:
The answer by Cloudomation provides a lot of interesting information, but I think it may not be what you are looking for. Assuming this identical thread in the python forum is written by you as well, read on:
From the information you provided, it seems that type=#results is being used to filter the original CSV and return only parts of the data.
If this is the case, the type= part is not really necessary (try the URL without it and see that you get the same results).
I'll explain:
The # symbol in URLs is called a fragment identifier, and in different kinds of pages it serves different purposes. In text/csv pages, it serves to filter the CSV table by column, row, or some combination of the two. You can read more about it here.
In your case, results could be a query parameter that is used to filter the CSV table in a custom way.
Unfortunately, as illustrated in Cloudomation's answer, the fragment is not available on the server side, so you will not be able to access it via a python request parameter in the way you tried.
You could try to access it in Javascript as suggested here or simply download the entire (unfiltered) CSV table and filter it yourself.
There are many ways to do this easily and efficiently in python. Look here for more information, or if you need more control you can import the CSV into a pandas dataframe.
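As a rough illustration of filtering locally, here is a sketch using only the standard library's csv module. The sample text and the column names (player_name, pitches) are made up for the example and are not taken from the real Statcast CSV:

```python
import csv
import io

# Hypothetical sample standing in for the downloaded CSV text.
raw = "player_name,pitches\nPlayer A,120\nPlayer B,45\nPlayer C,300\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))

# Keep only rows with at least 100 pitches (an arbitrary threshold for illustration).
busy = [row for row in rows if int(row["pitches"]) >= 100]
names = [row["player_name"] for row in busy]

print(names)
```

The same idea carries over to a pandas dataframe with a boolean mask if more control is needed.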
EDIT:
I see you found a workaround by joining the strings and passing a second request. Since this works, you could probably get away with converting the params to string (as suggested here). If it does what you're after this would be a more efficient and perhaps slightly more elegant solution:
import requests

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
params = {'key1': 'value1', 'key2': 'value2'}  # sample params dict

def _get_statcast_results(params):
    # convert params to string - alternatively you can use %-formatting
    params_str = "&".join(f"{k}={v}" for k, v in params.items())
    s = requests.Session()
    data = s.get(statcast_url, params=params_str, timeout=30)
    return data.content
Source: https://stackoverflow.com/questions/55435400/handling-pound-sign-in-python-requests