Handling pound sign (#) in python requests

好久不见. Submitted on 2021-02-11 06:16:07

Question


I'm using requests to compile a custom URL and one parameter includes a pound sign. Can anyone explain how to pass the parameter without encoding the pound sign?

This returns the correct CSV file

import io
import pandas as pd
import requests
results_url = 'https://baseballsavant.mlb.com/statcast_search/csv?all=true&hfPT=&hfAB=&hfBBT=&hfPR=&hfZ=&stadium=&hfBBL=&hfNewZones=&hfGT=R%7C&hfC=&hfSea=2019%7C&hfSit=&player_type=batter&hfOuts=&opponent=&pitcher_throws=&batter_stands=&hfSA=&game_date_gt=&game_date_lt=&hfInfield=&team=&position=&hfOutfield=&hfRO=&home_road=&hfFlag=&hfPull=&metric_1=&hfInn=&min_pitches=0&min_results=0&group_by=name&sort_col=pitches&player_event_sort=h_launch_speed&sort_order=desc&min_abs=0&type=#results'
results = requests.get(results_url, timeout=30).content
results_df = pd.read_csv(io.StringIO(results.decode('utf-8')))

This DOES NOT

import requests

URL = 'https://baseballsavant.mlb.com/statcast_search/csv?'

def _get_statcast(params):
    # params holds the query parameters, including 'type': '#results'
    _get = requests.get(URL, params=params, timeout=30)
    _get.raise_for_status()
    return _get.content

The issue seems to be that when passing '#results' through requests, anything after '#' gets ignored, which causes the wrong CSV to be downloaded. If anyone has thoughts on other ways of going about this, I would appreciate it.

EDIT2: Also asked this on the python forum https://python-forum.io/Thread-Handling-pound-sign-within-custom-URL?pid=75946#pid75946


Answer 1:


Basically, anything after a literal pound-sign in the URL is not sent to the server. This applies to browsers and requests.
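
The split can be seen without making a request. Here is a minimal sketch using the standard library's urllib.parse; the helper name split_fragment is hypothetical and exists only for illustration:

```python
from urllib.parse import urlsplit


def split_fragment(url):
    """Return (URL without its fragment, fragment) for the given URL."""
    parts = urlsplit(url)
    # Everything after the first '#' is parsed as the fragment, which an
    # HTTP client keeps to itself instead of sending it to the server.
    without_fragment = parts._replace(fragment="").geturl()
    return without_fragment, parts.fragment


print(split_fragment("https://httpbin.org/anything?type=#results"))
print(split_fragment("https://httpbin.org/anything/type=#results"))
```

In both cases the second element is the fragment (results) and the first is the part a client would actually request.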

The format of your URL suggests that the type=#results part is actually a query parameter.

requests will automatically encode the query parameters, while the browser won't. Below are various queries and what the server receives in each case:


URL parameter in the browser

When using the pound-sign in the browser, anything after the pound-sign is not sent to the server:

https://httpbin.org/anything/type=#results

Returns:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Cache-Control": "max-age=0", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything/type="
}
  • The URL received by the server is https://httpbin.org/anything/type=.
  • The page being requested is called type= which does not seem to be correct.

Query parameter in the browser

The <key>=<value> format suggests it might be a query parameter which you are passing. Still, anything after the pound-sign is not sent to the server:

https://httpbin.org/anything?type=#results

Returns:

{
  "args": {
    "type": ""
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type="
}
  • The URL received by the server is https://httpbin.org/anything?type=.
  • The page being requested is called anything.
  • An argument type without a value is received.

Encoded query parameter in the browser

https://httpbin.org/anything?type=%23results

Returns:

{
  "args": {
    "type": "#results"
  }, 
  "data": "", 
  "files": {}, 
  "form": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "en-GB,en;q=0.9,en-US;q=0.8,de;q=0.7", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "*redacted*"
  }, 
  "json": null, 
  "method": "GET", 
  "origin": "*redacted*", 
  "url": "https://httpbin.org/anything?type=%23results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%23results.
  • The page being requested is called anything.
  • An argument type with a value of #results is received.
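
The encoding step can be reproduced offline. This is a minimal sketch with the standard library, assuming the server decodes the query string roughly the way urllib.parse.parse_qs does:

```python
from urllib.parse import parse_qs, quote, urlsplit

# Percent-encode the value so '#' no longer starts a fragment.
encoded = quote("#results", safe="")
url = f"https://httpbin.org/anything?type={encoded}"

# Decode the query string roughly as a server would.
args = parse_qs(urlsplit(url).query)

print(url)
print(args)
```

parse_qs returns a list per key, so the decoded value appears as a one-element list rather than a bare string.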

Python requests with URL parameter

requests will also not send anything after the pound-sign to the server:

import requests

r = requests.get('https://httpbin.org/anything/type=#results')
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything/type=#results
{
    "args": {},
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything/type="
}
  • The URL received by the server is https://httpbin.org/anything/type=.
  • The page being requested is called type= which does not seem to be correct.

Python requests with query parameter

requests automatically encodes query parameters:

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '#results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%23results
{
    "args": {
        "type": "#results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%23results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%23results.
  • The page being requested is called anything.
  • An argument type with a value of #results is received.

Python requests with doubly-encoded query parameter

If you manually encode the query parameter and then pass it to requests, it will encode the already encoded query parameter again:

import requests

r = requests.get('https://httpbin.org/anything', params={'type': '%23results'})
print(r.url)
print(r.json())

Returns:

https://httpbin.org/anything?type=%2523results
{
    "args": {
        "type": "%23results"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.21.0"
    },
    "json": null,
    "method": "GET",
    "origin": "*redacted*",
    "url": "https://httpbin.org/anything?type=%2523results"
}
  • The URL received by the server is https://httpbin.org/anything?type=%2523results.
  • The page being requested is called anything.
  • An argument type with a value of %23results is received.
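
The double encoding can be illustrated with urllib.parse.urlencode. This is a standard-library sketch of the effect, on the assumption (consistent with the output above) that requests encodes a params dict in a comparable way:

```python
from urllib.parse import parse_qs, urlencode

once = urlencode({"type": "#results"})     # raw value, encoded once
twice = urlencode({"type": "%23results"})  # pre-encoded value, encoded again

print(once)
print(twice)
# Decoding undoes only one layer, so the pre-encoded value keeps its %23.
print(parse_qs(twice))
```

The '%' of the pre-encoded value is itself encoded as %25, which is where %2523 comes from.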



Answer 2:


I've only gotten through one trial but hopefully have a solution. Instead of passing '#results' through params, I started a session with the base URL plus all other params, joined that with '#results', and then ran it through a second get.

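import requests

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
results_url = '&type=#results&'

def _get_statcast_results(params):
    s = requests.Session()
    # First request: let requests build the URL from all other params
    _get = s.get(statcast_url, params=params, timeout=30, allow_redirects=True)

    # Second request: append the literal '#results' part to that URL
    new_url = _get.url + results_url
    data = s.get(new_url, timeout=30)

    return data.content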

Still need to run through some more trials but I think this should work. Thanks to everyone who chimed in. Even though I didn't get a direct answer the responses still helped a ton.




Answer 3:


The answer by Cloudomation provides a lot of interesting information but I think it may not be what you are looking for. Assuming this identical thread in the python forum is written by you as well, read on:

From the information you provided it seems that type=#results is being used to filter the original csv and return only parts of the data.
If this is the case, the type= part is not really necessary (try the URL without it and see that you get the same results).

I'll explain:

The # symbol in URLs is called a fragment identifier and in different kinds of pages it serves different purposes. In text/csv pages, it serves to filter the CSV table by column, row or some combination of the two. You can read more about it here.

In your case, results could be a query parameter that is used to filter the csv table in a custom way.

Unfortunately, as illustrated in Cloudomation's answer, the fragment is not sent to the server, so you will not be able to pass it via a Python requests parameter in the way you tried.

You could try to access it in JavaScript as suggested here or simply download the entire (unfiltered) CSV table and filter it yourself.

There are many ways to do this easily and efficiently in python. Look here for more information, or if you need more control you can import the CSV into a pandas dataframe.
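
As a rough sketch of filtering the table yourself with only the standard library: the sample text, the column names player_name and pitches, and the helper filter_rows are all made up for illustration and are not taken from the real CSV.

```python
import csv
import io

# Made-up sample standing in for the downloaded CSV text.
csv_text = "player_name,pitches\nA,120\nB,45\nC,300\n"


def filter_rows(text, column, minimum):
    """Return rows whose integer value in `column` is at least `minimum`."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if int(row[column]) >= minimum]


rows = filter_rows(csv_text, "pitches", 100)
print([row["player_name"] for row in rows])
```

With the real download you would pass the decoded response text instead of csv_text and use whichever column actually carries the value you want to filter on.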


EDIT:

I see you found a workaround by joining the strings and passing a second request. Since this works, you could probably get away with converting the params to a string (as suggested here). If it does what you're after, this would be a more efficient and perhaps slightly more elegant solution:

import requests

statcast_url = 'https://baseballsavant.mlb.com/statcast_search/csv?'
params = {'key1': 'value1', 'key2': 'value2'}  # sample params dict

def _get_statcast_results(params):
    # convert params to a string - alternatively you can use %-formatting
    params_str = "&".join(f"{k}={v}" for k, v in params.items())

    s = requests.Session()
    data = s.get(statcast_url, params=params_str, timeout=30)

    return data.content
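
If the other values still need percent-encoding, one option is to build the string with urllib.parse.urlencode and mark '#' as safe. This is only a sketch, assuming a literal '#' is really what you want in the URL; the sample keys are taken from the URL in the question. Keep in mind that everything after the literal '#' is then a fragment rather than part of the query.

```python
from urllib.parse import urlencode

params = {"hfGT": "R|", "hfSea": "2019|", "type": "#results"}

# safe="#" leaves the pound sign literal while other reserved
# characters (here '|') are still percent-encoded.
params_str = urlencode(params, safe="#")

print(params_str)
```

The resulting string could then be passed as params in place of the hand-joined one.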



Source: https://stackoverflow.com/questions/55435400/handling-pound-sign-in-python-requests