How do I remove a query from a url?

[亡魂溺海] posted on 2019-12-03 11:08:40

See the urlparse module (urllib.parse in Python 3).

Example code:

from urlparse import urlparse
o = urlparse('http://url.something.com/bla.html?querystring=stuff')

url_without_query_string = o.scheme + "://" + o.netloc + o.path

Example output:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urlparse import urlparse
>>> o = urlparse('http://url.something.com/bla.html?querystring=stuff')
>>> url_without_query_string = o.scheme + "://" + o.netloc + o.path
>>> print url_without_query_string
http://url.something.com/bla.html
>>> 

There is a url_query_cleaner function in the w3lib.url module (used by Scrapy itself) that cleans URLs, keeping only a list of allowed arguments.
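The idea of keeping only an allowed list of arguments can be sketched with the standard library alone. This is a minimal illustration and not w3lib's implementation; the function name keep_query_params and the sample parameter names are assumptions made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def keep_query_params(url, allowed):
    """Return url with only the query parameters whose names are in `allowed`."""
    parts = urlsplit(url)
    # keep_blank_values=True so that a parameter like "a=" is not silently dropped
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in allowed]
    # An empty query string makes urlunsplit omit the "?" entirely
    return urlunsplit(parts._replace(query=urlencode(kept)))


cleaned = keep_query_params(
    "http://url.something.com/bla.html?id=7&session=abc123", {"id"}
)
print(cleaned)
```

Note that urlencode re-encodes the surviving values, so unusual percent-escapes in the original query may come back in a normalized form.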

Provide some code, so we can help you.

If you are using CrawlSpider and Rules with SgmlLinkExtractor, pass a custom function as the process_value parameter of the SgmlLinkExtractor constructor.

See documentation for BaseSgmlLinkExtractor

def delete_random_garbage_from_url(url):
    cleaned_url = ... # process url somehow
    return cleaned_url

Rule(
    SgmlLinkExtractor(
         # ... your allow, deny parameters, etc
         process_value=delete_random_garbage_from_url,
    )
)
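What the cleaning function does depends on the site. As one possible sketch, assuming the unwanted values live in known query parameters, a function like the following could be passed as process_value. The name drop_noisy_params and the parameter names in NOISY_PARAMS are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical names of query parameters that change on every request
NOISY_PARAMS = {"sessionid", "rand"}


def drop_noisy_params(url):
    """Return url without the query parameters listed in NOISY_PARAMS."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in NOISY_PARAMS]
    return parts._replace(query=urlencode(kept)).geturl()


print(drop_noisy_params("http://example.com/item?id=3&rand=0.8812"))
```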

You can use the urllib.parse.urlsplit() function. The result is a structured parse result, a named tuple with added functionality.

Use the namedtuple._replace() method to alter the parsed result values, then use the SplitResult.geturl() method to get a URL string again.

To remove the query string, set the query value to None:

from urllib.parse import urlsplit

updated_url = urlsplit(url)._replace(query=None).geturl()

Demo:

>>> from urllib.parse import urlsplit
>>> url = 'https://example.com/example/path?query_string=everything+after+the+questionmark'
>>> urlsplit(url)._replace(query=None).geturl()
'https://example.com/example/path'

For Python 2, the same function is available under the urlparse.urlsplit() name.

You could also use the urllib.parse.urlparse() function; for URLs without any path parameters, the result would be the same. The two functions differ in how path parameters are handled: urlparse() only supports path parameters for the last segment of the path, while urlsplit() leaves path parameters in place in the path, leaving parsing of such parameters to other code. Since path parameters are rarely used these days (later URL RFCs have dropped the feature altogether), the difference is academic. urlparse() uses urlsplit() internally and, when there are no path parameters, adds nothing but extra overhead. It is better to just use urlsplit() directly.
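The difference in path parameter handling can be seen with a short sketch. The ";type=a" path parameter in the sample URL is an assumption added for illustration, and the query is cleared with an empty string here.

```python
from urllib.parse import urlparse, urlsplit

# Sample URL with a path parameter (";type=a") on the last path segment
url = "https://example.com/example/path;type=a?query_string=x"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the path parameter into its own field
print(parsed.path, parsed.params)
# urlsplit() leaves it inside the path
print(split.path)

# Both rebuild the same URL once the query is cleared
via_urlparse = parsed._replace(query="").geturl()
via_urlsplit = split._replace(query="").geturl()
print(via_urlparse)
print(via_urlsplit)
```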

Use this method to remove the query string from a URL:

urllink="http://url.something.com/bla.html?querystring=stuff"
url_final=urllink.split('?')[0]
print(url_final)

The output will be: http://url.something.com/bla.html
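One thing to keep in mind with plain string splitting is that everything after the first "?" is discarded, including any fragment. The comparison below is a small sketch; the "#section" fragment is an assumption added to the sample URL to show the difference.

```python
from urllib.parse import urlsplit

url = "http://url.something.com/bla.html?querystring=stuff#section"

# String splitting drops the query and the fragment together
by_split = url.split("?")[0]
# Parsing the URL removes only the query and keeps the fragment
by_urlsplit = urlsplit(url)._replace(query="").geturl()

print(by_split)
print(by_urlsplit)
```

If the URLs never carry fragments, or the fragment should go as well, the string split is the simpler choice.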

If you are using BaseSpider, manually remove the random values from the query part of the URL with urlparse before yielding a new request:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    item_urls = hxs.select(".//a[@class='...']/@href").extract()
    for item_url in item_urls:
        # remove the bad part of the query part of the URL here
        item_url = urlparse.urljoin(response.url, item_url)
        self.log('Found item URL: %s' % item_url)
        yield Request(item_url, callback=self.parse_item)