Save URL as a file name in Python

暖寄归人 2021-01-06 07:32

First, I'm pretty new to Python, so please leave a comment if you decide to downvote.

I have a url such as

http://example.com/here/there/         


        
5 answers
  •  陌清茗 (original poster)
    2021-01-06 07:54

    This is a bad idea: URLs tend to be very long, and even longer once base64-encoded, so you will hit the 255-byte limit that most common filesystems place on filenames.

    You can compress and base64-encode the URL, but it won't get you very far:

    from base64 import b64encode
    import bz2
    import zlib
    from urllib.parse import quote

    def url_strategies(url):
        """Print the byte length of a URL under several encodings."""
        raw = url.encode('utf8')
        print(url)
        print(f'normal  : {len(raw)}')
        # safe="" also percent-encodes '/', which is not allowed in filenames
        print(f'quoted  : {len(quote(raw, safe=""))}')
        b64url = b64encode(raw)
        print(f'b64     : {len(b64url)}')
        # note: compression is applied to the base64 bytes, not the raw URL
        print(f'b64+zlib: {len(b64encode(zlib.compress(b64url)))}')
        print(f'b64+bz2 : {len(b64encode(bz2.compress(b64url)))}')
    

    Here's a typical URL I found on angel.co:

    
    URL = 'https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1'
    

    Even with b64+zlib it doesn't fit within the 255-byte limit:

    normal  : 316
    quoted  : 414
    b64     : 424
    b64+zlib: 304
    b64+bz2 : 396
    

    Even the best of these strategies, zlib compression followed by base64 encoding, still exceeds the limit.
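
    If a reversible name is still wanted for short URLs, one detail matters: the standard base64 alphabet includes '/', which cannot appear in a filename. A minimal sketch, assuming the URL-safe alphabet and compressing the raw URL bytes (unlike the measurement above, which compresses the base64 output). url_to_filename, filename_to_url and NAME_MAX are hypothetical names, not part of the original answer:

```python
import zlib
from base64 import urlsafe_b64decode, urlsafe_b64encode

# Assumed limit: 255 is common on Linux filesystems, but not universal.
NAME_MAX = 255

def url_to_filename(url):
    """Return a reversible, filename-safe encoding of url, or None if too long."""
    packed = zlib.compress(url.encode('utf8'), 9)
    # The URL-safe alphabet uses '-' and '_' instead of '+' and '/'.
    name = urlsafe_b64encode(packed).decode('ascii')
    return name if len(name) <= NAME_MAX else None

def filename_to_url(name):
    """Invert url_to_filename."""
    return zlib.decompress(urlsafe_b64decode(name)).decode('utf8')
```

    url_to_filename returns None when the encoded name would exceed the assumed limit, so the caller can fall back to another scheme.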

    Proper Solution

    Instead, hash the URL to get a short filename and attach the original URL to the file as an extended attribute:

    import os
    from hashlib import sha256
    
    def save_file(url, content, char_limit=13):
        # use the first char_limit hex characters of the URL's sha256 as the filename
        digest = sha256(url.encode()).hexdigest()[:char_limit]
        filename = f'{digest}.html'
        # e.g. 93fb17b5fb81b.html
        with open(filename, 'w') as f:
            f.write(content)
        # store the original URL as an extended attribute
        os.setxattr(filename, 'user.url', url.encode())
        return filename
    

    and then you can retrieve the url attribute:

    print(os.getxattr(filename, 'user.url').decode())
    https://angel.co/job_listings/browse_startups_table?startup_ids%5B%5D=972887&startup_ids%5B%5D=365478&startup_ids%5B%5D=185570&startup_ids%5B%5D=32624&startup_ids%5B%5D=134966&startup_ids%5B%5D=722477&startup_ids%5B%5D=914250&startup_ids%5B%5D=901853&startup_ids%5B%5D=637842&startup_ids%5B%5D=305240&tab=find&page=1
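
    Truncating the digest to 13 hex characters keeps 52 bits, so two different URLs can in principle map to the same filename. A rough sketch of the birthday-bound estimate, assuming SHA-256 output behaves like uniformly random bits. collision_probability is a hypothetical helper, not part of the original answer:

```python
import math

def collision_probability(num_urls, hex_chars=13):
    """Approximate chance that at least two URLs share a truncated hash."""
    space = 16 ** hex_chars  # number of distinct truncated digests
    # Birthday approximation: p ~ 1 - exp(-n * (n - 1) / (2 * space))
    return -math.expm1(-num_urls * (num_urls - 1) / (2 * space))
```

    For a few thousand files the estimate is negligible, and raising char_limit is cheap if the collection grows.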
    

    Note: for user-defined attributes, setxattr and getxattr require the user. prefix in the attribute name.
    For more on file attributes in Python, see this related answer: https://stackoverflow.com/a/56399698/3737009
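
    os.setxattr and os.getxattr are only available on Linux, and the filesystem must support extended attributes. Where that is not the case, a sidecar file is one possible substitute. A minimal sketch of that different technique, not the xattr approach above. save_with_sidecar, read_url and the .url.json suffix are hypothetical:

```python
import json
import os
from hashlib import sha256

def save_with_sidecar(url, content, directory='.', char_limit=13):
    """Save content under a hashed name and record the URL in a sidecar file."""
    name = sha256(url.encode()).hexdigest()[:char_limit]
    path = os.path.join(directory, f'{name}.html')
    with open(path, 'w', encoding='utf8') as f:
        f.write(content)
    # The sidecar lives next to the content file and holds the original URL.
    with open(path + '.url.json', 'w', encoding='utf8') as f:
        json.dump({'url': url}, f)
    return path

def read_url(path):
    """Return the URL recorded for a file written by save_with_sidecar."""
    with open(path + '.url.json', encoding='utf8') as f:
        return json.load(f)['url']
```

    The trade-off is that the sidecar has to be moved or deleted together with the content file, whereas an extended attribute travels with the file on filesystems that support it.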
