How to compress URL parameters

孤城傲影 2020-12-04 11:16

Say I have a single-page application that uses a third party API for content. The app’s logic is in-browser only, and there is no backend I can write to.

To allow deep linking, the app's state has to be serialized into the URL's parameters, and those parameters can get long. Hence the question: how can they be compressed?

12 answers
    轻奢々 2020-12-04 11:58

    I have a cunning plan! (And a gin and tonic.)

    You don't seem to care about the length of the byte stream but about the length of the resulting glyphs, i.e. the string that is displayed to the user.

    Browsers are pretty good at converting an IRI to the underlying URI while still displaying the IRI in the address bar. IRIs have a much larger repertoire of possible characters, while your set of possible characters is rather limited.
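
    For illustration (the URL is hypothetical): the browser can show the IRI, but what actually goes over the wire is the percent-encoded UTF-8 form.

        // The IRI's non-ASCII characters become percent-encoded
        // UTF-8 in the underlying URI.
        console.log(encodeURI("https://example.com/#一丁"));
        // → https://example.com/#%E4%B8%80%E4%B8%81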

    That means you can encode bigrams of your characters (aa, ab, ac, …, zz & special chars) into a single character from the full Unicode range. Say you have 80 possible ASCII characters: the number of possible two-character combinations is 80 × 80 = 6400. That many code points are easily found among Unicode's assigned characters, e.g. in the unified CJK Han range:

    aa  →  一
    ab  →  丁
    ac  →  丂
    ad  →  七
    …
    

    I picked CJK because this is only (slightly) reasonable if the target characters are assigned in Unicode and have glyphs on the major browsers and operating systems. For that reason the Private Use Area is out, and so is the more efficient trigram version (whose possible combinations could use all of Unicode's 1,114,112 code points).
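
    A minimal sketch of such an arithmetic encoder/decoder, assuming a made-up 80-character alphabet and U+4E00 as the start of the target range:

        // Sketch: arithmetic bigram ↔ CJK mapping. ALPHABET is a
        // hypothetical 80-character set; TARGET_START is U+4E00 (一),
        // the first CJK Unified Ideograph.
        const ALPHABET =
          "abcdefghijklmnopqrstuvwxyz" +
          "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
          "0123456789-_.~!*'();:@&=+$,/"; // 80 characters
        const BASE = ALPHABET.length;     // 80 → 80 × 80 = 6400 bigrams
        const TARGET_START = 0x4e00;      // start of CJK Unified Ideographs

        function encode(input: string): string {
          let out = "";
          for (let i = 0; i < input.length; i += 2) {
            const hi = ALPHABET.indexOf(input[i]);
            if (hi < 0) throw new Error("char outside alphabet: " + input[i]);
            if (i + 1 === input.length) { out += input[i]; break; } // odd tail
            const lo = ALPHABET.indexOf(input[i + 1]);
            if (lo < 0) throw new Error("char outside alphabet: " + input[i + 1]);
            // Each bigram (hi, lo) maps arithmetically to one code point,
            // so no lookup table is needed: aa → 一, ab → 丁, and so on.
            out += String.fromCodePoint(TARGET_START + hi * BASE + lo);
          }
          return out;
        }

        function decode(encoded: string): string {
          let out = "";
          for (const ch of encoded) {
            const cp = ch.codePointAt(0)!;
            if (cp >= TARGET_START && cp < TARGET_START + BASE * BASE) {
              const n = cp - TARGET_START;
              out += ALPHABET[Math.floor(n / BASE)] + ALPHABET[n % BASE];
            } else {
              out += ch; // the pass-through odd tail character
            }
          }
          return out;
        }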

    To recap: the underlying bytes are still there and – given UTF-8 encoding – possibly even longer (two ASCII characters are 2 bytes, while one CJK character is 3), but the string of displayed characters that the user sees and copies is 50% shorter.
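
    For instance, with the sketched encode() above:

        // Glyph count halves; UTF-8 byte count grows.
        const short = encode("state=abc"); // odd length: "c" passes through
        console.log([...short].length);                      // 5 glyphs instead of 9
        console.log(new TextEncoder().encode(short).length); // 13 bytes instead of 9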

    OK, OK, reasons why this solution is insane:

    • IRIs are not perfect. A lot of tools less capable than modern browsers have problems with them.

    • The algorithm obviously needs a lot more work. You'll need a function that maps the bigrams to the target characters and back, and it should preferably work arithmetically (as in the sketch above) to avoid big hash tables in memory.

    • The target characters should be checked: are they actually assigned, and are they simple characters rather than fancy Unicodian things like combining characters or stuff that gets lost somewhere in Unicode normalization? Also check whether the target area is a continuous span of assigned characters with glyphs; a rough check is sketched after this list.

    • Browsers are sometimes wary of IRIs, for good reason, given the IDN homograph attacks. Are they OK with all these non-ASCII characters in their address bar?

    • And the biggest: people are notoriously bad at remembering characters in scripts they don't know. They are even worse at trying to (re)type those characters. And copy'n'paste can go wrong in many different ways. There is a reason URL shorteners use Base64 and even smaller alphabets.
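
    For the checks from the list above, a rough sketch (font coverage on real systems still has to be verified by eye, since code can't see which glyphs actually render):

        // Check: single code point, assigned letter, no combining marks,
        // and stable under both NFC and NFD normalization.
        function isSafeTarget(ch: string): boolean {
          if ([...ch].length !== 1) return false; // exactly one code point
          if (!/\p{L}/u.test(ch)) return false;   // assigned, letter-like
          if (/\p{M}/u.test(ch)) return false;    // no combining marks
          return ch.normalize("NFC") === ch && ch.normalize("NFD") === ch;
        }

        // Verify the whole 6400-code-point span starting at U+4E00:
        const spanOk = Array.from({ length: 6400 }, (_, i) =>
          String.fromCodePoint(0x4e00 + i)
        ).every(isSafeTarget);
        console.log(spanOk); // the CJK Unified Ideographs pass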

    … and speaking of URL shorteners: that would be my actual recommendation. Offload the work of shortening links to the user, or integrate goo.gl or bit.ly via their APIs.
