Really simple short string compression

感动是毒 2020-12-01 16:21

Is there a really simple compression technique for strings up to about 255 characters in length (yes, I'm compressing URLs)?

I am not concerned with the strength of the compression.

9 Answers
  • 2020-12-01 16:26

    You can use the DEFLATE algorithm directly, without any headers, checksums, or footers, as described in this question: Python: Inflate and Deflate implementations

    In my test this cut a 4100-character URL down to 1270 base64 characters, allowing it to fit inside IE's 2000-character limit.

    And here's an example of a 4000-character URL, which can't be solved with a hashtable since the applet can exist on any server.
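    A minimal Python sketch of that approach (raw DEFLATE via zlib's negative `wbits`, then URL-safe base64; the function names are mine):

    ```python
    import base64
    import zlib

    def deflate_b64(text: str) -> str:
        # Negative wbits selects raw DEFLATE: no zlib header, no checksum.
        comp = zlib.compressobj(9, zlib.DEFLATED, -15)
        data = comp.compress(text.encode("utf-8")) + comp.flush()
        # URL-safe base64 so the result can ride inside a query string.
        return base64.urlsafe_b64encode(data).decode("ascii")

    def inflate_b64(encoded: str) -> str:
        data = base64.urlsafe_b64decode(encoded)
        return zlib.decompress(data, -15).decode("utf-8")
    ```

    On a long, repetitive URL this yields a several-fold reduction; on very short strings the base64 expansion can cancel out the gain.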

  • 2020-12-01 16:27

    Have you tried just using gzip?

    No idea if it would work effectively with such short strings, but I'd say it's probably your best bet.
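    For scale, here is a quick Python check (the gzip container is DEFLATE plus roughly 18 bytes of header and trailer, so a short, low-redundancy string actually grows):

    ```python
    import gzip

    url = "folder1/folder2/page1.aspx"  # 26 characters
    blob = gzip.compress(url.encode("ascii"))

    # The fixed gzip header/trailer overhead dominates at this size,
    # so the "compressed" blob is longer than the original string.
    print(len(url), len(blob))
    ```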

  • 2020-12-01 16:37

    I have just created a compression scheme that targets URLs and achieves around 50% compression (compared to base64 representation of the original URL text).

    See http://blog.alivate.com.au/packed-url/


    It would be great if someone from a big tech company built this out properly and published it for all to use. Google championed Protocol Buffers. A tool like this could save a lot of disk space for someone like Google, while still being scannable. Or perhaps the great captain himself? https://twitter.com/capnproto

    Technically, I would call this a binary (bitwise) serialisation scheme for the data that underlies a URL. Treat the URL as a text representation of conceptual data, then serialise that conceptual data model with a specialised serialiser. The outcome is, of course, a more compressed version of the original. This is very different from how a general-purpose compression algorithm works.
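    To make the idea concrete (this is a toy illustration, not the scheme from the blog post; the token table is invented): a URL-aware packer can map frequent fragments to reserved bytes that never occur in a legal ASCII URL, and leave everything else as plain bytes.

    ```python
    # Toy URL token packer. The table below is made up for illustration;
    # bytes 0x01-0x05 are safe markers because a legal ASCII URL never
    # contains control characters.
    TOKENS = [          # longest tokens first, so prefixes match greedily
        (b"https://www.", b"\x01"),
        (b"https://", b"\x02"),
        (b"http://", b"\x03"),
        (b".com/", b"\x04"),
        (b".html", b"\x05"),
    ]

    def pack(url: str) -> bytes:
        data = url.encode("ascii")
        for text, code in TOKENS:
            data = data.replace(text, code)
        return data

    def unpack(data: bytes) -> str:
        for text, code in TOKENS:
            data = data.replace(code, text)
        return data.decode("ascii")
    ```

    A real scheme would add bit-level field packing and an escape for non-ASCII input, but the win comes from the same place: exploiting structure that a general-purpose compressor cannot assume.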

  • 2020-12-01 16:37

    What's your goal?

    • A shorter URL? Try URL shorteners like http://tinyurl.com/ or http://is.gd/
    • Storage space? Check out System.IO.Compression. (Or SharpZipLib)
    • 2020-12-01 16:38

    As suggested in the accepted answer, using data compression does not work to shorten URL paths that are already fairly short.

    DotNetZip has a DeflateStream class that exposes a static (Shared in VB) CompressString method. It's a one-line way to compress a string using DEFLATE (RFC 1951). The DEFLATE implementation is fully compatible with System.IO.Compression.DeflateStream, but DotNetZip compresses better. Here's how you might use it:

    using Ionic.Zlib;  // DotNetZip's DeflateStream lives in this namespace

    string[] orig = {
        "folder1/folder2/page1.aspx",
        "folderBB/folderAA/page2.aspx",
    };

    public void Run()
    {
        foreach (string s in orig)
        {
            System.Console.WriteLine("original    : {0}", s);
            byte[] compressed = DeflateStream.CompressString(s);
            System.Console.WriteLine("compressed  : {0}", ByteArrayToHexString(compressed));
            string uncompressed = DeflateStream.UncompressString(compressed);
            System.Console.WriteLine("uncompressed: {0}\n", uncompressed);
        }
    }

    // Helper used above: renders each byte as two lowercase hex digits.
    static string ByteArrayToHexString(byte[] bytes)
    {
        var sb = new System.Text.StringBuilder(bytes.Length * 2);
        foreach (byte b in bytes)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }
    

    Using that code, here are my test results:

    original    : folder1/folder2/page1.aspx
    compressed  : 4bcbcf49492d32d44f03d346fa0589e9a9867a89c5051500
    uncompressed: folder1/folder2/page1.aspx
    
    original    : folderBB/folderAA/page2.aspx
    compressed  : 4bcbcf49492d7272d24f03331c1df50b12d3538df4128b0b2a00
    uncompressed: folderBB/folderAA/page2.aspx
    

    So you can see the "compressed" byte array, when represented in hex, is longer than the original, roughly twice as long. The reason is that every byte costs two ASCII characters in hex.

    You could compensate somewhat by using base-62 instead of base-16 (hex) to represent the number. In that case a-z and A-Z are also digits, giving you 0-9 (10) + a-z (+26) + A-Z (+26) = 62 total digits. That would shorten the output significantly. I haven't tried that yet.


    EDIT
    Ok, I tested the base-62 encoder. It shortens the hex string by about half. I had figured it would cut it to 25% (62/16 =~ 4), but that reasoning is off: a hex character carries 4 bits while a base-62 character carries log2(62), about 5.95 bits, so the theoretical gain over hex is only about a third, not 75%. In my tests, the resulting base-62 encoded string is about the same length as the original URL. So, no, using compression and then base-62 encoding is still not a good approach; you really want a hash value.
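    For reference, a base-62 round trip over the compressed bytes might look like the sketch below (my own helper names; note that treating the bytes as one big integer silently drops leading zero bytes, so a real implementation would also have to record the original length):

    ```python
    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz" \
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def base62_encode(data: bytes) -> str:
        # Interpret the byte string as one big integer, then emit
        # base-62 digits from least to most significant.
        n = int.from_bytes(data, "big")
        if n == 0:
            return ALPHABET[0]
        chars = []
        while n:
            n, rem = divmod(n, 62)
            chars.append(ALPHABET[rem])
        return "".join(reversed(chars))

    def base62_decode(text: str) -> bytes:
        n = 0
        for ch in text:
            n = n * 62 + ALPHABET.index(ch)
        return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")
    ```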

  • 2020-12-01 16:43

    I would start with trying one of the existing (free or open source) zip libraries, e.g. http://www.icsharpcode.net/OpenSource/SharpZipLib/

    Zip should work well for text strings, and I am not sure it is worth implementing a compression algorithm yourself.
