Shorten an already short string in Java

倖福魔咒の 提交于 2019-12-05 18:08:20

You could encode the string as base 40 which is more compact than base 64. This will give you 12 such tokens into a 64 bit long. The 40th token could be the end of string marker to give you the length (as it will not be a whole number of bytes any more)

If you use arithmetic encoding, it could be much smaller but you would need a table of frequencies for each token. (using a long list of possible examples)

class Encoder {
  public static final int BASE = 40;
  StringBuilder chars = new StringBuilder(BASE);
  byte[] index = new byte[256];

  {
    chars.append('\0');
    for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
    for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
    chars.append("-:.");
    Arrays.fill(index, (byte) -1);
    for (byte i = 0; i < chars.length(); i++)
      index[chars.charAt(i)] = i;
  }

  public byte[] encode(String address) {
    try {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      DataOutputStream dos = new DataOutputStream(baos);
      for (int i = 0; i < address.length(); i += 3) {
        switch (Math.min(3, address.length() - i)) {
          case 1: // last one.
            byte b = index[address.charAt(i)];
            dos.writeByte(b);
            break;

          case 2:
            char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch);
            break;

          case 3:
            char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch2);
            break;
        }
      }
      return baos.toByteArray();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  public static void main(String[] args) {
    Encoder encoder = new Encoder();
    for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
      System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
    }
  }
}

prints

twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)

First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ascii representation is only for humans to read, so it doesn't make sense to do compression on that.

Your idea for compressing domain name strings is doable.

Well in your case, I would use a specialized algo for your usecase. Recognize that you can store something other than strings. So for a IPv4 address : port, you would have a class that captured 6 bytes -- 4 for the address and 2 for the port. Another for type for apha-numeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com, for example. So a sample hierarchy may be:

    HostPort
       |
  +----+--------+
  |             |
IPv4        HostnamePort
                |
           DotComHostnamePort


public interface HostPort extends CharSequence { }


public HostPorts {
  public static HostPort parse(String hostPort) {
    ...
  }
}

In this case, the DotComHostnamePort allows you to drop .com from the host name and save 4 chars/bytes, depending on whether you store hostnames in punyform or in UTF16 form.

kapex

The first two bytes could contain the port number. If you always start with this fixed length port number, you don't need to include the separator :. Instead use a bit that indicates whether an IP address follows (see Karl Bielefeldt's solution) or a host name.

You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.

What you are suggesting is similar to base 64 encoding/decoding and there might be some mileage in looking at some of those implementations (base 64 encoding uses 6 bits).

As a starter if you use Apaches base 64 library

String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
String y = new String(Base64.encodeBase64(x.getBytes()));
System.out.println("x = " + x);
System.out.println("y = " + y);

It will shorten your string by a few chars. This obviously does not work as what you end up with is not what you started with.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!