calculating Lempel-Ziv (LZ) complexity (aka sequence complexity) of a binary string

喜你入骨 提交于 2019-12-11 17:17:52

问题


I need to calculate the LZ-complexity of a binary string. The LZ-complexity is the number of differencet substrings encountered as the stream is viewed from begining to the end. As an example:

s = 1001111011000010

Marking in the different substrings the sequence complexity c(s) = 6: s = 1 / 0 / 01 / 1110 / 1100 / 0010 /

can someone guide me to find a simple solution for that? I am sure there should be some very straight-forward implementations for this well-known problem, but I have difficulty finding them. Can it be done simply done with constructing a suffix tree or something similar. If yes, exactly how? and what should I do?

anyone knows of any c/c++ source code to accomplish the task?

thanks in advance.

to clarify the construction of the tree suggested in the answers. Does the tree looks like this?

         o
       /   \
      o     o
     / \   / \
    o   o o   o
       /     /
      o     o

回答1:


Below is a quick example of how to compute LZ-Complexity using a tree. For convenience - mine; not yours - this code implements a fixed-sized pre-allocated tree, and is a prime example of why void* pointers are ugly to use and difficult to maintain. Hand this code in as is, and your lecturer is likely to shoot you in the face :)

#include <stdlib.h>
#include <stdio.h>

int LZComplexity(char *p_binarySequence, int p_maxTreeNodes)
{
 void **patternTree;
 void **currentNode;
 void **nextFreeNode;
 int nodeCount;
 int sequenceIndex;
 int currentDigit;

 nodeCount = 0;
 patternTree = malloc(sizeof(void*) * (p_maxTreeNodes << 1));
 currentNode = patternTree;
 nextFreeNode = patternTree + (sizeof(void*) << 1);
 currentNode[0] = NULL;
 currentNode[1] = NULL;
 sequenceIndex = 0;

 while (p_binarySequence[sequenceIndex])
 {
  currentDigit = p_binarySequence[sequenceIndex] - 48;
  if (NULL == currentNode[currentDigit])
  {
   currentNode[currentDigit] = nextFreeNode;
   nextFreeNode[0] = NULL;
   nextFreeNode[1] = NULL;
   nextFreeNode += (sizeof(void*) << 1);
   currentNode = patternTree;
   nodeCount++;
  }
  else
  {
   currentNode = currentNode[currentDigit];
  }
  sequenceIndex++;
 }

 free(patternTree);
 return nodeCount;
}


int main(int argc, char *argv[])
{
 printf("%u\n", LZComplexity("10100101001011101011", 1000));
 return 0;
}



回答2:


1 0 01 11 10 110 00 010
Complexity of sequence is 8 because the partitions are 8 not 6 - 1/0/01/11/10/110/00/010




回答3:


@Arash and @Sanchit Gupta: You might've got confused between LZ76 complexity and LZ78 complexity. The one Arash is refering to is LZ76 complexity and the other one is LZ78 complexity. You can refer to section-3 of the paper "Estimating the Entropy Rate of Spike Trains via Lempel-Ziv Complexity".




回答4:


Create a binary tree where left is 0 and right is 1. For each bit, try to find the sequence in the tree. If it's there, concatenate the next bit, rinse, repeat. If it's not there, add it to the tree and continue. LZ Complexity is the total number of paths in the tree (not just # leaf nodes).

By the way, is this homework?




回答5:


This should do the trick in Python (from: Kaspar, F. Schuster, H. Easily calculable measure for the complexity of spatiotemporal patterns. Physical Review A, vol 36, n. 2, p 842.)

#!/usr/bin/python


def lz_complexity(s):
    i, k, l = 0, 1, 1
    k_max = 1
    n = len(s) - 1
    c = 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k = k + 1
            if l + k >= n - 1:
                c = c + 1
                break
        else:
            if k > k_max:
               k_max = k
            i = i + 1
            if i == l:
                c = c + 1
                l = l + k_max
                if l + 1 > n:
                    break
                else:
                    i = 0
                    k = 1
                    k_max = 1
            else:
                k = 1
    return c


def main():
    lz = lz_complexity('1001111011000010')
    assert lz == 6 
    print lz


if __name__ == '__main__':

    main()



回答6:


This may be relevant for you. It is parallel implementation of an algorithm LZMP that computes the LZ-complexity in CUDA and runs on nVidia GPU.

http://www.ariel.ac.il/sites/ratsaby/Code/LZMP.zip



来源:https://stackoverflow.com/questions/4946695/calculating-lempel-ziv-lz-complexity-aka-sequence-complexity-of-a-binary-str

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!