问题
I would like to find the shortest possible encoding for a string in the following form:
abbcccc = a2b4c
回答1:
[NOTE: this greedy algorithm does not guarantee shortest solution]
By remembering all previous occurrences of a character it is straight forward to find the first occurrence of a repeating string (minimal end index including all repetitions = maximal remaining string after all repetitions) and replace it with a RLE (Python3 code):
def singleRLE_v1(s):
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
To make it robust for iterative application we have to exclude a few cases where a RLE may not be applied (e.g. '11' or '))'), also we have to make sure the RLE is not making the string longer (which can happen with a substring of two characters occurring twice as in 'abab'):
def singleRLE(s):
"find first occurrence of a repeating substring and replace it with RLE"
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # no RLE for e.g. '11' or other parts of previous inserted RLE
if c == ')': continue # no RLE for '))...)'
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
print("found %d*'%s'" % (i,s_c))
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
return s_RLE
occ[c].append(idx)
return s # no repeating substring found
Now we can safely call singleRLE
on the previous output as long as we find a repeating string:
def iterativeRLE(s):
s_RLE = singleRLE(s)
while s != s_RLE:
print(s_RLE)
s, s_RLE = s_RLE, singleRLE(s_RLE)
return s_RLE
With the above inserted print
statements we get e.g. the following trace and result:
>>> iterativeRLE('xyabcdefdefabcdefdef')
found 2*'def'
xyabc2(def)abcdefdef
found 2*'def'
xyabc2(def)abc2(def)
found 2*'abc2(def)'
xy2(abc2(def))
'xy2(abc2(def))'
But this greedy algorithm fails for this input:
>>> iterativeRLE('abaaabaaabaa')
found 3*'a'
ab3abaaabaa
found 3*'a'
ab3ab3abaa
found 2*'b3a'
a2(b3a)baa
found 2*'a'
a2(b3a)b2a
'a2(b3a)b2a'
whereas one of the shortest solutions is 3(ab2a)
.
回答2:
Since a greedy algorithm does not work, some search is necessary. Here is a depth first search with some pruning (if in a branch the first idx0
characters of the string are not touched, to not try to find a repeating substring within these characters; also if replacing multiple occurrences of a substring do this for all consecutive occurrencies):
def isRLE(s):
"is this a well nested RLE? (only well nested RLEs can be further nested)"
nestCnt = 0
for c in s:
if c == '(':
nestCnt += 1
elif c == ')':
if nestCnt == 0:
return False
nestCnt -= 1
return nestCnt == 0
def singleRLE_gen(s,idx0=0):
"find all occurrences of a repeating substring with first repetition not ending before index idx0 and replace each with RLE"
print("looking for repeated substrings in '%s', first rep. not ending before index %d" % (s,idx0))
occ = dict() # for each character remember all previous indices of occurrences
for idx,c in enumerate(s):
if idx>0 and s[idx-1] in '0123456789': continue # sub-RLE cannot start after number
if not c in occ: occ[c] = []
for c_occ in occ[c]:
s_c = s[c_occ:idx]
if not isRLE(s_c): continue # avoid RLEs for e.g. '))...)'
if idx+len(s_c) < idx0: continue # pruning: this substring has been tried before
if c_occ-len(s_c) >= 0 and s[c_occ-len(s_c):c_occ] == s_c: continue # pruning: always take all repetitions
i = 1
while s[idx+(i-1)*len(s_c) : idx+i*len(s_c)] == s_c:
i += 1
if i > 1:
rle_pars = ('(',')') if len(s_c) > 1 else ('','')
rle = ('%d'%i) + rle_pars[0] + s_c + rle_pars[1]
if len(rle) <= i*len(s_c): # in case of a tie prefer RLE
s_RLE = s[:c_occ] + rle + s[idx+(i-1)*len(s_c):]
#print(" replacing %d*'%s' -> %s" % (i,s_c,s_RLE))
yield s_RLE,c_occ
occ[c].append(idx)
def iterativeRLE_depthFirstSearch(s):
shortestRLE = s
candidatesRLE = [(s,0)]
while len(candidatesRLE) > 0:
candidateRLE,idx0 = candidatesRLE.pop(0)
for rle,idx in singleRLE_gen(candidateRLE,idx0):
if len(rle) <= len(shortestRLE):
shortestRLE = rle
print("new optimum: '%s'" % shortestRLE)
candidatesRLE.append((rle,idx))
return shortestRLE
Sample output:
>>> iterativeRLE_depthFirstSearch('tctttttttttttcttttttttttctttttttttttct')
looking for repeated substrings in 'tctttttttttttcttttttttttctttttttttttct', first rep. not ending before index 0
new optimum: 'tc11tcttttttttttctttttttttttct'
new optimum: '2(tctttttttttt)ctttttttttttct'
new optimum: 'tctttttttttttc2(ttttttttttct)'
looking for repeated substrings in 'tc11tcttttttttttctttttttttttct', first rep. not ending before index 2
new optimum: 'tc11tc10tctttttttttttct'
new optimum: 'tc11t2(ctttttttttt)tct'
new optimum: 'tc11tc2(ttttttttttct)'
looking for repeated substrings in 'tc5(tt)tcttttttttttctttttttttttct', first rep. not ending before index 2
...
new optimum: '2(tctttttttttt)c11tct'
...
new optimum: 'tc11tc10tc11tct'
...
new optimum: 'tc11t2(c10t)tct'
looking for repeated substrings in 'tc11tc2(ttttttttttct)', first rep. not ending before index 6
new optimum: 'tc11tc2(10tct)'
...
new optimum: '2(tc10t)c11tct'
...
'2(tc10t)c11tct'
回答3:
Following is my C++ implementation to do it in-place with O(n)
time complexity and O(1)
space complexity.
class Solution {
public:
int compress(vector<char>& chars) {
int n = (int)chars.size();
if(chars.empty()) return 0;
int left = 0, right = 0, currCharIndx = left;
while(right < n) {
if(chars[currCharIndx] != chars[right]) {
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < (int)freq.length(); i++) {
chars[left++] = freq[i];
}
}
currCharIndx = right;
}
right++;
}
int len = right - currCharIndx;
chars[left++] = chars[currCharIndx];
if(len > 1) {
string freq = to_string(len);
for(int i = 0; i < freq.length(); i++) {
chars[left++] = freq[i];
}
}
return left;
}
};
You need to keep track of three pointers - right
is to iterate, currCharIndx
is to keep track the first position of current character and left
is to keep track the write position of the compressed string.
来源:https://stackoverflow.com/questions/46876515/algorithm-for-simple-string-compression