Glass beads - How is a suffix array applied here?

Posted by 前提是你 on 2021-01-28 05:34:05

Question


The problem statement can be found at this link - https://uva.onlinejudge.org/index.php?option=com_onlinejudge&Itemid=8&category=24&page=show_problem&problem=660. When I first read the problem, I could not see how the suffix array concept applies to it. I read the code at this link - https://yuting-zhang.github.io/uva/2016/03/22/UVa-719.html. It would be really helpful if someone could take one small example and walk me through a complete trace applying the suffix array and LCP concepts.

Also, I did not understand the meaning of this line in the code from the link I mentioned:

What is this comparison doing - LCP[i + 1] == n - 1 - SA[i] ?

for (int i = 0; i < n; i++)
            if (SA[i] < (n >> 1)){
                // the last condition below is the one I am asking about
                if (i + 1 < n && SA[i + 1] < (n >> 1) && LCP[i + 1] == n - 1 - SA[i])
                    continue;
                printf("%d\n", SA[i] + 1);
                break;
            }

Answer 1:


Let us first understand the concepts that are used:

  • Suffix Array: For a string s of n characters, the suffix array of s holds the n+1 possible suffixes of s (including the empty suffix) sorted in lexicographic order. For a compact representation, we store for each suffix only its starting position in s.

  • Longest-Common-Prefix Array: Holds, for each pair of consecutive entries in the suffix array, the length of the longest common prefix of the two suffixes.
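To make the two definitions concrete, here is a minimal sketch that builds both arrays by plain sorting and character-by-character comparison. The function names buildSuffixArray and buildLcp are made up for this illustration, and the construction is deliberately naive (far too slow for the judge's input sizes, and not the construction used in the linked solution). LCP[i] is taken to be the common prefix length of the entries at SA[i-1] and SA[i], which matches the way LCP[i + 1] is used in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: start positions 0..n of all n+1 suffixes
// (position n is the empty suffix), sorted lexicographically.
std::vector<int> buildSuffixArray(const std::string& s) {
    std::vector<int> sa(s.size() + 1);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// lcp[i] = length of the longest common prefix of the suffixes
// at sa[i-1] and sa[i]; lcp[0] is 0 because it has no predecessor.
std::vector<int> buildLcp(const std::string& s, const std::vector<int>& sa) {
    assert(sa.size() == s.size() + 1);
    const int n = static_cast<int>(s.size());
    std::vector<int> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        lcp[i] = k;
    }
    return lcp;
}
```

For the string banana, the sorted suffixes are (empty), a, ana, anana, banana, na, nana, so the suffix array is 6, 5, 3, 1, 0, 4, 2 and the LCP array is 0, 0, 1, 3, 0, 0, 2.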

For the glass-beads problem we get an input string representing the circular glass-bead chain. We are asked to find the "weakest link", meaning the position of the bead such that cutting the chain just before that bead, and reading the string that starts at that bead and goes around the chain up to the cut, gives the lexicographically smallest string among all possible cuts. When there are multiple possible solutions, we are asked to return the cut occurring earliest in the input string.

Consider the examples from the link you gave:

  • helloworld: The weakest link is at position 10. This means we cut right before d, yielding the new string dhelloworl. We can immediately see that there can be no better position to cut, because d is the smallest letter appearing in the input string. Therefore, no other cut can generate a new string that is lexicographically smaller than our string starting with d. We see that the problem is trivial when the string has a unique smallest character, such as the d in this case.

  • amandamanda: The weakest link is at position 11. This means we cut right before the last a, yielding the new string aamandamand. Again it is clear that there can be no better place to cut, because no other cut can generate a string starting with aa, therefore no other cut can generate a string that is lexicographically smaller than aamandamand.

  • dontcallmebfu: The weakest link is at position 6. This means we cut right before the a, yielding the new string allmebfudontc. It is immediately clear that this is the only possible solution, because a is the unique smallest character in the input string.

  • aaabaaa: The weakest link is at position 5. This means we cut right before the a that follows the b, yielding the new string aaaaaab. We can see that this is the best position to cut because it generates the longest possible sequence of as before the b occurs. Therefore this new string is lexicographically smallest among new strings generated by all possible cuts.
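The four examples above can be double-checked with a brute-force sketch that simply tries every cut and keeps the earliest one producing the smallest string. The function name weakestLinkBruteForce is made up for this illustration; it is quadratic in the length of the chain and only meant for verifying small cases, not as a solution for the judge.

```cpp
#include <cassert>
#include <string>

// Returns the 1-based position of the earliest cut that yields the
// lexicographically smallest rotation of s.
int weakestLinkBruteForce(const std::string& s) {
    assert(!s.empty());
    std::size_t best = 0;
    std::string bestRotation = s;  // cut before the first bead
    for (std::size_t p = 1; p < s.size(); ++p) {
        // String read from bead p around the chain back to the cut.
        std::string rotation = s.substr(p) + s.substr(0, p);
        // Strict comparison keeps the earliest cut among equal rotations.
        if (rotation < bestRotation) {
            best = p;
            bestRotation = rotation;
        }
    }
    return static_cast<int>(best) + 1;
}
```

Calling it on helloworld, amandamanda, dontcallmebfu and aaabaaa gives 10, 11, 6 and 5, matching the positions discussed above.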

Let us now consider how we can apply SA/LCP to this problem: As noted in your second link, the approach is to construct SA/LCP for the doubled input string. Doubling the input string means concatenating two copies of the input string. Why would we do this? It allows us to simulate the circularity of the glass-bead chain. Consider again the example helloworld (size 10). When we double the input string, we get helloworldhelloworld. When we cut before the d in the input string, we can now read the string generated by the cut, by moving forward 10 characters from the cut in the doubled string: helloworl|dhelloworl|d.

If we look carefully, we can see that we actually never need the last character in the second copy. The only way of reaching the second d would be by cutting after the first d. But we would never cut after the first d, because that would be equivalent to cutting at the beginning of the first copy. So as a small optimization, we can omit the last character of the second copy, which is done in the code from your second link, when the loop for creating the second copy only goes up to n-1 (with n being the length of the input string):

        for (int i = 0; i < n - 1; i++)
            st[i + n] = st[i];
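As a small sketch of why the doubling works (the helper names doubleChain and rotationAt are made up here and do not appear in the linked code): every string produced by a cut is just a window of length m in the doubled string, where m is the length of the input.

```cpp
#include <cassert>
#include <string>

// Input string followed by a second copy without its last character.
std::string doubleChain(const std::string& s) {
    assert(!s.empty());
    return s + s.substr(0, s.size() - 1);
}

// String obtained by cutting right before 0-based position p: the window
// of length s.size() that starts at p in the doubled string.
std::string rotationAt(const std::string& s, std::size_t p) {
    assert(p < s.size());
    return doubleChain(s).substr(p, s.size());
}
```

For helloworld the doubled string is helloworldhelloworl, and the window starting at 0-based position 9 is dhelloworl. The largest window start is m-1, so the window never reaches past index 2m-2, which is why the omitted last character is never needed.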

After having constructed the SA of the doubled input string, the lexicographically smallest suffix should be at the beginning (because the SA is sorted lexicographically by definition). However, we have to keep in mind the SA also contains suffixes that start in the second half of the doubled input string. But these are not useful to us, because only the cuts in the first half of the doubled input string simulate the wrap-around on the circular bead-chain by being followed with the second copy, whereas the second copy is followed by nothing.

Consider for example the SA of helloworldhelloworl:

  • 19: $ (empty)
  • 9: dhelloworl$
  • 11: elloworl$
  • 1: elloworldhelloworl$
  • (... I'm omitting the full array, because we are only interested in the beginning...)

We can see that the lexicographically smallest item is the empty word $. But this cut is not useful to us, because it lies in the second half of the doubled input string. When we project this cut back onto the input string, we would have to cut right after the last bead, or equivalently, right before the first bead (because of the circularity of the glass-bead chain). The new string generated by this cut would therefore be helloworld, which is not lexicographically smaller than the string for the second entry in the SA, dhelloworl. For this reason, we have to skip the SA entries starting in the second half of the doubled input string. In the code you linked, this check is implemented by

if (SA[i] < (n >> 1)){

Here n is twice the size of the original string (the minus one from omitting the last character of the second copy is cancelled out by the $ we append as the terminal character). The >> 1 is a bitwise right shift by one position, which is equivalent to integer division by two. So this check ensures that only cuts in the first half of the doubled string are considered.

As the problem has the additional constraint that when there are multiple solutions, the one appearing earliest in the input string should be returned, we have to apply additional filtering to the SA entries. Consider the example from your second link. We have the input AAA and double it (minus one) to AAAAA. The suffix array would then be:

  • 5: $ (empty word)
  • 4: A$
  • 3: AA$
  • 2: AAA$
  • 1: AAAA$
  • 0: AAAAA$

The cuts at 5, 4, 3 are skipped by the previous check because they occur in the second half of the doubled input string. Among the remaining entries, the cut at 2 comes first in the SA and would therefore be picked as lexicographically smallest. However, when projecting these cuts back onto the original input, the cuts 2, 1, 0 all generate the same string AAA. So these three cuts are lexicographically equivalent, and among them the cut at 0 is the earliest. Therefore 0 is the right answer (printed as 1, because the output positions are 1-based).

Therefore, we have to skip an entry in the SA if it is immediately followed by another entry that generates the same result string on the original bead-chain. We can check this by using the LCP array, which tells us how long the common prefix of the current entry and its successor is. If the length of the common prefix is equal to the size of the current suffix, the current suffix is completely contained in the successor entry. This is realized by the check

if (i + 1 < n && SA[i + 1] < (n >> 1) && LCP[i + 1] == n - 1 - SA[i])
  continue;

The individual parts of the check mean:

  • i+1 < n: We are not at the end of the SA, so there is a successor we can check.
  • SA[i + 1] < (n >> 1): The successor is not in the second half of the doubled input string.
  • LCP[i + 1] == n - 1 - SA[i]: Recall that SA[i] gives the starting position of the ith suffix in the SA. The SA was constructed on the doubled input string, which has size n. Hence n-1-SA[i] is the distance between the start of the suffix and the end of the string, i.e. the length of the suffix. If you look at the AAA example above, the doubled string is AAAAA$ (with the unreachable A in the second copy removed, because we are only interested in suffixes starting in the first half and having length at most 3), so n is 2*3=6. The suffix AAA$ starts at position 2. Then n-1-2 = 3 is the length of the suffix (ignoring the $, which is the same for all suffixes). So the check says "the length of the ith suffix is equal to the length of its longest common prefix with the next suffix in the SA". This means the ith suffix is a prefix of the next SA entry. Therefore the next SA entry represents a cut occurring earlier in the input string that produces a lexicographically equivalent new string.
  • continue means the current iteration of the loop is skipped to continue with the next iteration of the loop. I.e. the ith suffix is skipped.
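Putting the pieces together, the following is a minimal sketch of the whole approach under some assumptions. The function weakestLink is a made-up name, the SA and LCP arrays are built naively by sorting and direct comparison instead of with the faster construction in the linked solution, and every input character is assumed to compare greater than '$'. Only the selection loop at the end mirrors the code from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Returns the 1-based weakest link of the circular chain s.
int weakestLink(const std::string& s) {
    assert(!s.empty());
    // Doubled string without the last character of the second copy,
    // terminated by '$' (assumed smaller than every input character).
    const std::string t = s + s.substr(0, s.size() - 1) + '$';
    const int n = static_cast<int>(t.size());  // n == 2 * s.size()

    // Naive suffix array: sort start positions by comparing suffixes.
    std::vector<int> SA(n);
    std::iota(SA.begin(), SA.end(), 0);
    std::sort(SA.begin(), SA.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });

    // LCP[i] = common prefix length of the suffixes at SA[i-1] and SA[i].
    std::vector<int> LCP(n, 0);
    for (int i = 1; i < n; ++i) {
        int a = SA[i - 1], b = SA[i], k = 0;
        while (a + k < n && b + k < n && t[a + k] == t[b + k]) ++k;
        LCP[i] = k;
    }

    for (int i = 0; i < n; ++i) {
        if (SA[i] < (n >> 1)) {  // cut lies in the first copy
            // Skip if the next entry is an earlier cut giving the same string.
            if (i + 1 < n && SA[i + 1] < (n >> 1) && LCP[i + 1] == n - 1 - SA[i])
                continue;
            return SA[i] + 1;
        }
    }
    return 1;  // not reached for non-empty input
}
```

Tracing AAA with this sketch: the entries 5, 4, 3 fail the first-half check, the entries 2 and 1 are skipped because each is a prefix of its successor (LCP values 3 and 4), and the entry 0 has no successor, so the result is 0 + 1 = 1. For helloworld the first entry in the first half is 9, its successor 11 lies in the second half, and the result is 10.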


Source: https://stackoverflow.com/questions/61848794/glass-beads-how-does-suffix-array-applied-here
