I recently came across an interview question: find all the repeating substrings in a given string, with a minimum length of 2. The algorithm should be efficient.
That's just a wild idea, but worth a try (however, it needs O(N) extra memory for each row of running hashes, where N is the length of the input string). The algorithm does not run in O(N) time, since it computes a hash for every substring, but maybe it can be optimized.
The idea is that you don't want to compare strings often. Instead, you can keep a hash of the characters read so far (for example, the sum of their ASCII codes) and compare the hashes. If two hashes are equal, the strings may be equal (this has to be checked afterwards). For example:
ABCAB
A -> (65)
B -> (131, 66)
C -> (198, 133, 67)
A -> (263, 198, 132, 65)
B -> (329, 264, 198, 131, 66)
Because you're interested only in substrings of length 2 or more, you have to omit the last value in each row (it always corresponds to a single character).
We see two repeated values: 131 and 198. 131 stands for AB both times and reveals the pair; however, 198 stands for ABC, BCA and CAB, which are different strings and have to be rejected by a direct comparison.
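To make the example concrete, here is a minimal Python sketch of the idea, not a tuned solution. The function names running_sum_rows and repeated_substrings are made up for illustration. It builds the same rows as above, groups substrings by their sum, and then compares the actual text inside each group to throw away false matches.

```python
from collections import defaultdict


def running_sum_rows(text):
    """Yield one row per character: the ASCII sums of all substrings ending there."""
    sums = []
    for ch in text:
        code = ord(ch)
        # Extend every substring that ended at the previous character,
        # then start a new one-character substring.
        sums = [s + code for s in sums] + [code]
        yield list(sums)


def repeated_substrings(text, min_len=2):
    """Return {substring: [start positions]} for substrings seen at least twice."""
    buckets = defaultdict(list)  # hash value -> list of (start, end) spans
    for end, row in enumerate(running_sum_rows(text), start=1):
        for start, value in enumerate(row):
            if end - start >= min_len:  # skip substrings that are too short
                buckets[value].append((start, end))

    repeats = {}
    for spans in buckets.values():
        if len(spans) < 2:
            continue
        # Equal sums only suggest equality, so compare the real text.
        by_text = defaultdict(list)
        for start, end in spans:
            by_text[text[start:end]].append(start)
        for sub, starts in by_text.items():
            if len(starts) > 1:
                repeats[sub] = starts
    return repeats


print(repeated_substrings("ABCAB"))  # {'AB': [0, 3]}
```

This sketch keeps every hash in memory and visits every substring, so it is quadratic in both time and space. Overlapping occurrences are counted as repeats here, which is an assumption about what the interviewer wants.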
That's only the idea, not the solution itself. The hash function may be extended to account for the position of each character in the substring (or for the structure of the sequence). The way the hash values are stored may be changed to improve performance (at the cost of increased memory usage).
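One possible way to make the hash depend on character positions is a polynomial hash, sketched below in Python. This is a swapped-in technique, not something the plain-sum example above does, and BASE, MOD, poly_hash and running_poly_rows are illustrative names and values chosen here. Each existing hash is multiplied by a base before the new character is added, so ABC, BCA and CAB no longer share a value.

```python
BASE = 257           # assumed multiplier, larger than any ASCII code
MOD = (1 << 61) - 1  # assumed large prime modulus


def poly_hash(s):
    """Order-sensitive hash: earlier characters get higher powers of BASE."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h


def running_poly_rows(text):
    """Same row layout as the sum example, but with the polynomial hash."""
    hashes = []
    for ch in text:
        code = ord(ch)
        # Shift every running hash by one position, then append the new character.
        hashes = [(h * BASE + code) % MOD for h in hashes] + [code % MOD]
        yield list(hashes)


rows = list(running_poly_rows("ABCAB"))
print(rows[1][0] == rows[4][3])  # both entries are the hash of "AB"
print(len({poly_hash(s) for s in ("ABC", "BCA", "CAB")}))  # number of distinct hashes
```

Collisions are still possible with any hash of this kind, so the final check by direct comparison stays in place; it just fires far less often.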
Hope I helped just a little :)