How to find repeating sequence of characters in a given array?

…衆ロ難τιáo~ 提交于 2019-11-27 00:34:33

问题


My problem is to find the repeating sequence of characters in the given array. simply, to identify the pattern in which the characters are appearing.

   .---.---.---.---.---.---.---.---.---.---.---.---.---.---.
1: | J | A | M | E | S | O | N | J | A | M | E | S | O | N |
   '---'---'---'---'---'---'---'---'---'---'---'---'---'---'

   .---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
2: | R | O | N | R | O | N | R | O | N | R | O | N | R | O | N |
   '---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'

   .---.---.---.---.---.---.---.---.---.---.---.---.
3: | S | H | A | M | I | L | S | H | A | M | I | L |
   '---'---'---'---'---'---'---'---'---'---'---'---'

   .---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
4: | C | A | R | P | E | N | T | E | R | C | A | R | P | E | N | T | E | R |
   '---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'

Example

Given the previous data, the result should be:

  1. "JAMESON"
  2. "RON"
  3. "SHAMIL"
  4. "CARPENTER"

Question

  • How to deal with this problem efficiently?

回答1:


For your examples, my first approach would be to

  1. get the first character of the array (for your last example, that would be C)
  2. get the index of the next appearance of that character in the array (e.g. 9)
  3. if it is found, search for the next appearance of the substring between the two appearances of the character (in this case CARPENTER)
  4. if it is found, you're done (and the result is this substring).

Of course, this works only for a very limited subset of possible arrays, where the same word is repeated over and over again, starting from the beginning, without stray characters in between, and its first character is not repeated within the word. But all your examples fall into this category - and I prefer the simplest solution which could possibly work :-)

If the repeated word contains the first character multiple times (e.g. CACTUS), the algorithm can be extended to look for subsequent occurrences of that character too, not only the first one (so that it finds the whole repeated word, not only a substring of it).

Note that this extended algorithm would give a different result for your second example, namely RONRON instead of RON.




回答2:


Tongue-in-cheek O(NlogN) solution

Perform an FFT on your string (treating characters as numeric values). Every peak in the resulting graph corresponds to a substring periodicity.




回答3:


In Python, you can leverage regexes thus:

def recurrence(text):
    import re
    for i in range(1, len(text)/2 + 1):
        m = re.match(r'^(.{%d})\1+$'%i, text)
        if m: return m.group(1)

recurrence('abcabc') # Returns 'abc'

I'm not sure how this would translate to Java or C. (That's one of the reasons I like Python, I guess. :-)




回答4:


First write a method that find repeating substring sub in the container string as below.

boolean findSubRepeating(String sub, String container);

Now keep calling this method with increasing substring in the container, first try 1 character substring, then 2 characters, etc going upto container.length/2.




回答5:


Pseudocode

len = str.length
for (i in 1..len) {
   if (len%i==0) {
      if (str==str.substr(0,i).repeat(len/i)) {
         return str.substr(0,i)
      }
   }
}

Note: For brevity, I'm inventing a "repeat" method for strings, which isn't actually part of Java's string; "abc".repeat(2)="abcabc"




回答6:


Using C++:

//Splits the string into the fragments of given size
//Returns the set of of splitted strings avaialble
set<string> split(string s, int frag)
{
    set<string> uni;
    int len = s.length();
    for(int i = 0; i < len; i+= frag)
    {
        uni.insert(s.substr(i, frag));
    }

    return uni;
}

int main()
{

    string out;
    string s = "carpentercarpenter";
    int len = s.length();

      //Optimistic approach..hope there are only 2 repeated strings
      //If that fails, then try to break the strings with lesser number of
      //characters
    for(int i = len/2; i>1;--i)
    {
        set<string> uni = split(s,i);
        if(uni.size() == 1)
        {
            out = *uni.begin();
            break;
        }
    }

    cout<<out;
    return 0;

}



回答7:


The first idea that comes to my mind is trying all repeating sequences of lengths that divide length(S) = N. There is a maximum of N/2 such lengths, so this results in a O(N^2) algorithm.

But i'm sure it can be improved...




回答8:


and here is a concrete working example:

/* find greatest repeated substring */
char *fgrs(const char *s,size_t *l)
{
  char *r=0,*a=s;
  *l=0;
  while( *a )
  {
    char *e=strrchr(a+1,*a);
    if( !e )
      break;
    do {
      size_t t=1;
      for(;&a[t]!=e && a[t]==e[t];++t);
      if( t>*l )
        *l=t,r=a;
      while( --e!=a && *e!=*a );
    } while( e!=a && *e==*a );
    ++a;
  }
  return r;
}

  size_t t;
  const char *p;
  p=fgrs("BARBARABARBARABARBARA",&t);
  while( t-- ) putchar(*p++);
  p=fgrs("0123456789",&t);
  while( t-- ) putchar(*p++);
  p=fgrs("1111",&t);
  while( t-- ) putchar(*p++);
  p=fgrs("11111",&t);
  while( t-- ) putchar(*p++);



回答9:


I'd convert the array to a String object and use regex




回答10:


Not sure how you define "efficiently". For easy/fast implementation you could do this in Java:

    private static String findSequence(String text) {
        Pattern pattern = Pattern.compile("(.+?)\\1+");
        Matcher matcher = pattern.matcher(text);
        return matcher.matches() ? matcher.group(1) : null;
    }

it tries to find the shortest string (.+?) that must be repeated at least once (\1+) to match the entire input text.




回答11:


Put all your character in an array e.x. a[]

i=0; j=0;
for( 0 < i < count ) 
{
if (a[i] == a[i+j+1])
    {++i;}
else
    {++j;i=0;}
}

Then the ratio of (i/j) = repeat count in your array. You must pay attention to limits of i and j, but it is the simple solution.




回答12:


Here is a more general solution to the problem, that will find repeating subsequences within an sequence (of anything), where the subsequences do not have to start at the beginning, nor immediately follow each other.

given an sequence b[0..n], containing the data in question, and a threshold t being the minimum subsequence length to find,

l_max = 0, i_max = 0, j_max = 0;
for (i=0; i<n-(t*2);i++) {
  for (j=i+t;j<n-t; j++) {
    l=0;
    while (i+l<j && j+l<n && b[i+l] == b[j+l])
      l++;
    if (l>t) {
      print "Sequence of length " + l + " found at " + i + " and " + j);
      if (l>l_max) {
        l_max = l;
        i_max = i;
        j_max = j;
      }
    }
  }
}
if (l_max>t) {
  print "longest common subsequence found at " + i_max + " and " + j_max + " (" + l_max + " long)";
}

Basically:

  1. Start at the beginning of the data, iterate until within 2*t of the end (no possible way to have two distinct subsequences of length t in less than 2*t of space!)
  2. For the second subsequence, start at least t bytes beyond where the first sequence begins.
  3. Then, reset the length of the discovered subsequence to 0, and check to see if you have a common character at i+l and j+l. As long as you do, increment l. When you no longer have a common character, you have reached the end of your common subsequence. If the subsequence is longer than your threshold, print the result.



回答13:


Just figured this out myself and wrote some code for this (written in C#) with a lot of comments. Hope this helps someone:

// Check whether the string contains a repeating sequence.
public static bool ContainsRepeatingSequence(string str)
{
    if (string.IsNullOrEmpty(str)) return false;

    for (int i=0; i<str.Length; i++)
    {
        // Every iteration, cut down the string from i to the end.
        string toCheck = str.Substring(i);

        // Set N equal to half the length of the substring. At most, we have to compare half the string to half the string. If the string length is odd, the last character will not be checked against, but it will be checked in the next iteration.
        int N = toCheck.Length / 2;

        // Check strings of all lengths from 1 to N against the subsequent string of length 1 to N.
        for (int j=1; j<=N; j++)
        {
            // Check from beginning to j-1, compare against j to j+j.
            if (toCheck.Substring(0, j) == toCheck.Substring(j, j)) return true;
        }
    }

    return false;
}

Feel free to ask any questions if it's unclear why it works.



来源:https://stackoverflow.com/questions/3675390/how-to-find-repeating-sequence-of-characters-in-a-given-array

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!