Is there a Boyer-Moore string search and fast search and replace function and fast string count for Delphi 2010 String (UnicodeString) out there?

前端 未结 2 1046
广开言路
广开言路 2020-12-07 17:56

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

I have run into Boyer-Moore string sea

相关标签:
2条回答
  • 2020-12-07 18:32

    This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).

    Based on Dorin Dominica's code, I built the following.

    { _FindStringBoyer:
      Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
      Credited to Dorin Duminica.
    }
    function _FindStringBoyer(const sString, sPattern: string;
      const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;
    
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if bCaseSensitive then
            Result := (sString[StringIndex] = sPattern[PatternIndex])
          else
            Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    var
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
    begin
      if fromPos < 1 then
        raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
      LengthPattern := Length(sPattern);
      LengthString := Length(sString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[sPattern[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[sPattern[LengthPattern]];
      SkipTable[sPattern[LengthPattern]] := Large;
      Index := fromPos + LengthPattern -1;
      Result := 0;
      while Index <= LengthString do begin
        repeat
          Index := Index + SkipTable[sString[Index]];
        until Index > LengthString;
        if Index <= Large then
          Break
        else
          Index := Index - Large;
        kIndex := 1;
        while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
          Inc(kIndex);
        if kIndex = LengthPattern then begin
          // Found, return.
          Result := Index - kIndex + 1;
          Index := Index + LengthPattern;
          exit;
        end else begin
          if __SameChar(Index, LengthPattern) then
            Index := Index + LastMarker
          else
            Index := Index + SkipTable[sString[Index]];
        end; // if kIndex = LengthPattern then begin
      end; // while Index <= LengthString do begin
    end;
    
    { Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
      a substring inside the main string, at a much faster rate than we
      could have done otherwise.  Another thing that would be great is
      to have a function that returns an array of find-locations,
      which would be way faster to do than repeatedly calling Pos.
    }
    function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
    var
      foundPos:Integer;
      fromPos:Integer;
      Limit:Integer;
      guard:Integer;
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if CaseSensitive then
            Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
          else
            Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    begin
      result := 0;
      foundPos := 1;
      fromPos := 1;
      Limit := Length(aSourceString);
      guard := Length(aFindString);
      Index := 0;
      LengthPattern := Length(aFindString);
      LengthString := Length(aSourceString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[aFindString[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[aFindString[LengthPattern]];
      SkipTable[aFindString[LengthPattern]] := Large;
      while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin
    
        Index := fromPos + LengthPattern -1;
        if Index>Limit then
            break;
        kIndex := 0;
        while Index <= LengthString do begin
          repeat
            Index := Index + SkipTable[aSourceString[Index]];
          until Index > LengthString;
          if Index <= Large then
            Break
          else
            Index := Index - Large;
          kIndex := 1;
          while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
            Inc(kIndex);
          if kIndex = LengthPattern then begin
            // Found, return.
            //Result := Index - kIndex + 1;
            Index := Index + LengthPattern;
            fromPos := Index;
            Inc(Result);
            break;
          end else begin
            if __SameChar(Index, LengthPattern) then
              Index := Index + LastMarker
            else
              Index := Index + SkipTable[aSourceString[Index]];
          end; // if kIndex = LengthPattern then begin
        end; // while Index <= LengthString do begin
    
      end;
    end; 
    

    This is really a nice Algorithm, because:

    • it's way faster to count instances of substring X in string Y this way, magnificently so.
    • For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)

    Okay I wrote a String Replace in Boyer-Moore style:

    function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
    var
      errors:Integer;
      fromPos:Integer;
      Limit:Integer;
      guard:Integer;
      SkipTable: array [Char] of Integer;
      LengthPattern: Integer;
      LengthString: Integer;
      Index: Integer;
      kIndex: Integer;
      LastMarker: Integer;
      Large: Integer;
      chPattern: Char;
      CaseSensitive:Boolean;
      foundAt:Integer;
      lastFoundAt:Integer;
      copyStartsAt:Integer;
      copyLen:Integer;
        function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
        begin
          if CaseSensitive then
            Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
          else
            Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
        end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    
    begin
      result := '';
      lastFoundAt := 0;
      fromPos := 1;
      errors := 0;
      CaseSensitive := rfIgnoreCase in Flags;
      Limit := Length(aSourceString);
      guard := Length(aFindString);
      Index := 0;
      LengthPattern := Length(aFindString);
      LengthString := Length(aSourceString);
      for chPattern := Low(Char) to High(Char) do
        SkipTable[chPattern] := LengthPattern;
      for Index := 1 to LengthPattern -1 do
        SkipTable[aFindString[Index]] := LengthPattern - Index;
      Large := LengthPattern + LengthString + 1;
      LastMarker := SkipTable[aFindString[LengthPattern]];
      SkipTable[aFindString[LengthPattern]] := Large;
      while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin
    
        Index := fromPos + LengthPattern -1;
        if Index>Limit then
            break;
        kIndex := 0;
        foundAt := 0;
        while Index <= LengthString do begin
          repeat
            Index := Index + SkipTable[aSourceString[Index]];
          until Index > LengthString;
          if Index <= Large then
            Break
          else
            Index := Index - Large;
          kIndex := 1;
          while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
            Inc(kIndex);
          if kIndex = LengthPattern then begin
    
    
            foundAt := Index - kIndex + 1;
            Index := Index + LengthPattern;
            //fromPos := Index;
            fromPos := (foundAt+LengthPattern);
            if lastFoundAt=0 then begin
                    copyStartsAt := 1;
                    copyLen := foundAt-copyStartsAt;
            end else begin
                    copyStartsAt := lastFoundAt+LengthPattern;
                    copyLen := foundAt-copyStartsAt;
            end;
    
            if (copyLen<=0)or(copyStartsAt<=0) then begin
                    Inc(errors);
            end;
    
            Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
            lastFoundAt := foundAt;
            if not (rfReplaceAll in Flags) then
                     fromPos := 0; // break out of outer while loop too!
            break;
          end else begin
            if __SameChar(Index, LengthPattern) then
              Index := Index + LastMarker
            else
              Index := Index + SkipTable[aSourceString[Index]];
          end; // if kIndex = LengthPattern then begin
        end; // while Index <= LengthString do begin
      end;
      if (lastFoundAt=0) then
      begin
         // nothing was found, just return whole original string
          Result := aSourceString;
      end
      else
      if (lastFoundAt+LengthPattern < Limit) then begin
         // the part that didn't require any replacing, because nothing more was found,
         // or rfReplaceAll flag was not specified, is copied at the
         // end as the final step.
        copyStartsAt := lastFoundAt+LengthPattern;
        copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
        Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
      end;
    
    end;
    

    Okay, problem: Stack footprint of this:

    var
      skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi
    

    Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

    0 讨论(0)
  • 2020-12-07 18:41

    Since I was just looking for the same: Jedi JCL has got a unicode aware search engine using Boyer-Moore in jclUnicode.pas. I have no idea how good or how fast it is yet.

    0 讨论(0)
提交回复
热议问题