Oracle Function to return similarity between strings

Question

I have an interesting problem and am wondering whether Oracle has a built-in function for it, or whether I need to find a fast way to do it in PL/SQL.

Take 2 strings:

  s1 = 'abc def hijk'
  s2 = 'abc def iosk'

The function needs to return abc def because the strings are exactly the same up to that point.

Another example:

  s1 = 'abc def hijk www'
  s2 = 'abc def iosk www'

The function needs to return abc def.

The only way I can think of is to loop through string 1 and compare each character, using substr(), against the corresponding substr() of string 2.

Just wondering if Oracle's got something built-in. Performance is pretty important.

Answer 1:

After re-reading your question, here is what you actually wanted:

with cte1 as  (
    select 1 id, 'abc def hijk www' str from dual
    union all
    select 2 id, 'abc def iosk www' str from dual
), num_gen as (
    -- a number generator up to the minimum length of the strings
    SELECT level num
    FROM dual t
    CONNECT BY level <= (select min(length(str)) from cte1)
), cte2 as (
    -- build substrings of increasing length
    select id, num_gen.num, substr(cte1.str, 1, num_gen.num) sub
    from cte1
    cross join num_gen
), cte3 as (
    -- self join to check if the substrings are equal
    select x1.num, x1.sub sub1, x2.sub sub2
    from cte2 x1
    join cte2 x2 on (x1.num = x2.num and x1.id != x2.id)
), cte4 as (
    -- select maximum string length
    select max(num) max_num
    from cte3
    where sub1 = sub2
)
    -- finally, get the substring with the max length
    select cte3.sub1
    from cte3
    join cte4 on (cte4.max_num = cte3.num)
    where rownum = 1

Essentially, this is what you would do in PL/SQL: build substrings of increasing length and stop at the point where they no longer match.
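For comparison, here is a minimal PL/SQL sketch of that same idea. The function name common_prefix_linear is hypothetical, and instead of rebuilding the whole prefix on every step it compares one character at a time, which gives the same result.

```sql
-- Hypothetical helper, not part of the original answer.
create or replace function common_prefix_linear(str1 varchar2, str2 varchar2)
  return varchar2
as
  max_len pls_integer;
  pos     pls_integer := 0;  -- number of leading characters known to match
begin
  if str1 is null or str2 is null then
    return null;
  end if;

  max_len := least(length(str1), length(str2));

  -- advance while the next character is the same in both strings
  while pos < max_len
        and substr(str1, pos + 1, 1) = substr(str2, pos + 1, 1)
  loop
    pos := pos + 1;
  end loop;

  return substr(str1, 1, pos);
end;
/

select common_prefix_linear('abc def hijk www', 'abc def iosk www') prefix from dual;
```

Calling a PL/SQL function from SQL costs a context switch per row, so whether this beats the pure SQL query depends on the data volume and is worth measuring.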

Answer 2:

I doubt there is a built-in SQL function for this, but it can be done in pure SQL using regular expressions. The query below splits both strings into words and returns the words they have in common:

with cte1 as  (
    select 1 id, 'abc def hijk www' str from dual
    union all
    select 2 id, 'abc def iosk www' str from dual
), cte2 as (
    SELECT distinct id, trim(regexp_substr(str, '[^ ]+', 1, level)) str
    FROM cte1 t
    CONNECT BY instr(str, ' ', 1, level - 1) > 0
)
select distinct t1.str
from cte2 t1
join cte2 t2 on (t1.str = t2.str and t1.id != t2.id)

I haven't done any performance tests, but my experience tells me this is most likely faster than any PL/SQL solution, since you avoid context switches entirely.

Answer 3:

Check the UTL_MATCH package for similar functionality, but to get exactly what you asked for you must write your own function.
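As a quick illustration of what UTL_MATCH offers, the query below calls three of its functions on the strings from the question. It is only a sketch: these functions return a distance or a similarity score, not the common prefix, so they show why a custom function is still needed.

```sql
-- UTL_MATCH returns numeric scores, not the shared prefix itself.
select utl_match.edit_distance('abc def hijk', 'abc def iosk')            edit_dist,
       utl_match.edit_distance_similarity('abc def hijk', 'abc def iosk') edit_sim,
       utl_match.jaro_winkler_similarity('abc def hijk', 'abc def iosk')  jw_sim
from dual;
```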

A binary search for the length of the common prefix performs well on long strings.

 create or replace function ident_pfx(str1 varchar2, str2 varchar2) return varchar2
 as
   len_beg    pls_integer := 0;  -- longest prefix length known to match
   len_end    pls_integer;       -- upper bound on the matching prefix length
   len_mid    pls_integer;
   len_result pls_integer := 0;
 begin
   if str1 is null or str2 is null then
     return null;
   end if;

   len_end := least(length(str1), length(str2));

   loop
     -- use binary search for the common prefix length
     len_mid := ceil((len_beg + len_end) / 2);

     if substr(str1, 1, len_mid) = substr(str2, 1, len_mid) then
       len_beg    := len_mid;
       len_result := len_mid;
     else
       len_end := len_mid;
     end if;

     if len_end - len_beg <= 1 then
       -- the bounds have met: check the last character, then stop
       if substr(str1, 1, len_end) = substr(str2, 1, len_end) then
         len_result := len_end;
       end if;
       exit;
     end if;
   end loop;

   return substr(str1, 1, len_result);
 end;
 /


 select ident_pfx('abc def hijk www','abc def iosk www') ident_pfx from dual;

 abc def

Answer 4:

Another possible solution would be to use XOR. If you XOR the two strings together, the result should have a NUL byte wherever the two strings match.

XOR is not a native operator, but I am pretty sure one of the libraries supports it.
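One package that provides this is UTL_RAW, through UTL_RAW.BIT_XOR. The query below is a small sketch of the idea on the first pair of strings from the question; the expected value in the comment assumes an ASCII-compatible database character set.

```sql
-- Bytes that are equal in both strings come out as 00.
select utl_raw.bit_xor(utl_raw.cast_to_raw('abc def hijk'),
                       utl_raw.cast_to_raw('abc def iosk')) xor_result
from dual;
-- with ASCII-compatible encoding: 000000000000000001061900
```

The eight leading 00 bytes correspond to the shared prefix 'abc def ' (including the trailing space), and the final 00 is the matching 'k'.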

Answer 5:

If "performance is pretty important", you should avoid looping over substrings.

Here is an alternative using XOR (as proposed by @EvilTeach).

 with string_transform as  (
     select 'abc def hijk www' str1, 'abc def iosk www' str2 from dual
 ),
 str as (
 select 
  str1, str2,
  -- add suffix to handle nulls and identical strings
  -- calculate XOR
  utl_raw.bit_xor(utl_raw.cast_to_raw(str1||'X'),utl_raw.cast_to_raw(str2||'Y')) str1_xor_str2
 from string_transform 
 ), str2 as (
 select 
   str1, str2,
   str1_xor_str2,
   -- replace all non-identical characters (not 00) with 2D = '-'
   utl_raw.translate(str1_xor_str2,
                     utl_raw.translate(str1_xor_str2,'00','01'),
                     -- utl_raw.length counts bytes; length() on a RAW counts hex digits
                     utl_raw.copies('2D',utl_raw.length(str1_xor_str2))) xor1
 from str
 ), str3 as (
 select 
   str1, str2,
   -- replace all identical characters (00) with 2B (= '+') and cast back to string
   utl_raw.cast_to_varchar2(utl_raw.translate(xor1,'00','2B')) diff
   -- diff = ++++++++---+++++ (+ means identical position; - difference)
 from str2
 )
 select str1, str2, 
  -- remove the appended suffix character
  substr(diff,1,length(diff)-1) diff,
  -- calculate the length of the identical prefix
  instr(diff,'-')-1 same_prf_length
 from str3 
 ;

Basically, both strings are first converted to RAW. XOR sets identical bytes (characters) to 00. translate then converts the identical bytes to '+' and all others to '-'. The length of the identical prefix is the position of the first '-' in the string, minus one. A different suffix character is appended to each string to handle NULLs and identical strings.

Note that if a string is longer than 2000, some extra processing must be added due to a limitation of UTL_RAW.CAST_TO_VARCHAR2.
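The query above reports the prefix length. If the prefix itself is wanted, one possible variant (a sketch, not the answer author's method) is to look at the XOR result as a hex string and count the leading 00 pairs with a regular expression. The names t, x, xor_hex and same_prefix are made up for illustration, and the sketch assumes a single-byte character set, so that one byte equals one character.

```sql
with t as (
    select 'abc def hijk www' str1, 'abc def iosk www' str2 from dual
), x as (
    select str1, str2,
           -- hex digits of the XOR: two digits per byte, 00 where bytes match
           rawtohex(utl_raw.bit_xor(utl_raw.cast_to_raw(str1),
                                    utl_raw.cast_to_raw(str2))) xor_hex
    from t
)
select str1, str2,
       substr(str1, 1,
              -- leading 00 pairs = matching bytes; cap at the shorter string
              least(nvl(length(regexp_substr(xor_hex, '^(00)*')), 0) / 2,
                    length(str1), length(str2))) same_prefix
from x;
```

For these inputs same_prefix is 'abc def ' with the trailing space. Capping the length with least() replaces the appended suffix characters, and a NULL input yields NULL.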

Source: https://stackoverflow.com/questions/31756297/oracle-function-to-return-similarity-between-strings

Tags: string, Oracle, performance, plsql