Common substring of two string in SQL

做~自己de王妃 提交于 2020-07-20 04:24:07

问题


I need to find common substring (without space) of two strings in SQL.

Query:

select *
from tbl as a, tbl as b
where a.str <> b.str

Sample data:

str1      | str2      | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva  | cdf
aaab akuc | aaabir a  | aaab
ab akuc   | ab atr    | ab

回答1:


I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect free, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000)- or nvarchar(4000)). The solution below is for varchar(8000).

1. A correctly indexed tally (numbers) table.

-- (1) build and populate a persisted (numbers) tally 
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);

WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;

-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally 
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;

ALTER TABLE dbo.tally 
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

Note that a tally table function will not perform as well.

2. Using our tally table to return all possible substrings a string

Using "abcd" as an example, let's get all of it's substrings. Note my comments.

DECLARE @s1 varchar(8000) = 'abcd';

SELECT
  position  = t.N,
  tokenSize = x.N,
  string    = substring(@s1, t.N, x.N)  
FROM       dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <=  len(@s1) -- all positions
AND   x.N <=  len(@s1) -- all lengths
AND   len(@s1) - t.N - (x.N-1) >= 0 -- filter unessesary rows [e.g.substring('abcd',3,2)]

This returns

position    tokenSize   string
----------- ----------- -------
1           1           a
2           1           b
3           1           c
4           1           d
1           2           ab
2           2           bc
3           2           cd
1           3           abc
2           3           bcd
1           4           abcd

3. dbo.getshortstring8K

What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1, that are longer than S2, will be a substring of S2. That's the purpose of dbo.getshortstring: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.

This is hugely important because, the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".

Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").

IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN 
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
       s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;

4. Finding all subsrings of the shorter string that exist in the longer string:

DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';

SELECT
  s.s1, -- test to make sure s.s1 is the shorter of the two strings
  position  = t.N,
  tokenSize = x.N,
  string    = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t  
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND   x.N between 1 and len(s.s1)
AND   len(s.s1) - t.N - (x.N-1) >= 0
AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0;

5. Retrieving ONLY the longest common substring(s)

This is the easy part. We simply Add TOP (1) WITH TIES to our SELECT statement and we're all set. Here, the longest common substring is "bc" and "xx"

DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';

SELECT TOP (1) WITH TIES 
  position  = t.N,
  tokenSize = x.N,
  string    = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t  
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND   x.N between 1 and len(s.s1)
AND   len(s.s1) - t.N - (x.N-1) >= 0
AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;

6. Applying this logic to your table

Using APPLY we replace my variables @s1 and @s2 with the t.str1 & t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:

-- easily consumbable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable 
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');

SELECT str1, str2,  [max substring without spaces] = string
FROM @yourtable t
CROSS APPLY 
(
    SELECT TOP (1) WITH TIES 
      position  = t.N,
      tokenSize = x.N,
      string    = substring(s.s1, t.N, x.N)
    FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2 
    CROSS JOIN dbo.tally t  
    CROSS JOIN dbo.tally x
    WHERE t.N between 1 and len(s.s1)
    AND   x.N between 1 and len(s.s1)
    AND   len(s.s1) - t.N - (x.N-1) >= 0
    AND   charindex(substring(s.s1, t.N, x.N), s.s2) > 0
    AND   charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
    ORDER BY x.N DESC
) lcss;

Results:

str1        str2      max substring without spaces
----------- --------- ------------------------------
aabcdfbas   rikcdfva  cdf
aaab akuc   aaabir a  aaab
ab akuc     ab atr    ab

And the execution plan:

... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.



来源:https://stackoverflow.com/questions/49394121/common-substring-of-two-string-in-sql

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!