问题
I need to find common substring (without space) of two strings in SQL.
Query:
select *
from tbl as a, tbl as b
where a.str <> b.str
Sample data:
str1 | str2 | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva | cdf
aaab akuc | aaabir a | aaab
ab akuc | ab atr | ab
回答1:
I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect free, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000)- or nvarchar(4000)). The solution below is for varchar(8000).
1. A correctly indexed tally (numbers) table.
-- (1) build and populate a persisted (numbers) tally
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);
WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;
-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
Note that a tally table function will not perform as well.
2. Using our tally table to return all possible substrings a string
Using "abcd" as an example, let's get all of it's substrings. Note my comments.
DECLARE @s1 varchar(8000) = 'abcd';
SELECT
position = t.N,
tokenSize = x.N,
string = substring(@s1, t.N, x.N)
FROM dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <= len(@s1) -- all positions
AND x.N <= len(@s1) -- all lengths
AND len(@s1) - t.N - (x.N-1) >= 0 -- filter unessesary rows [e.g.substring('abcd',3,2)]
This returns
position tokenSize string
----------- ----------- -------
1 1 a
2 1 b
3 1 c
4 1 d
1 2 ab
2 2 bc
3 2 cd
1 3 abc
2 3 bcd
1 4 abcd
3. dbo.getshortstring8K
What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1, that are longer than S2, will be a substring of S2. That's the purpose of dbo.getshortstring: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.
This is hugely important because, the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".
Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").
IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
4. Finding all subsrings of the shorter string that exist in the longer string:
DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';
SELECT
s.s1, -- test to make sure s.s1 is the shorter of the two strings
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0;
5. Retrieving ONLY the longest common substring(s)
This is the easy part. We simply Add TOP (1) WITH TIES
to our SELECT
statement and we're all set. Here, the longest common substring is "bc" and "xx"
DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;
6. Applying this logic to your table
Using APPLY we replace my variables @s1 and @s2 with the t.str1 & t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:
-- easily consumbable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');
SELECT str1, str2, [max substring without spaces] = string
FROM @yourtable t
CROSS APPLY
(
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
AND charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
ORDER BY x.N DESC
) lcss;
Results:
str1 str2 max substring without spaces
----------- --------- ------------------------------
aabcdfbas rikcdfva cdf
aaab akuc aaabir a aaab
ab akuc ab atr ab
And the execution plan:
... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.
来源:https://stackoverflow.com/questions/49394121/common-substring-of-two-string-in-sql