sys.dm_fts_parser sql full text

时光怂恿深爱的人放手 提交于 2019-12-06 05:11:43
Joseph Sturtevant

My team also ran into this strange tokenization behavior. Our fix was to apply LOWER to the query and search text.

As Haroon previously identified, the issue appears to arise from the tokenizer identifying certain uppercase currency codes in proximity to numbers and treating them differently. This is illustrated by the following query:

SELECT * FROM sys.dm_fts_parser ('"syp 123"', 1033, 0, 0) -- Works fine
SELECT * FROM sys.dm_fts_parser ('"SYP 123"', 1033, 0, 0) -- Doesn't work

We wrote a script that identifies all the 1-4 letter character combinations that exhibit this behavior:

DECLARE @CurrencyCodes TABLE (CurrencyCode varchar(4), TokenCount int)
DECLARE @Start int = 65 -- A
DECLARE @End int = 90 -- Z

DECLARE @A int, @B int, @C int, @D int

SET @A = @Start
WHILE NOT (@A > @End) BEGIN
    INSERT INTO @CurrencyCodes VALUES (CHAR(@A), NULL)
    SET @B = @Start
    WHILE NOT (@B > @End) BEGIN
        INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B), NULL)
        SET @C = @Start
        WHILE NOT (@C > @End) BEGIN
            INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B) + CHAR(@C), NULL)
                SET @D = @Start
                WHILE NOT (@D > @End) BEGIN
                    INSERT INTO @CurrencyCodes VALUES (CHAR(@A) + CHAR(@B) + CHAR(@C) + CHAR(@D), NULL)
                    SET @D = @D + 1
                END
            SET @C = @C + 1
        END
        SET @B = @B + 1
    END
    SET @A = @A + 1
END

UPDATE @CurrencyCodes SET TokenCount = (SELECT COUNT(1) FROM sys.dm_fts_parser ('"' + CurrencyCode + '123,456"', 1033, 0, 0))

SELECT CurrencyCode FROM @CurrencyCodes WHERE TokenCount = 2

From that query we found that we will have problems with any of the following 273 codes:

ADF ADP AED AFA AFN ALK ALL AMD ANG AOA AON AOR ARA ARL ARM ARP ARS ATS AUD AWG AZM AZN BAM BBD BDT BEC BEF BEL BGJ BGK BGL BGN BHD BIF BMD BND BOB BOP BOV BRB BRC BRE BRL BRN BRR BRZ BSD BTN BWP BYR BZD CAD CDF CFP CHE CHF CHW CLF CLP CNX CNY COP COU CRC CSD CSJ CSK CUP CVE CYP CZK DDM DEM DJF DKK DM DOP DZD ECS ECV EEK EGP EQE ERN ESA ESB ESP ETB EUR EURO FF FIM FJD FKP FRF GBP GEL GHC GHS GIP GMD GNE GNF GRD GTQ GWP GYD HKD HNL HRK HTG HUF IDR IEP ILP ILR ILS INR IQD IRR ISJ ISK ITL JMD JOD JPY KES KGS KHR KMF KPW KRW KWD KYD KZT LAJ LAK LBP LKR LRD LSL LTL LUF LVL LYD MAD MAF MCF MDL MGA MGF MKD MKN MMK MNT MOP MRO MTL MUR MVQ MVR MWK MXN MXP MXV MYR MZM MZN NAD NGN NIO NLG NOK NPR NZD OMR PAB PEH PEI PEN PGK PHP PKR PLN PLZ PTE PYG QAR ROL RON RSD RUB RUR RWF SAR SBD SCR SDD SDG SEK SGD SHP SIT SKK SLL SML SOS SRD SRG STD SUR SVC SYP SZL THB TJR TJS TMM TND TOP TPE TRL TRY TTD TWD TZS UAH UAK UGS UGX USD USN USS UYI UYN UYU UZS VAL VEB VEF VNC VND VUV WST XAF XAG XAU XBA XBB XBC XBD XCD XDR XEC XEU XFO XFU XOF XPD XPF XPT XTS YDD YER YUD YUM ZAL ZAR ZMK ZRN ZRZ ZWC ZWD

You have to use the correct Locale identifier(lcid) of the word breaker. Thus, you need to replace the second argument to 1033

SELECT * 
FROM sys.dm_fts_parser('"0 CAD"', 1033, null, 0)

returns 3 tokens

SELECT * 
FROM sys.dm_fts_parser('"0 cad"', 1033, null, 0)

returns 3 tokens

Also need to select English as the word-breaker language for each columns.

USE [YourDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[YourTable] DROP ([YourColumn]) 
GO
USE [YourDB]
GO
ALTER FULLTEXT INDEX ON [dbo].[YourTable] ADD ([YourColumn] LANGUAGE [English])
GO

I used something like this ..

Select * from FROM sys.dm_fts_parser(N'"'+@P_SEARCHSTRING +'" ', 1033, 0, 0) where display_term NOT LIKE 'nn%'

So it avoids all tokens starting with 'nn'

Haroon

When using SQL 2008

select * from sys.dm_fts_parser('"0 CAD"', 0, null, 0) - gives 2 tokens   
select * from sys.dm_fts_parser('"0 CAD"', 1033, null, 0) - gives 3 tokens   

On SQL 2012 (11.0.3218):

select * from sys.dm_fts_parser('"0 CAD"', 1033, null, 0) - gives 2 tokens

In SQL 2012 Microsoft introduced a new word breaker (version 14.0.4763.1000) http://msdn.microsoft.com/en-us/library/gg509108.aspx

It seems that the work-breaker now recognizes 3 character ISO 4217 Currency Codes, and if there is a number prior to the 3 char code it is not broken up.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!