How to define a regular expression with multiple OR operators where each term includes a space prefix and suffix?

好久不见. Submitted on 2019-12-23 12:00:09

Question


I am preparing for a data extraction task. I need to remove a set of terms; none, some or all may be present in each source record string. There are over 100,000 target records. I want to avoid performing single term match/replace actions, since (a) the list of terms-to-be-removed will likely grow, and (b) the time to perform the current match/replace action one term at a time is unacceptable.

My question: how do I modify the regular expression so that every term in the OR-separated list is matched and replaced?

REGULAR EXPRESSION

' and | and or | a o | company | co | c o | dba | d b a '

DESIRED BEHAVIOR

Replace each found term (including the prefix and suffix spaces) with a single space.

ACTUAL BEHAVIOR

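Only every other term found is replaced (including the prefix and suffix spaces) with a single space: in a run of consecutive terms, the first, third, fifth, ... are replaced and the terms in between are left untouched.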

EXAMPLE

Source String

'   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '

Result String (Desired Behavior)

'   MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT  '

Result String (Actual Behavior)

'   MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  '

ENVIRONMENT

SQL Server 2005

User Defined Function regexReplace relying on VBScript.RegExp (code available at end of post)

CODE

set nocount on

declare @source [varchar](800)
declare @regexp [varchar](400)
declare @replace [char](1)
declare @globalReplace [bit]
declare @ignoreCase [bit]
declare @result [varchar](800)

set @globalReplace = 1
set @ignoreCase = 1

SET @source = '   MASHABLE LTD DBA THE INFORMATION EXPERTS and and or a o company co c o dba d b a COPYRIGHT  '
set @regexp = ' and | and or | a o | company | co | c o | dba | d b a '
set @replace = ' '

select @result = master.dbo.regexReplace(@source,@regexp,@replace,@globalReplace,@ignoreCase)

print @result

... producing the result:

   MASHABLE LTD THE INFORMATION EXPERTS and or company c o d b a COPYRIGHT  

* dbo.regexReplace user-defined function definition *

CREATE FUNCTION [dbo].[regexReplace]
    (
    @source varchar(5000),
    @regexp varchar(1000),
    @replace varchar(1000),
    @globalReplace bit = 0,
    @ignoreCase bit = 0
    )
    RETURNS varchar(5000) AS  -- same size as @result, to avoid silent truncation
        BEGIN
            DECLARE @hr integer
            DECLARE @objRegExp integer
            DECLARE @result varchar(5000)

            EXECUTE @hr = sp_OACreate 'VBScript.RegExp', @objRegExp OUTPUT
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'Pattern', @regexp
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'Global', @globalReplace
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END
            EXECUTE @hr = sp_OASetProperty @objRegExp, 'IgnoreCase', @ignoreCase
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END

            EXECUTE @hr = sp_OAMethod @objRegExp, 'Replace', @result OUTPUT, @source, @replace
            IF @hr <> 0 
                BEGIN
                    EXEC @hr = sp_OADestroy @objRegExp
                    RETURN NULL
                END

            EXECUTE @hr = sp_OADestroy @objRegExp
            IF @hr <> 0 
                BEGIN
                    RETURN NULL
                END

            RETURN @result
        END

Answer 1:


Try this (with case-insensitive matching enabled, i.e. @ignoreCase = 1):

(?: (?:and or|and|a o|company|co|c o|dba|d b a))+(?!\S)

Like @mathematical.coffee, I started by factoring out the leading space and replacing the trailing space with a lookahead; in this case, a negative lookahead for a non-whitespace character. This way it will work even if the token is the last one in the string and not followed by a space. But the most important change is the + quantifier, which lets a single match cover a run of two or more consecutive terms whenever possible.
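
As a rough check of the pattern above, here is a minimal sketch in Python. Python's re module is only a stand-in for VBScript.RegExp (an assumption; both support non-capturing groups and lookahead), and build_pattern is a hypothetical helper, not part of the original answer. Because the lookahead leaves the trailing space unmatched, the sketch replaces each match with an empty string rather than a space.

```python
import re

source = ("   MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or a o company co c o dba d b a COPYRIGHT  ")

# Pattern from the answer; case-insensitivity is passed as a flag.
pattern = re.compile(
    r"(?: (?:and or|and|a o|company|co|c o|dba|d b a))+(?!\S)",
    re.IGNORECASE,
)

# The trailing space is only looked at, not consumed, so replace with "".
result = pattern.sub("", source)
print(repr(result))


def build_pattern(terms):
    """Hypothetical helper: build the same shape of pattern from a term list."""
    # Longest terms first, so "and or" is tried before "and".
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?: (?:" + alternation + r"))+(?!\S)", re.IGNORECASE)


terms = ["and", "and or", "a o", "company", "co", "c o", "dba", "d b a"]
result_from_list = build_pattern(terms).sub("", source)
print(result_from_list == result)
```

Building the alternation from a list is one way to cope with a growing set of terms without editing the pattern by hand.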




Answer 2:


This isn't a SQL Server problem. It is a general regex problem, not something specific to the VBScript engine you are accessing through COM. The matches would have to overlap: the trailing space consumed by one match is also the leading space that the next term needs, and a regex engine does not reuse characters it has already matched.

I tried your example in http://www.regextester.com/ and it does the same thing.

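The " and or " that is the first term not replaced would have to start with the trailing space of the first " and ". That space was already consumed by the previous match, so the engine resumes scanning at "and or " with no leading space, and the term no longer matches.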
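
The consumed-space effect can be reproduced with a short sketch. Python's re module is used here only as a stand-in for VBScript.RegExp (an assumption); the pattern and source string are the ones from the question.

```python
import re

source = ("   MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or a o company co c o dba d b a COPYRIGHT  ")

# Original pattern from the question: every term carries both spaces.
pattern = re.compile(
    r" and | and or | a o | company | co | c o | dba | d b a ",
    re.IGNORECASE,
)

# Each match swallows its trailing space, so the term right after it
# has no leading space left and is skipped.
matched = [m.group(0) for m in pattern.finditer(source)]
print(matched)

result = pattern.sub(" ", source)
print(repr(result))
```

Only five of the nine terms are matched, which is the "every other term" behavior described in the question.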

I would look at using word boundary replacement instead: Regex match and replace word delimited by certain characters
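
One possible reading of the word boundary suggestion is sketched below, again with Python's re module standing in for VBScript.RegExp (an assumption). The exact pattern is not from the linked post; it is an illustration. \b asserts a word edge without consuming a character, and the trailing \s* removes the whitespace that followed the removed term.

```python
import re

source = ("   MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or a o company co c o dba d b a COPYRIGHT  ")

# \b consumes nothing, so adjacent terms no longer compete for a space.
pattern = re.compile(
    r"\b(?:and or|and|a o|company|co|c o|dba|d b a)\b\s*",
    re.IGNORECASE,
)

result = pattern.sub("", source)
print(repr(result))

# Caveat: \b also matches next to punctuation, unlike the space-delimited
# pattern in the question.
punctuated = pattern.sub("", "ACME (dba) WIDGETS")
print(repr(punctuated))
```

Whether matching terms next to punctuation is acceptable depends on the data, so this is a trade-off rather than a drop-in replacement.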




Answer 3:


I'd recommend this regex:

( (and(?: or)?|a o|company|c ?o|d ?b ?a)(?= ))

First of all, I moved the prefix/suffix spaces outside your OR group (for efficiency):

( (and(?: or)?|a o|company|c ?o|d ?b ?a) )

However, when you use this regex the matches would have to overlap. For example, in " and and or " it first matches " and ", but the remaining string is then "and or ", which no longer has the preceding space.

So to get around this, I changed the last space to a positive lookahead. It says "make sure this pattern is followed by a space", but doesn't match the space itself.

So when going through " and and or " it matches " and" and leaves " and or", which also matches the pattern. This more or less removes the problem of overlapping matches. It won't match one of your words if it occurs at the end of a string, but your original regex didn't either.

You can see it in action at the regexr site. Note that if you replace each match with a space you'll end up with way too many spaces:

MASHABLE LTD  THE INFORMATION EXPERTS         COPYRIGHT

But you'd have that problem with your original regex anyway. If you remove the matches entirely you'll get:

MASHABLE LTD THE INFORMATION EXPERTS COPYRIGHT
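
The two outcomes above can be checked with a small sketch, using Python's re module as a stand-in for VBScript.RegExp (an assumption). It applies the lookahead pattern from this answer to the source string from the question.

```python
import re

source = ("   MASHABLE LTD DBA THE INFORMATION EXPERTS "
          "and and or a o company co c o dba d b a COPYRIGHT  ")

# Pattern from the answer: leading space is matched, trailing space is
# only checked by the positive lookahead.
pattern = re.compile(
    r"( (and(?: or)?|a o|company|c ?o|d ?b ?a)(?= ))",
    re.IGNORECASE,
)

match_count = sum(1 for _ in pattern.finditer(source))
print(match_count)

# Replacing each match with a space leaves one extra space per match.
with_space = pattern.sub(" ", source)
print(repr(with_space))

# Removing each match entirely gives the desired string.
removed = pattern.sub("", source)
print(repr(removed))
```

With the regexReplace function from the question, the equivalent would be passing an empty string as @replace instead of ' '.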


Source: https://stackoverflow.com/questions/8932453/how-to-define-a-regular-expression-with-multiple-or-operators-where-each-term-in
