R regex - extract words beginning with @ symbol

太阳男子 2021-01-18 04:09

I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this

3 answers
  •  南方客 (original poster)
    2021-01-18 04:30

    A couple of things about your regex:

    • (?<=\b) is the same as \b, because a word boundary is already a zero-width assertion.
    • \@ is the same as @, because @ is not a special regex metacharacter and does not need to be escaped.
    • [^\s]+ is the same as \S+; almost all shorthand character classes have negated counterparts in regex.

    So your regex, \b@\S+, matches @i in h@i because there is a word boundary between h (a word char) and @ (a non-word char: not a letter, digit, or underscore).
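To make the points above concrete, here is a minimal sketch in base R with perl = TRUE. The "original" pattern is reassembled from the three pieces listed above, so its exact form is an assumption rather than a quote from the question; the sample vector is the one used later in this answer.

```r
# Sample input: one "@" glued to a word, and two standalone handles.
x <- c("h@i", "hi @hello @me")

# Assumed original pattern (rebuilt from the pieces discussed above)
# versus its simplified equivalent. Both use PCRE via perl = TRUE.
original   <- regmatches(x, gregexpr("(?<=\\b)\\@[^\\s]+", x, perl = TRUE))
simplified <- regmatches(x, gregexpr("\\b@\\S+", x, perl = TRUE))

# The two patterns should find exactly the same matches.
identical(original, simplified)

# \b before "@" needs a word char on the left, so "h@i" yields "@i",
# while the space-preceded handles in the second string are skipped.
simplified
```

This also shows why \b is the wrong tool here: it matches the case you want to exclude and skips the cases you want to keep.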

    \b is an ambiguous pattern whose meaning depends on its context: before @ (a non-word char), it requires a word char on the left. In your case, you probably want \B, a non-word boundary: \B@\S+ matches an @ that is either preceded by a non-word char or at the start of the string.

    x <- c("h@i", "hi @hello @me")
    regmatches(x, gregexpr("\\B@\\S+", x))
    ## => [[1]]
    ## character(0)
    ## 
    ## [[2]]
    ## [1] "@hello" "@me"   
    


    If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries built from lookarounds, either with stringr methods or with base R regex functions and the perl=TRUE argument:

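    regmatches(x, gregexpr("(?<!\\w)@\\S+", x, perl=TRUE))
    regmatches(x, gregexpr("(?<!\\S)@\\S+", x, perl=TRUE))

    where:

    • (?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that fails the match if there is a word char immediately to the left, so the current location must be preceded by a non-word char or be the start of the string
    • (?<!\S) - a whitespace starting word boundary - is a negative lookbehind that fails the match if there is a non-whitespace char immediately to the left, so the current location must be preceded by a whitespace char or be the start of the string.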


    Note that the corresponding right-hand boundaries are (?!\w) and (?!\S), both negative lookaheads.
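As a rough illustration of combining left-hand and right-hand boundaries, the sketch below uses base R with perl = TRUE. The sample vector y and the choice of \w+ for the handle body are assumptions made for this example; with \S+ the trailing (?!\S) would always succeed, so \w+ is used to make the right-hand boundary visible.

```r
# Hypothetical sample: standalone handles, a handle followed by
# punctuation, and an e-mail address that should never match.
y <- c("hi @hello @me", "ping @bob! me@example.com")

# Word-style boundaries: no word char directly before "@" or after the handle.
word_bounded <- regmatches(
  y, gregexpr("(?<!\\w)@\\w+(?!\\w)", y, perl = TRUE)
)

# Whitespace-style boundaries: the handle must be a whole
# whitespace-delimited token.
space_bounded <- regmatches(
  y, gregexpr("(?<!\\S)@\\w+(?!\\S)", y, perl = TRUE)
)

word_bounded   # keeps "@bob" even though "!" follows it
space_bounded  # drops "@bob!" because "!" is not whitespace
```

Which pair to use depends on whether trailing punctuation such as "!" should still count as the end of a handle.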
