R regex lookbehind with a long expression

Question

I have a long character string that comes from a PDF extraction. Below is an MWE:

MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

I want to split this into a list, with each element being one table. I can do that with:

MWE_1 <- as.list(strsplit(MWE, "Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n"))

> MWE_1
[[1]]
[1] "4 BLABLA\r\n "                                                                                 
[2] " Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA 5\r\n "
[3] " Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

But I would like to keep the delimiter, which here is a relatively long regular expression. From what I have read, lookbehinds seem to be a good way to do this. However, I do not know how to combine a lookbehind with my long regular expression. For instance,

MWE_2 <- as.list(strsplit(MWE, "(?<=[Table\\s+\\d+\\.\\s+(([A-z]|[ \t]))+\\r\\n])"))

yields an error:

invalid regular expression '(?<=[Table\s+\d+\.\s+(([A-z]|[  ]))+\r\n])', reason 'Invalid regexp'

How can I do this in a compact way?

Also, is there a direct way to drop the first element?

Answer 1:

Try a lookahead instead, and simplify what you are looking for (shown here with R-specific string escaping):

(?=Table \\d+\\.)

Make sure to set perl=TRUE in the strsplit call.

https://regex101.com/r/Cpyu6k/1

Answer 2:

You wrote: "I am not clear why it does not work with ?<= …"

The R help page "Regular Expressions as used in R" gives the reason (your pattern contains the repetition quantifier +):

Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
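
As a minimal sketch of that restriction (the sample strings and variable names below are made up for illustration and are not from the question): a fixed-length lookbehind should compile with perl=TRUE, while one that contains a quantifier such as + should be rejected when the pattern is compiled.

```r
# Fixed-length lookbehind: accepted by the PCRE engine.
fixed_ok <- grepl("(?<=Table )\\d", "Table 1", perl = TRUE)

# Lookbehind containing the quantifier +: the pattern fails to compile and
# R raises an error (plus a warning, silenced here). NA marks the failure.
variable_result <- suppressWarnings(
  tryCatch(grepl("(?<=Table\\s+)\\d", "Table 1", perl = TRUE),
           error = function(e) NA)
)

print(fixed_ok)
print(variable_result)
```

This is why the long delimiter expression cannot simply be wrapped in (?<=...).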

You also wrote that you still get 5 elements and have no clue why:

> MWE_2
[[1]]
[1] "4 BLABLA\r\n"
[2] " "
[3] "Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017\r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n"
[4] " "
[5] "Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

and that you could delete the empty elements afterwards.

The elements at index [2] and [4] are not empty: each contains one space. That is because the pattern in strsplit(MWE, "(?= Table \\d+\\.)", perl=TRUE) matches a delimiter of length zero, since it consists solely of a zero-width positive lookahead assertion and contains no actual delimiter character. strsplit would go into an infinite loop if it strictly followed its documented algorithm:

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

However, its source code contains this special handling:

            /* Match was empty. */
            pt[0] = *bufp;
            pt[1] = '\0';
            bufp++;

This causes one character at the position of an empty match to be returned (the space in your case) and the search to be continued after it.
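
A small sketch of this behaviour, using a shortened, made-up input string rather than the full MWE: with a pattern that is only a lookahead, the character at the position of the empty match comes back as an element of its own.

```r
# Pattern is only a zero-width lookahead: the space at the match position
# is returned as a separate one-character element.
zero_width <- strsplit("a Table 1", "(?= Table)", perl = TRUE)[[1]]

print(zero_width)
print(nchar(zero_width))
```

The middle element has length 1, so it is a single space, not an empty string.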

The solution is simple: Don't use only a zero-width assertion as the pattern; instead, change it slightly by moving the delimiting space out of the assertion:

strsplit(MWE, " (?=Table \\d+\\.)", perl=TRUE)
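
Put together with the MWE from the question, this might look like the sketch below. The names parts and tables are illustrative, and dropping the first element with [-1] assumes that the text before the first table is never wanted, which the question suggests but which may not hold for every document.

```r
MWE <- "4 BLABLA\r\n Table 1. Real GDP\r\n Percentage changes\r\n 2016 2017 \r\nArgentina -2.5 2.7\r\nAustralia 2.6 2.5\r\n BLABLA \r\n Table 2. Nominal GDP\r\n Percentage changes\r\n 2011 2012\r\nArgentina 31.1 21.1\r\nAustralia 7.7 3.3\r\n"

# The space before "Table" is consumed as the delimiter; the lookahead
# keeps "Table N." at the start of the following element.
parts <- strsplit(MWE, " (?=Table \\d+\\.)", perl = TRUE)[[1]]

# Assumption: the first element is leading text, not a table, so drop it.
tables <- as.list(parts[-1])

print(length(parts))
print(tables)
```

If the input could start directly with a table, filtering with something like grepl("^Table", parts) would be a safer way to select the elements than a fixed [-1].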

Source: https://stackoverflow.com/questions/58596135/r-regex-lookbehind-with-a-long-expression

Tags

regex

string

split