Groovy Regex illegal Characters

Question

I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.

The code that isn't compiling is this:

def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:

[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0

The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.

Thanks!

Update: The code compiles, but it's not filtering as I'd expected it to. In regexpal I put the regex:

[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]

and the test data:

name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field 
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field  
 name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field 
name='class1'>TS</field><field name='freq'>S</field><field 
name='class2'>616.079</field><field name='text'>Subcellular Localization of the 
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast 
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field 
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field   
name='page'>89-97</field><field name='shm'>3146.757500</field><field 
name='pubc'>47</field><field name='cs'>1</field><field

It's a grab from a file containing one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy replaces even the '<' and '>' characters with empty strings, so it basically annihilates the entire document.

The code snippet:

def List parseFile(File file){
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()

    file.eachLine {
        String input ->
        lineCount ++
        String line = input
        if(input =~ illegalChars){
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each(){
            token ->
            def val = getValue(line, token)
            if(val != null){
                if(token.equals("ISSUE")){
                    List entries = val.split(";")
                    document.putAt("year",entries.getAt(0).trim())
                    if(entries.size() > 1){
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if(entries.size() > 2){
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }

    println "done"
    return data
}

I don't see any reason that the two should behave differently; am I missing something?

Again, thanks!

Answer 1:

OK here's my finding:

>>> print "XYZ".replaceAll(
       /[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
       "-"
    )

---

>>> print "X\0YZ".replaceAll(
       /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
       "-"
    )

X-YZ

>>> print "X\0YZ".replaceAll(
       "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
       "-"
    )

X-YZ

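In other words, my \\uNNNN answer within /pattern/ is WRONG. In a slashy string the doubled backslash reaches the regex engine as an escaped literal backslash, so the u0000 that follows is just plain characters. As a result 0-\ becomes a range (from '0' to '\'), and this range includes <, > and all capital letters.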

The \\uNNNN only works in "pattern", not in /pattern/.
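
The same effect can be reproduced in plain Java, since Groovy hands the pattern to java.util.regex. The following is only a sketch: the class name EscapeDemo and the helper strip are made up for illustration, and the DOUBLED pattern is written so that the regex engine receives the same text a doubled backslash in a slashy string would give it.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Regex text with TWO backslashes before each "u", which is what the engine
    // receives from the slashy-string version. Each pair is a literal backslash,
    // so "0-\" turns into a range from '0' to the backslash character.
    static final Pattern DOUBLED = Pattern.compile(
        "[\\\\u0000-\\\\u0008\\\\u000B\\\\u000C\\\\u000E-\\\\u001F\\\\u007F-\\\\u009F]");

    // Regex text with ONE backslash before each "u": the regex engine decodes
    // the escapes itself and the class contains only control characters.
    static final Pattern SINGLE = Pattern.compile(
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]");

    // Replace every match with "-" so the damage is easy to see.
    static String strip(Pattern p, String s) {
        return p.matcher(s).replaceAll("-");
    }

    public static void main(String[] args) {
        System.out.println(strip(DOUBLED, "<XYZ>"));  // prints "-----"
        System.out.println(strip(SINGLE, "<XYZ>"));   // prints "<XYZ>"
        System.out.println(strip(SINGLE, "X\0YZ"));   // prints "X-YZ"
    }
}
```

The first call shows why the XML markup disappears: '<', '>' and the capital letters all fall between '0' and the backslash.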

I will edit my official answer based on comments to this "answer".

Answer 2:

line 23:26: unexpected char: 0x0

This error message points to this part of the code:

def illegalChars = ~/[\u0000-...
12345678901234567890123

It looks like the compiler doesn't accept the Unicode character 0 in the source code: apparently the \u0000 escape is translated into a literal NUL character before the pattern is ever parsed. You should be able to fix this by writing the pattern as a double-quoted string and doubling the backslash. This prevents Unicode escape processing at the source code level and lets the regex engine handle the escapes instead:

def illegals = ~"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"

Note that I've also combined the character classes into one instead of using alternation, and dropped the range notation where it isn't necessary (\u000B-\u000C becomes \u000B\u000C).
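
As a rough Java-side sketch of the combined class in use (Groovy relies on the same java.util.regex engine): the class name XmlSanitizer and the method clean are hypothetical, and replacing each match with a single space simply mirrors what the question's code does.

```java
import java.util.regex.Pattern;

class XmlSanitizer {
    // One combined character class. The doubled backslash in the Java string
    // literal leaves a single backslash in the regex text, so the regex engine
    // (not the compiler) decodes each escape.
    private static final Pattern ILLEGAL = Pattern.compile(
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]");

    // Replace every illegal control character with a space.
    static String clean(String line) {
        return ILLEGAL.matcher(line).replaceAll(" ");
    }

    public static void main(String[] args) {
        // "\007" is an octal escape for the BEL control character.
        String dirty = "<field name='text'>Factor-\007 in Mast</field>";
        System.out.println(clean(dirty));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every line of the input file. Tab, line feed and carriage return are not in the class, so they pass through unchanged.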

References

regular-expressions.info/Character Classes

On doubling the backslash

Here's the relevant quote from java.util.regex.Pattern:

Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.

To illustrate, in Java:

System.out.println("\n".matches("\\u000A")); // prints "true"

However:

System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is because \u000A, the newline character, is translated at the source code level in the second snippet, before the string literal is parsed. The source code essentially becomes:

System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"

This is not legal Java source code.

Answer 3:

Try this regular expression to remove Unicode characters from the string:

/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/

Answer 4:

Try:

def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/

Source: https://stackoverflow.com/questions/3240356/groovy-regex-illegal-characters

Tags

regex

groovy