Question
I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update: The code compiles, but it's not filtering as I'd expected it to. In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file containing one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy is replacing even the '<' and '>' characters with empty strings, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
    println "reading File name: ${file.name}"
    def lineCount = 0
    List data = new ArrayList()
    file.eachLine { String input ->
        lineCount++
        String line = input
        if(input =~ illegalChars){
            line = input.replaceAll(illegalChars, " ")
        }
        Map document = new HashMap()
        elementNames.each(){ token ->
            def val = getValue(line, token)
            if(val != null){
                if(token.equals("ISSUE")){
                    List entries = val.split(";")
                    document.putAt("year", entries.getAt(0).trim())
                    if(entries.size() > 1){
                        document.putAt("volume", entries.getAt(1).trim())
                    }
                    if(entries.size() > 2){
                        document.putAt("issue", entries.getAt(2).trim())
                    }
                } else {
                    document.putAt(token, val)
                }
            }
        }
        data.add(document)
    }
    println "done"
    return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!
Answer 1:
OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that 0-\ becomes part of the range, and this includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
Related questions
- How to escape Unicode escapes in Groovy’s /pattern/ syntax
Answer 2:
line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like for some reason the compiler doesn't like having the Unicode 0 character in the source code. That said, you should be able to fix this by doubling the slash. This prevents Unicode escapes from being processed at the source code level and lets the regex engine handle the Unicode instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of using alternation, and I've removed the range definitions where they're not necessary.
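As an illustration of the combined character class, here is a small Java sketch that strips the same code points from a line. It is written in Java rather than Groovy because the escaping rules of a Java string literal are unambiguous here; the class name `XmlControlCharStripper` and the method `strip` are hypothetical, and replacing with a space simply mirrors the question's `replaceAll(illegalChars, " ")`.

```java
import java.util.regex.Pattern;

class XmlControlCharStripper {
    // Single character class covering the same code points as the pattern above.
    // The backslashes are doubled for the Java string literal, so the compiler
    // leaves them alone and the regex engine interprets the Unicode escapes.
    private static final Pattern ILLEGAL = Pattern.compile(
        "[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]");

    // Hypothetical helper: replace each illegal character with a space.
    static String strip(String input) {
        return ILLEGAL.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // \001 is an octal escape for the control character U+0001.
        String dirty = "<field name='text'>Tumor Necrosis Factor-\001 in Mast Cells</field>";
        System.out.println(strip(dirty));
    }
}
```

Tab, line feed and carriage return are left out of the class on purpose, so ordinary whitespace and the markup itself pass through untouched.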
References
- regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern:
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is translated in the second snippet at the source code level. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.
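The quoted claim about "\u2014" and "\\u2014" can be checked with a short Java sketch. The class name `EscapeLevels` and the helper `matchesEmDash` are invented for this example; it only shows the two strings side by side and is not part of the original answer.

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // Translated by the Java compiler: the literal holds a single character.
    static final String SOURCE_LEVEL = "\u2014";

    // Left alone by the compiler: the literal holds six characters, which the
    // regex engine later reads as one Unicode escape.
    static final String REGEX_LEVEL = "\\u2014";

    // Hypothetical helper: does the given regex match the character U+2014?
    static boolean matchesEmDash(String regex) {
        return Pattern.matches(regex, "\u2014");
    }

    public static void main(String[] args) {
        System.out.println(SOURCE_LEVEL.length());             // 1
        System.out.println(REGEX_LEVEL.length());              // 6
        System.out.println(SOURCE_LEVEL.equals(REGEX_LEVEL));  // false
        System.out.println(matchesEmDash(SOURCE_LEVEL));       // true
        System.out.println(matchesEmDash(REGEX_LEVEL));        // true
    }
}
```

The strings differ, yet both compile to a pattern matching the same character, which is why deferring the escape to the regex engine is safe for code points the compiler would otherwise reject.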
Answer 3:
Try this regular expression to remove Unicode characters from the string:
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
Answer 4:
Try:
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
Source: https://stackoverflow.com/questions/3240356/groovy-regex-illegal-characters