Question related to this:
I have a string
a\;b\\;c;d
which in Java looks like
String s = "a\\;b\\\\;c;d";
String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");
This should work.
Explanation :
// (?<!(?<!\\)\\);
//
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»
So you just match the semicolons that are not preceded by exactly one \.
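As a quick check of the pattern above, here is a minimal sketch that applies it to the sample string from the question. The class name LookbehindSplitDemo and the method splitOnUnescaped are made up for illustration only.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // Split on every semicolon that is not preceded by exactly one backslash.
    static String[] splitOnUnescaped(String subjectString) {
        return subjectString.split("(?<!(?<!\\\\)\\\\);");
    }

    public static void main(String[] args) {
        // Java literal for the text a\;b\\;c;d
        String s = "a\\;b\\\\;c;d";
        System.out.println(Arrays.toString(splitOnUnescaped(s)));
    }
}
```

For this input the semicolon after a\ stays inside the first piece, while the semicolon after b\\ and the one after c act as separators, giving the three pieces a\;b\\ and c and d.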
EDIT :
String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");
This will take care of any odd number of \. It will of course fail if you have more than 4000000 \ characters in a row. Explanation of the edited answer:
// (?<!(?<!\\(\\\\){0,2000000})\\);
//
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»
This approach assumes that your string does not contain the char '\0'. If it does, you can use some other char as the placeholder.
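public static String[] split(String s) {
    // Hide each escaped semicolon behind a '\0' placeholder, then split on the remaining semicolons.
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        // Put the escaped semicolons back into each piece.
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}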
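To see what the placeholder approach produces, here is a small self-contained sketch that wraps the same method and runs it on the sample string from the question. The class name PlaceholderSplitDemo is an assumption made for this example.

```java
import java.util.Arrays;

class PlaceholderSplitDemo {
    // Same logic as above: swap escaped semicolons for '\0', split, then swap back.
    static String[] split(String s) {
        String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
        for (int i = 0; i < result.length; i++) {
            result[i] = result[i].replaceAll("\0", "\\\\;");
        }
        return result;
    }

    public static void main(String[] args) {
        // Java literal for the text a\;b\\;c;d
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
        // Java literal for the text \;a;b (escaped semicolon at the very start)
        System.out.println(Arrays.toString(split("\\;a;b")));
    }
}
```

The first call yields the pieces a\;b\\ and c and d. Note that the pattern needs some character in front of the backslash, so in the second call the escaped semicolon at the very start of the string is still treated as a separator.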
I think this is the real answer.
In my case I am trying to split on | and the escape character is &.
final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
System.out.println(Arrays.toString(res));
In this code I am using a lookbehind to handle the & escape character. Note that the lookbehind must have a maximum length, which is why the repetition is bounded as {0,10000}.
(?<!((?:[^&]|^)(&&){0,10000}&))\\|
This means any | except those that follow ((?:[^&]|^)(&&){0,10000}&), and that part means any odd number of &s.
The part (?:[^&]|^) is important to make sure that you are counting all of the &s behind the |, back to the beginning of the string or to some other character.
You can use the regex
(?:\\.|[^;\\]++)*
to match all text between unescaped semicolons:
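// Needs: import java.util.*; import java.util.regex.*;
List<String> matchList = new ArrayList<String>();
try {
    Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
    Matcher regexMatcher = regex.matcher(subjectString);
    // The pattern can also match the empty string, so empty matches may end up in the list.
    while (regexMatcher.find()) {
        matchList.add(regexMatcher.group());
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}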
Explanation:
(?:          # Match either...
 \\.         # any escaped character
|            # or...
 [^;\\]++    # any character(s) except semicolon or backslash; possessive match
)*           # Repeat any number of times.
The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.
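A minimal sketch of how the matching approach could be packaged, assuming that empty matches are unwanted. That assumption also means genuinely empty fields (for example in a;;b) are dropped here. The class name UnescapedFieldMatcher and the method fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapedFieldMatcher {
    // Same pattern as above: escaped pairs or possessive runs of ordinary characters.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    static List<String> fields(String subjectString) {
        List<String> matchList = new ArrayList<>();
        Matcher matcher = FIELD.matcher(subjectString);
        while (matcher.find()) {
            // Assumption: skip empty matches, which also discards empty fields.
            if (!matcher.group().isEmpty()) {
                matchList.add(matcher.group());
            }
        }
        return matchList;
    }

    public static void main(String[] args) {
        // Java literal for the text a\;b\\;c;d
        System.out.println(fields("a\\;b\\\\;c;d"));
    }
}
```

If empty fields matter for your data, the filter would have to be replaced by logic that tells a real empty field apart from the empty match the pattern can produce at a separator.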
I don't trust any kind of regular expression to detect those cases reliably. I usually write a simple loop for such things. I'll sketch it in C, since it's been ages since I last touched Java ;-)
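/* Needs <stdio.h> and <string.h>; myString is a NUL-terminated char array. */
int i, len, state;
char c;
/* state 0 = normal, state 1 = the previous character was an escaping backslash */
for (len = (int) strlen(myString), state = 0, i = 0; i < len; i++) {
    c = myString[i];
    if (state == 0) {
        if (c == '\\') {
            state++;    /* the next character is escaped */
        } else if (c == ';') {
            printf("; at offset %d\n", i);
        }
    } else {
        state--;        /* the escaped character has been consumed */
    }
}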
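For readers who want the same loop in Java, here is a rough port that collects the pieces instead of printing offsets. It is only a sketch: the class name EscapeAwareSplitter, the method signature, and the choice to keep the escape characters in the output are all assumptions, not part of the loop above.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareSplitter {
    // Split input on delimiter, treating any character that follows escape as literal.
    static List<String> split(String input, char delimiter, char escape) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false; // true when the previous character was an active escape
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                current.append(c);   // escaped character, never a separator
                escaped = false;
            } else if (c == escape) {
                current.append(c);   // assumption: keep the escape character in the piece
                escaped = true;
            } else if (c == delimiter) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // last piece, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        // Java literal for the text a\;b\\;c;d
        System.out.println(split("a\\;b\\\\;c;d", ';', '\\'));
    }
}
```

Unlike String.split, this sketch keeps a trailing empty piece when the input ends with a separator.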
The advantages are: