问题
I am writing a String parser that I use to parse all strings from a text file, The strings can be inside single or double quotes, Pretty simple right? well not really. I wrote a regex to match strings how I want. but it's giving me StackOverFlow
error on big strings (I am aware java isn't really good with regex stuff on big strings), This is the regex pattern (['"])(?:(?!\1|\\).|\\.)*\1
This works good for all the string inputs that I need, but as soon as theres a big string it throws StackOverFlow
error, I have read similar questions based on this, such as this which suggests to use StringUtils.substringsBetween
, but that fails on strings like '""'
, "\\\""
So my question is what should I do to solve this issue? I can provide more context if needed, Just comment.
Edit: After testing the answer
Code:
public static void main(String[] args) {
final String regex = "'([^']*)'|\"(.*)\"";
final String string = "local b = { [\"\\\\\"] = \"\\\\\\\\\", [\"\\\"\"] = \"\\\\\\\"\", [\"\\b\"] = \"\\\\b\", [\"\\f\"] = \"\\\\f\", [\"\\n\"] = \"\\\\n\", [\"\\r\"] = \"\\\\r\", [\"\\t\"] = \"\\\\t\" }\n" +
"local c = { [\"\\\\/\"] = \"/\" }";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
Output:
Full match: "\\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t"
Group 1: null
Group 2: \\"] = "\\\\", ["\""] = "\\\"", ["\b"] = "\\b", ["\f"] = "\\f", ["\n"] = "\\n", ["\r"] = "\\r", ["\t"] = "\\t
Full match: "\\/"] = "/"
Group 1: null
Group 2: \\/"] = "/
It's not handling the escaped quotes correctly.
回答1:
I would try without capture quote type/lookahead/backref to improve performance. See this question for escaped characters in quoted strings. It contains a nice answer that is unrolled. Try like
'[^\\']*(?:\\.[^\\']*)*'|"[^\\"]*(?:\\.[^\\"]*)*"
As a Java String:
String regex = "'[^\\\\']*(?:\\\\.[^\\\\']*)*'|\"[^\\\\\"]*(?:\\\\.[^\\\\\"]*)*\"";
The left side handles single quoted, the right double quoted strings. If either kind overbalances the other in your source, put that preferably on the left side of the pipe.
See this a demo at regex101 (if you need to capture what's inside the quotes, use groups)
回答2:
For the overflow state, you would probably want to allocate whatever resources that'd be required. You'd likely want to design small benchmark tests and find out about the practical resources that might be necessary to finalize your task.
Another option would be to find some other strategies or maybe languages to solve your problem. For instance, if you could classify your strings into two categories of '
or "
wrapped to find some other optimal solutions.
Otherwise, you might want to try designing simple expressions and avoid back-referencing, such as with:
'([^']*)'|"(.*)"
which would probably fail for some other inputs that you might have and we don't know of.
Or maybe present your question slightly more technical such that some experienced users might be able to provide better answers, such as this answer.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class RegularExpression{
public static void main(String[] args){
final String regex = "'([^']*)'|\"(.*)\"";
final String string = "'\"\"'\n"
+ "\"\\\\\\\"\"";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
}
}
Output
Full match: '""'
Group 1: ""
Group 2: null
Full match: "\\\""
Group 1: null
Group 2: \\\"
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
来源:https://stackoverflow.com/questions/58333132/getting-data-between-single-and-double-quotes-special-case