Question
I'm using regular expressions to parse logs. Previously I read the file into a list of lines and iterated over it: if a line does not start with a timestamp, I append it to the entry I'm currently building; if it does, I store the finished entry and start a new one. Once I have a complete log entry, I use another regular expression to parse it.
Scanning file
try {
    List<String> lines = Files.readAllLines(filepath);
    Pattern pattern = Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");
    Matcher matcher;
    String currentEntry = "";
    for (String line : lines) {
        matcher = pattern.matcher(line);
        // If this is a new entry, then wrap up the previous one and start again
        if (matcher.lookingAt()) {
            // If the previous entry was not empty
            if (!StringUtils.trimWhitespace(currentEntry).isEmpty()) {
                entries.add(new LogEntry(currentEntry));
            }
            // Clear the current entry
            currentEntry = "";
        }
        if (!currentEntry.trim().isEmpty())
            currentEntry += "\n";
        currentEntry += line;
    }
    // At the end, if we have one leftover entry, add it
    if (!currentEntry.isEmpty()) {
        entries.add(new LogEntry(currentEntry));
    }
} catch (Exception ex) {
    return null;
}
Parsing entry
final private static String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final private static String levelRgx = "(?<level>(?>INFO|ERROR|WARN|TRACE|DEBUG|FATAL))";
final private static String classRgx = "\\[(?<class>[^]]+)\\]";
final private static String threadRgx = "\\[(?<thread>[^]]+)\\]";
final private static String textRgx = "(?<text>.*)";
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx + "$", Pattern.DOTALL);
public LogEntry(String logText) {
    try {
        Matcher matcher = PatternFullLog.matcher(logText);
        matcher.find();
        String dateStr = matcher.group("timestamp");
        timestamp = new DateLogLevel();
        timestamp.parseLogDate(dateStr);
        String levelStr = matcher.group("level");
        loglevel = LOG_LEVEL.valueOf(levelStr);
        String fullClassStr = matcher.group("class");
        String[] classNameArray = fullClassStr.split("\\.");
        framework = classNameArray[2];
        classname = classNameArray[classNameArray.length - 1];
        threadname = matcher.group("thread");
        logtext = matcher.group("text");
        notes = "";
    } catch (Exception ex) {
        throw ex;
    }
}
What I want to figure out
What I really want to do is read the whole file into a single string and then parse it entry by entry with a single regular expression. My plan was to reuse the expression from the constructor, but make the log text group end at either EOF or the start of the next log entry, like this:
final String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final String levelRgx = "(?<level>(?>INFO|ERROR|WARN|TRACE|DEBUG|FATAL))";
final String classRgx = "\\[(?<class>[^]]+)\\]";
final String threadRgx = "\\[(?<thread>[^]]+)\\]";
final String textRgx = "(?<text>.*[^(\Z|\\d{4}\-\\d{2}\-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})"; // change to handle multiple lines
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx + "$", Pattern.DOTALL);
try {
    // Read file into string
    String lines = readFile(filepath);
    // Match the full-entry pattern against the whole file content
    Matcher matcher = PatternFullLog.matcher(lines);
    while (matcher.find()) {
        String dateStr = matcher.group("timestamp");
        timestamp = new DateLogLevel();
        timestamp.parseLogDate(dateStr);
        String levelStr = matcher.group("level");
        loglevel = LOG_LEVEL.valueOf(levelStr);
        String fullClassStr = matcher.group("class");
        String[] classNameArray = fullClassStr.split("\\.");
        framework = classNameArray[2];
        classname = classNameArray[classNameArray.length - 1];
        threadname = matcher.group("thread");
        logtext = matcher.group("text");
        entries.add(
            new LogEntry(
                timestamp,
                loglevel,
                framework,
                threadname,
                logtext,
                "" /* Notes are empty when importing new file */));
    }
} catch (Exception ex) {
    return null;
}
The problem is that I can't seem to get the last group (textRgx) to multiline match until either a timestamp or end of file. Does anyone have any thoughts?
Sample Log Entries
2017-03-14 22:43:14,405 FATAL [org.springframework.web.context.support.XmlWebApplicationContext]-[localhost-startStop-1] Refreshing Root WebApplicationContext: startup date [Tue Mar 14 22:43:14 UTC 2017]; root of context hierarchy
2017-03-14 22:43:14,476 INFO [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Loading XML bean definitions from Serv
2017-03-14 22:43:14,476 INFO [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Here is a multiline
log entry with another entry after
2017-03-14 22:43:14,476 INFO [org.springframework.beans.factory.xml.XmlBeanDefinitionReader]-[localhost-startStop-1] Here is a multiline
log entry with no entries after
Answer 1:
You need to define the patterns like this:
final static String timestampRgx = "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})";
final static String levelRgx = "(?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)";
final static String classRgx = "\\[(?<class>[^\\]]+)]";
final static String threadRgx = "\\[(?<thread>[^\\]]+)]";
final static String textRgx = "(?<text>.*?)(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}|\\Z)";
private static Pattern PatternFullLog = Pattern.compile(timestampRgx + " " + levelRgx + "\\s+" + classRgx + "-" + threadRgx + "\\s+" + textRgx, Pattern.DOTALL);
Then you can use it like this:
Matcher matcher = PatternFullLog.matcher(lines);
See the Java demo
Here is what the pattern looks like:
(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)\s+\[(?<class>[^\]]+)]-\[(?<thread>[^\]]+)]\s+(?<text>.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}|\Z)
See the regex demo.
Some notes:
- You had several issues with escaping symbols: ] inside a character class must be escaped, and \- should have been replaced with -.
- The pattern to match text up to the datetime or end of string is (?<text>.*?)(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}|\Z), where .*? matches any char, 0+ occurrences, reluctantly, up to the first occurrence of the timestamp pattern (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) or end of string (\Z).
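To see the pattern work end to end, here is a minimal, self-contained sketch that runs it over a whole log string in a find() loop. The class name LogParseSketch, the parse helper, the String[] result shape, and the shortened package names in the sample input are all hypothetical and only there for illustration. The trim() call on the text group is also an assumption, added so the line break before the next timestamp does not end up in the captured text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original code base.
class LogParseSketch {
    static final String TIMESTAMP =
            "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}";

    // Same pattern as in the answer, assembled from one shared timestamp piece.
    static final Pattern FULL_LOG = Pattern.compile(
            "(?<timestamp>" + TIMESTAMP + ")"
            + " (?<level>INFO|ERROR|WARN|TRACE|DEBUG|FATAL)"
            + "\\s+\\[(?<class>[^\\]]+)]-\\[(?<thread>[^\\]]+)]"
            + "\\s+(?<text>.*?)(?=" + TIMESTAMP + "|\\Z)",
            Pattern.DOTALL); // DOTALL lets '.' cross line breaks

    /** Returns one {timestamp, level, class, thread, text} array per entry. */
    static List<String[]> parse(String content) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = FULL_LOG.matcher(content);
        while (m.find()) { // each find() consumes exactly one log entry
            entries.add(new String[] {
                m.group("timestamp"),
                m.group("level"),
                m.group("class"),
                m.group("thread"),
                m.group("text").trim() // assumption: drop the trailing line break
            });
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample input in the same layout as the log entries above.
        String content = String.join("\n",
            "2017-03-14 22:43:14,405 FATAL [org.example.web.Ctx]-[main-1] Single line entry",
            "2017-03-14 22:43:14,476 INFO [org.example.beans.Reader]-[main-1] Here is a multiline",
            "log entry with no entries after");
        for (String[] e : parse(content)) {
            System.out.println(String.join(" | ", e));
        }
    }
}
```

One thing to keep in mind with this approach: the lookahead stops at the first timestamp-shaped substring anywhere in the text, not only at the start of a line, so a message that quotes a timestamp in exactly this format would be split early.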
Source: https://stackoverflow.com/questions/43121432/regular-expression-log-parsing