Question
I am opening this question because it seems my original question requires a new direction.
I would like to create a regular expression that can extract STATIC MESSAGE and DYNAMIC MESSAGE from the following types of log entries:
/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message
/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message
One log entry type has a simple structure:
file:date TYPE STATIC;DYNAMIC
The other is not so simple to parse with a regex:
file:date MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC
where MODULE.NAME and CONNECTION.OR.THREAD are either both present or both absent.
My regular expression so far, which works on the first type of log entry, is:
(?:.*?):(?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2})(?:\s+?)(?:[\S|\.]*?(?:\s*?))?(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+?(?:\s+?))?(.+){1}(?:;(.+)){1}
But whenever I get to the second type of entry, I also get CONNECTION.OR.THREAD as part of my first capturing group.
I am hoping for a way to use lookahead or lookbehind so that I can capture STATIC and DYNAMIC and ignore the CONNECTION.OR.THREAD part when there is a MODULE.NAME.
I hope this question is clear; please refer to my original question if it seems unclear. Thank you.
EDIT: For clarification: every line of the log is different from the others. Each line starts with a file path, then a :, then the date in the format MMM DD HH:MM:SS. Then it gets tricky: either a MODULE.NAME (which varies), followed by the TYPE (which also varies), followed by CONNECTION.OR.THREAD (which varies), or just the TYPE. After that comes the STATIC MESSAGE, then a ;, then a DYNAMIC MESSAGE.
Both the static and the dynamic message vary. The term STATIC is used simply because an error can be, for instance, "unable to connect to server; server1.com", so the static part of the error is "unable to connect to server" and the dynamic part is "server1.com".
Answer 1:
At the moment I have made this massive regex:
(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)(?:(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))|(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1})))
I will split it into parts:
FILE/DATE + SPACE:
(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)
and then EITHER:
SIMPLE: (TYPE STATIC;DYNAMIC)
(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))
OR COMPLEX: (MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC)
(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1}))
It does the trick, but it is huge and I think it can be improved. If anyone can improve it, please do.
EDIT:
There is a problem, though: there are now 4 capturing groups, so I cannot know ahead of time whether to look in captured[0:2] or captured[2:4] for my results. Does anyone have a way to do this without checking each time whether something is there? Or perhaps a way to eliminate empty capturing groups from the results, or to get only the non-empty results from the list? My brain is fried.
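One possible way to deal with the four groups, offered only as a sketch: with `match.groups()`, the groups of the alternative that did not take part come back as `None`, so they can be filtered out. The pattern below is rebuilt from the parts described above; the helper name `static_and_dynamic` and the constant names are made up for illustration, and TYPE1/TYPE2/TYPE3 remain placeholders for the real log types.

```python
import re

# Placeholder type names, as in the post; the real log types are not given.
TYPES = r"(?:TYPE1|TYPE2|TYPE3)"
DOTTED = r"\S+(?:\.\S+)+"  # MODULE.NAME or CONNECTION.OR.THREAD
FILE_DATE = r".*?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}\s+"
TAIL = r"(.+);(.+)"  # STATIC;DYNAMIC

SIMPLE = TYPES + r"\s+" + TAIL
COMPLEX = DOTTED + r"\s+" + TYPES + r"\s+" + DOTTED + r"\s+" + TAIL
PATTERN = re.compile(FILE_DATE + "(?:" + SIMPLE + "|" + COMPLEX + ")")


def static_and_dynamic(line):
    """Return (static, dynamic) for a matching line, or None otherwise."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Only one alternative matches, so exactly two groups are not None.
    static, dynamic = [g for g in match.groups() if g is not None]
    return static, dynamic
```

With `re.findall`, groups that did not take part are returned as empty strings rather than `None`, so the filter there would test for truthiness instead.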
EDIT2:
As @Martijn Pieters suggested, I removed the extraneous grouping. This is my current regex:
.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))
which works fine. I am concerned about (?:TYPE1|TYPE2|TYPE3) being misinterpreted as TYPE(1|T)YPE(2|T)YPE3. Any insight would be appreciated.
Also, how best to go about parsing my results, seeing as I will get a list of 4 items with either the first 2 or the last 2 being empty and the others holding my static/dynamic results?
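On the alternation concern: in Python's `re` syntax, `|` has the lowest precedence, so inside the group each of TYPE1, TYPE2 and TYPE3 is a complete alternative. A small check, with the feared misreading written out as its own pattern for comparison (the names `TYPES` and `MISREAD` are only for this illustration):

```python
import re

TYPES = re.compile(r"(?:TYPE1|TYPE2|TYPE3)")
# The feared misreading, spelled out explicitly.
MISREAD = re.compile(r"TYPE(1|T)YPE(2|T)YPE3")

for candidate in ("TYPE1", "TYPE2", "TYPE3", "TYPE1YPE2YPE3"):
    # Print whether each pattern accepts the whole candidate string.
    print(candidate,
          bool(TYPES.fullmatch(candidate)),
          bool(MISREAD.fullmatch(candidate)))
```

The grouped alternation accepts the three type names and rejects the run-together string, while the misreading does the opposite.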
EDIT3:
Okay, I have settled on a hybrid solution. I have remade my regular expression:
.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3))|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))\s+(.*)
I now have only 1 capture group, which is the STATIC;DYNAMIC part. Once I have it, I do what I was doing before (see my previous question):
    for item in captured:
        # Split on ";"; anything after the first ";" belongs to the dynamic part.
        parts = item.split(";")
        static = parts[0]
        dynamic = ";".join(parts[1:])
That is my solution. Thank you, @Martijn Pieters especially, for your help. I hope this can help someone in the future.
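For anyone who wants to try the hybrid approach end to end, here is a minimal sketch. It uses the EDIT3 pattern as posted, with TYPE1/TYPE2/TYPE3 still standing in for the real type names. The function name `split_entry` is hypothetical, and `str.partition` is swapped in for the split/join pair because it splits on the first ";" only, which gives the same result.

```python
import re

# The EDIT3 pattern as posted; TYPE1..TYPE3 are placeholders.
PATTERN = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3))"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))"
    r"\s+(.*)"
)


def split_entry(line):
    """Return (static, dynamic) for a matching log line, or None otherwise."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Everything after the first ";" is treated as the dynamic part.
    static, _, dynamic = match.group(1).partition(";")
    return static, dynamic


simple = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
          "TYPE1 Static Message;Dynamic Message")
complex_ = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
            "MODULE.NAME TYPE2 THREAD.OR.CONNECTION.INFORMATION "
            "Static Message;Dynamic Message")

print(split_entry(simple))    # -> ('Static Message', 'Dynamic Message')
print(split_entry(complex_))  # -> ('Static Message', 'Dynamic Message')
```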
Source: https://stackoverflow.com/questions/12248696/python-regex-conditional-substring-extraction