Python regex conditional substring extraction

Posted by 隐身守侯 on 2019-12-11 09:36:49

Question


I am opening this question because it seems my original question requires a new direction.

I would like to create a regular expression that can extract the STATIC MESSAGE and the DYNAMIC MESSAGE from the following types of log entries:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

One log entry type has a simple structure:

file:date TYPE STATIC;DYNAMIC

The other is harder to parse with a regex:

file:date MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC

where MODULE.NAME and CONNECTION.OR.THREAD are either both present or both absent.

My regular expression so far, which works on the first type of log entry, is:

(?:.*?):(?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2})(?:\s+?)(?:[\S|\.]*?(?:\s*?))?(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+?(?:\s+?))?(.+){1}(?:;(.+)){1}

But whenever I get to the second type of entry, I also get CONNECTION.OR.THREAD as part of my first capturing group.

I am hoping for a way to use lookahead or lookbehind so that I can capture STATIC and DYNAMIC and ignore the CONNECTION.OR.THREAD part whenever there is a MODULE.NAME.

I hope this question is clear; please refer to my original question if anything is unclear. Thank you.

EDIT: For clarification, every line of the log is different from the others. Each line starts with a file path, then a :, then the date in the format MMM DD HH:MM:SS. Then it gets tricky: what follows is either a MODULE.NAME (which varies), then the TYPE (which also varies), then CONNECTION.OR.THREAD (which varies), or just the TYPE. After that comes the STATIC MESSAGE, then a ;, then the DYNAMIC MESSAGE. Both the static and the dynamic message vary. I use the term STATIC simply because an error can be, for instance, "unable to connect to server; server1.com", where the static part of the error is "unable to connect to server" and the dynamic part is "server1.com".


Answer 1:


At the moment I have made this massive regex:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)(?:(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))|(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1})))

I will split it into parts:

FILE/DATE + SPACE:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)

and then EITHER:

SIMPLE: (TYPE STATIC;DYNAMIC)

(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))

OR COMPLEX: (MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC)

(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1}))

It does the trick, but it's huge and I think it can be improved. If anyone can improve it, please do.
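The split described above can be mirrored in code by building the pattern from named pieces. This is only a sketch: the constant names (PREFIX, DOTTED, TYPES, MESSAGE) are made up for illustration, TYPE1 to TYPE3 are the placeholders from the post, and the redundant non-capturing groups and {1} quantifiers are dropped.

```python
import re

# Placeholder type names from the post; real logs would use their own.
TYPES = r"(?:TYPE1|TYPE2|TYPE3)"
# File path, a colon, a date such as "Jan 01 12:00:00", then whitespace.
PREFIX = r".*?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2}\s+"
# Dotted token such as MODULE.NAME or THREAD.OR.CONNECTION.INFORMATION.
DOTTED = r"\S+(?:\.\S+)+"
# Static message, a semicolon, dynamic message (two capturing groups).
MESSAGE = r"(.+);(.+)"

SIMPLE = TYPES + r"\s+" + MESSAGE
COMPLEX = DOTTED + r"\s+" + TYPES + r"\s+" + DOTTED + r"\s+" + MESSAGE

LOG_RE = re.compile(PREFIX + "(?:" + SIMPLE + "|" + COMPLEX + ")")

simple_line = (
    "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
    "TYPE1 Static Message;Dynamic Message"
)
complex_line = (
    "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
    "MODULE.NAME TYPE2 THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message"
)

simple_groups = LOG_RE.match(simple_line).groups()
complex_groups = LOG_RE.match(complex_line).groups()
print(simple_groups)
print(complex_groups)
```

Because each branch carries its own pair of capturing groups, the pattern has four groups in total, and the two belonging to the branch that did not match come back as None.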

EDIT:

There is a problem, though: there are now 4 capturing groups, so I cannot know ahead of time whether I must look in captured[0:2] or captured[2:4] for my results. Does anyone have a way to do this so that I do not have to check each time whether I have something there? Or perhaps a way to eliminate empty capturing groups from the results, or to get only the non-empty results from the list? My brain is fried.
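One possible way around the four-group problem is to drop the groups that did not take part in the match, since Python's re module reports those as None. The sketch below assumes a compact stand-in for the pattern above; the function name static_and_dynamic and the sample log lines are invented for illustration.

```python
import re

# Compact stand-in for the four-group pattern; TYPE1..TYPE3 are placeholders.
FOUR_GROUPS = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}\s+"
    r"(?:(?:TYPE1|TYPE2|TYPE3)\s+(.+);(.+)"
    r"|\S+(?:\.\S+)+\s+(?:TYPE1|TYPE2|TYPE3)\s+\S+(?:\.\S+)+\s+(.+);(.+))"
)

def static_and_dynamic(line):
    """Return (static, dynamic) for a matching line, or None if it does not match."""
    match = FOUR_GROUPS.match(line)
    if match is None:
        return None
    # Groups from the branch that was not used are None, so filter them out.
    return tuple(g for g in match.groups() if g is not None)

# Made-up sample lines in the two formats described in the question.
simple = "app.log:Jan 01 12:00:00 TYPE1 unable to connect to server;server1.com"
complex_ = (
    "app.log:Jan 01 12:00:00 net.client TYPE3 conn.42 "
    "unable to connect to server;server1.com"
)

print(static_and_dynamic(simple))
print(static_and_dynamic(complex_))
print(static_and_dynamic("not a log line"))
```

Both formats then yield the same two-item tuple, so the caller never has to know which branch matched.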

EDIT2:

As @Martijn Pieters suggested, I removed the extraneous grouping. This is my current regex:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))

which works fine. I am concerned about (?:TYPE1|TYPE2|TYPE3) being misinterpreted as TYPE(1|T)YPE(2|T)YPE3; any insight would be appreciated.

Also, how should I best go about parsing my results, seeing as I will get a list of 4 items with either the first 2 or the last 2 being empty and the others holding my static/dynamic results?
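On the alternation concern: in Python's re module, | separates whole alternatives within the enclosing group, so (?:TYPE1|TYPE2|TYPE3) means "TYPE1 or TYPE2 or TYPE3". A quick check, with variable names chosen only for this sketch:

```python
import re

types = re.compile(r"(?:TYPE1|TYPE2|TYPE3)")
# The feared misreading, written out explicitly for comparison.
misread = re.compile(r"TYPE(1|T)YPE(2|T)YPE3")

# Each full type name matches on its own.
accepted = [name for name in ("TYPE1", "TYPE2", "TYPE3") if types.fullmatch(name)]

# A string that only the misread pattern would accept in full.
odd = "TYPE1YPE2YPE3"
odd_matches_real = types.fullmatch(odd) is not None
odd_matches_misread = misread.fullmatch(odd) is not None

print(accepted, odd_matches_real, odd_matches_misread)
# ['TYPE1', 'TYPE2', 'TYPE3'] False True
```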

EDIT3:

Okay, I have done a hybrid solution. I have remade my regular expression:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3))|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))\s+(.*)

I now only have 1 capture group, which is the STATIC;DYNAMIC part. Once I get this, I do what I was doing before (see my previous question):

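# captured is assumed to hold the STATIC;DYNAMIC strings matched by the capture group
for item in captured:
    parts = item.split(";")
    static = parts[0]  # text before the first semicolon
    dynamic = ";".join(parts[1:])  # the rest, with any later semicolons restored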
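Put together, the hybrid approach might look like the sketch below. The function name parse and the sample lines are invented for illustration, and str.partition is used as an equivalent shorthand for the split/join pair above (it also splits at the first semicolon only).

```python
import re

# The single-group pattern from EDIT3, with the redundant outer grouping dropped
# and split across lines for readability.
HYBRID = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:TYPE1|TYPE2|TYPE3)"
    r"|\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+)"
    r"\s+(.*)"
)

def parse(lines):
    """Return a list of (static, dynamic) pairs for the lines that match."""
    results = []
    for line in lines:
        match = HYBRID.match(line)
        if match is None:
            continue  # skip lines that are not in either log format
        # Split at the first semicolon; later semicolons stay in the dynamic part.
        static, _, dynamic = match.group(1).partition(";")
        results.append((static, dynamic))
    return results

# Made-up sample lines in the two formats described in the question.
sample = [
    "svc.log:Jan 01 12:00:00 TYPE1 unable to connect to server;server1.com;retry=3",
    "svc.log:Jan 01 12:00:01 net.client TYPE2 conn.42 timeout;server2.com",
    "this line is not a log entry",
]
parsed = parse(sample)
print(parsed)
```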

That is my solution. Thank you, @Martijn Pieters, especially for your help. I hope this can help someone in the future.
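An alternative that the post does not use, offered only as a sketch: Python's re module supports conditional patterns of the form (?(1)yes-pattern), which match the yes-pattern only if group 1 took part in the match. That allows CONNECTION.OR.THREAD to be skipped exactly when a MODULE.NAME was found, without duplicating the message groups. The function name extract and the sample lines are assumptions for illustration.

```python
import re

CONDITIONAL = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+"
    r"(?:(\S+(?:\.\S+)+)\s+)?"   # group 1: optional MODULE.NAME
    r"(?:TYPE1|TYPE2|TYPE3)\s+"  # placeholder type names from the post
    r"(?(1)\S+(?:\.\S+)+\s+)"    # only if group 1 matched: skip CONNECTION.OR.THREAD
    r"([^;]*);(.*)"              # group 2: static part, group 3: dynamic part
)

def extract(line):
    """Return (static, dynamic) for a matching line, or None if it does not match."""
    match = CONDITIONAL.match(line)
    if match is None:
        return None
    return match.group(2), match.group(3)

# Made-up sample lines in the two formats described in the question.
simple = "svc.log:Jan 01 12:00:00 TYPE1 unable to connect to server;server1.com"
complex_ = (
    "svc.log:Jan 01 12:00:00 net.client TYPE3 conn.42 "
    "unable to connect to server;server1.com"
)

print(extract(simple))
print(extract(complex_))
```

Using [^;]* for the static part splits at the first semicolon, which matches the behaviour of the split/join code in EDIT3.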



Source: https://stackoverflow.com/questions/12248696/python-regex-conditional-substring-extraction
