very specific substring retrieval and split

Submitted by 半腔热情 on 2020-01-16 05:27:09

Question


I know there are tons of posts about substrings; believe me, I have searched through many of them looking for an answer to this.

I have many strings, lines from a log, and I am trying to categorize and parse them.

They look something like this:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

Here the filename is the file that holds the log, the date is the date/time the message was written to the log, and TYPE is the type of message. The message itself has two parts separated by a ;. The static part never changes for a given message, and the dynamic part can change (obviously). There can be more ; characters inside the dynamic part.

I want to extract the Static Message and the Dynamic Message.

So far I have been using something like this:

parts = line.split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

Not very pretty. Also, my static part then contains the filename, the date, and the type, which I do not want. So I thought I would do something like this:

parts = " ".join(":".join(line.split(":")[1:]).split(" ")[4:]).split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

I have tried this, and it works to some extent, but sometimes the filename has a space, or the TYPE has a space, or something else isn't working properly, and I sometimes get the TYPE as part of the static message. Efficiency is an issue, since thousands of log lines must be parsed and categorized daily. So I am wondering whether there is a better way to do this than this hack job?

Edit: I thought I would provide more examples of lines in the log. To correct what I said earlier, there are a few types of entries.

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

So as you can see, there are two types of log entries: those without modules and those with. Entries with modules can be tied either to connections or to threads. This makes the parsing harder.


Answer 1:


You can limit the split to the first ';' only:

static, dynamic = line.split(';', 1)
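A quick check of how the limited split behaves when the dynamic part holds more ; characters. The sample line is made up for illustration, and str.partition is shown as an alternative (not part of the original answer) that does not raise when a line has no ; at all:

```python
# Illustrative line; the dynamic part deliberately contains extra ';'.
line = ("/long/file/name/with.dots.and.extension:"
        "Jan 01 12:00:00 TYPE Static Message;Dynamic;with;more")

# maxsplit=1 splits on the first ';' only, so later ';' stay in the dynamic part.
head, dynamic = line.split(";", 1)

# str.partition makes the same cut but always returns three items:
# (text_before, separator, text_after), with empty strings if ';' is absent.
head2, sep, dynamic2 = line.partition(";")

# A line without ';' would make the two-name unpacking of split() fail,
# while partition simply reports an empty separator.
before, sep_missing, after = "no separator here".partition(";")
```

With split, a line that lacks a ; raises a ValueError on unpacking, so partition may be the safer choice if malformed lines can show up in the log.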

Splitting off the static part might take a little more doing, but if you know the number of spaces in the first part is fixed, perhaps the same trick could work there:

static = static.split(' ', 4)[-1]

If the first part of the line is more complex (spaces in the TYPE part) I fear that removing everything before that is going to be a more difficult affair. Your best bet is to figure out the limited set of values TYPE could assume and to use a regular expression with that information to split the static part.
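A minimal sketch of that idea, covering both entry layouts from the question. Everything specific here is an assumption: the KNOWN_TYPES values are placeholders for whatever TYPE can really be, and the module and thread/connection fields are assumed to be single tokens without spaces. The patterns are compiled once so they can be reused across thousands of lines:

```python
import re

# Hypothetical TYPE values; replace with the real set from your logs.
KNOWN_TYPES = ["INFO", "WARNING", "ERROR", "FATAL ERROR"]
# Longest first, so "FATAL ERROR" is tried before "ERROR".
TYPES = "|".join(re.escape(t) for t in sorted(KNOWN_TYPES, key=len, reverse=True))

# Filename (may contain spaces), then ':' and a date like "Jan 01 12:00:00".
PREFIX = r"^(?P<file>.*?):(?P<date>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) "
# Static part stops at the first ';'; the dynamic part takes the rest.
TAIL = r"(?P<static>[^;]*);(?P<dynamic>.*)$"

# Layout 1: TYPE Static;Dynamic
PLAIN = re.compile(PREFIX + r"(?P<type>" + TYPES + r") " + TAIL)
# Layout 2: MODULE TYPE THREAD-OR-CONNECTION Static;Dynamic (assumed single tokens)
WITH_MODULE = re.compile(
    PREFIX + r"(?P<module>\S+) (?P<type>" + TYPES + r") (?P<info>\S+) " + TAIL
)

def parse(line):
    """Return a dict of named fields, or None if the line fits neither layout."""
    match = PLAIN.match(line) or WITH_MODULE.match(line)
    return match.groupdict() if match else None

plain = parse("/long/file/name/with.dots.and.extension:"
              "Jan 01 12:00:00 FATAL ERROR Disk full;/dev/sda1;95%")
modular = parse("/long/file name.log:Jan 01 12:00:00 "
                "net.core ERROR conn.42 Connection reset;peer=10.0.0.1;retry")
```

Because the TYPE alternatives are listed explicitly, a TYPE containing a space no longer leaks into the static message, and the date pattern anchors the end of the filename even when the filename has spaces.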




Answer 2:


You could try something like:

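>>> import re
>>> line = "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message"
>>> regexp = re.compile(r"^([/.\w]*):(\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s([A-Z]*)\s([\w\s]*);([\w\s]*)$")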
>>> regexp.match(line).groups()
('/long/file/name/with.dots.and.extension', 'Jan 01 12:00:00', 'TYPE', 'Static Message', 'Dynamic Message')
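One thing to watch with this pattern: the last group only accepts word characters and whitespace, so a dynamic part with punctuation or further ; characters will not match. The sketch below compares the pattern with a relaxed variant whose final group is (.*); the sample line is invented for illustration, and the variant still handles only the first entry layout:

```python
import re

# The pattern from the answer, written as a raw string.
strict = re.compile(
    r"^([/.\w]*):(\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s([A-Z]*)\s([\w\s]*);([\w\s]*)$"
)
# Same pattern, except the last group accepts any characters, including ';'.
relaxed = re.compile(
    r"^([/.\w]*):(\w{3}\s\d{2}\s\d{2}:\d{2}:\d{2})\s([A-Z]*)\s([\w\s]*);(.*)$"
)

# Illustrative line whose dynamic part contains '=' and another ';'.
line = ("/long/file/name/with.dots.and.extension:"
        "Jan 01 12:00:00 TYPE Static Message;key=value;more")

strict_match = strict.match(line)              # no match for this line
relaxed_groups = relaxed.match(line).groups()  # five captured fields
```

Here strict_match is None, while relaxed_groups ends with 'Static Message' and 'key=value;more'.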


Source: https://stackoverflow.com/questions/12247651/very-specific-substring-retrieval-and-split
