I need to save data in a table (for reporting, stats, etc.) so that a user can search by time, user agent, and so on. I have a script that runs every day that reads the Apache log and
Having seen (and written) so many erroneous log parsers, here is a hopefully valid regex. I tested it on 50k lines of logs without a single diff, with one caveat:
It's hard to distinguish between the referrer and the user agent. Let's just hope the " " between the two is discriminating enough. Yet the infamous " " can also show up inside the referrer or inside the user agent, so basically, we're screwed here.
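To make the ambiguity concrete, here is a minimal PHP sketch. The log tail and both patterns are made up for illustration (they are not the full regex), and the only difference between them is whether the referrer group is greedy or lazy:

```php
<?php
// Hypothetical tail of a log line: three quoted chunks, so there are two
// places where the referrer could end and the user agent could begin.
$tail = '"http://example.com/a" "b" "Mozilla/5.0"';

// Greedy referrer: the split lands on the LAST quote-space-quote.
preg_match('/"(?P<referrer>.*)" "(?P<user_agent>.*)"$/', $tail, $greedy);

// Lazy referrer: the split lands on the FIRST quote-space-quote.
preg_match('/"(?P<referrer>.*?)" "(?P<user_agent>.*)"$/', $tail, $lazy);

printf("greedy: referrer=[%s] user_agent=[%s]\n", $greedy['referrer'], $greedy['user_agent']);
printf("lazy:   referrer=[%s] user_agent=[%s]\n", $lazy['referrer'], $lazy['user_agent']);
```

Both splits are consistent with the raw text; nothing in the line itself says which one is right.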
// The group names are illustrative labels; rename them to suit.
$ncsa_re = '/^(?P<host>\S+)
\ (?P<ident>\S+) # Usually just -
\ (?P<user>.*?) # Spaces are allowed here, can be empty.
\ (?P<date>\[[^]]+\])
\ "(?P<request>.+\ .+)" # At least one space: HTTP 0.9
\ (?P<status>[0-9]+) # Status code is _always_ an integer
\ (?P<size>(?:[0-9]+|-)) # Response size can be -
\ "(?P<referrer>.*)" # Referrer can contain anything: it is just a header
\ "(?P<user_agent>.*)"$/x';
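A minimal usage sketch, assuming PHP 7.1 or later. The group names, the helper parse_ncsa_line(), and the sample line are assumptions made for illustration; the pattern is repeated with group names filled in only so that the snippet runs on its own:

```php
<?php
// Same structure as the regex above; the group names are assumed labels.
$ncsa_re = '/^(?P<host>\S+)
\ (?P<ident>\S+)
\ (?P<user>.*?)
\ (?P<date>\[[^]]+\])
\ "(?P<request>.+\ .+)"
\ (?P<status>[0-9]+)
\ (?P<size>(?:[0-9]+|-))
\ "(?P<referrer>.*)"
\ "(?P<user_agent>.*)"$/x';

/**
 * Hypothetical helper: returns the named fields of one log line,
 * or null when the line does not match the pattern.
 */
function parse_ncsa_line(string $line, string $re): ?array
{
    // Strip the trailing newline so that $ anchors on the closing quote.
    if (preg_match($re, rtrim($line, "\r\n"), $m) !== 1) {
        return null;
    }
    // preg_match returns numeric and named keys; keep the named ones only.
    return array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY);
}

// Sample line in the combined log format.
$line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
      . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
      . '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"';

$row = parse_ncsa_line($line, $ncsa_re);
if ($row === null) {
    echo "no match\n";
} else {
    printf("%s %s %s\n", $row['host'], $row['status'], $row['user_agent']);
}
```

Returning null for lines that do not match lets the daily script count or log them separately instead of inserting half-parsed rows into the table.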
Hope that helps.