How to select date range in awk

别说谁变了你拦得住时间么 submitted on 2019-12-08 04:03:55

Question


We are building a utility that connects over ssh to different servers, collects all the error logs, and sends them to the teams concerned. The utility will cat the log file and filter it using awk, e.g.:

cat /app1/apache/tomcat7/logs/catalina.out | awk '$0>=from&&$0<=to' from="2019-02-01 12:00" to="2019-11-19 04:50"

We save the date of the last load in the database and use it as the from date in the next run.

Problem

The awk date range filter seems to work only with the yyyy-mm-dd HH:MM date format. Our log files use different date formats, e.g.:

EEE MMM dd yy HH:mm
EEE MMM dd HH:mm
yyyy-MM-dd hh:mm
dd MMM yyyy HH:mm:ss

Question

How can we write an awk date filter that works with any date format used in the log files?

We cannot use perl/python on the servers. The requirement is to use only cat/awk/grep for this.

Sample Input:

Sat Nov 02 13:07:48.005 2019 NA for id 536870914 in form Request
Tue Nov 05 13:07:48.009 2019 NA for id 536870914 in form Request
Sun Nov 10 16:29:22.122 2019 ERROR (1587): Unknown field ;  at position 177 (category)
Mon Nov 11 16:29:22.125 2019 ERROR (1587): Unknown field ;  at position 174 (category)
Tue Nov 12 07:59:48.751 2019 ERROR (1587): Unknown field ;  at position 177 (category)
Thu Nov 14 10:07:41.792 2019 ERROR (1587): Unknown field ;  at position 177 (category)
Sun Nov 17 08:45:22.210 2019 ERROR (1587): Unknown field ;  at position 174 (category)

Command and filter:

cat error.log |awk '$0>=from&&$0<=to' from="Nov 16 10:58" to="Nov 19 04:50"

Expected output:

Sun Nov 17 08:45:22.210 2019 ERROR (1587): Unknown field ;  at position 174 (category)

Answer 1:


The answer is that awk has no knowledge of what a date is. Awk knows numbers and strings and can only compare those. So when you want to select dates and times, you have to ensure that the date format you compare is sortable, and there are many formats out there:

| type       | example                   | sortable |
|------------+---------------------------+----------|
| ISO-8601   | 2019-11-19T10:05:15       | string   |
| RFC-2822   | Tue, 19 Nov 2019 10:05:15 | not      |
| RFC-3339   | 2019-11-19 10:05:15       | string   |
| Unix epoch | 1574157915                | numeric  |
| AM/PM      | 2019-11-19 10:05:15 am    | not      |
| MM/DD/YYYY | 11/19/2019 10:05:15       | not      |
| DD/MM/YYYY | 19/11/2019 10:05:15       | not      |
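
To make the "sortable" column concrete, here is a minimal sketch of what awk does when it compares two timestamps as strings. The helper name cmp_dates is hypothetical and exists only for this illustration.

```shell
# Compare two timestamps as strings in awk, the way `$0 >= from` does.
# cmp_dates is a hypothetical helper name used only for this illustration.
cmp_dates() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # appending "" forces a string (lexical) comparison
        if ((a "") < (b "")) print "before"
        else if ((a "") > (b "")) print "after"
        else print "equal"
    }'
}

# Sortable format: string order matches chronological order
cmp_dates "2019-02-01 12:00" "2019-11-19 04:50"           # before

# Non-sortable format: the comparison is decided by "S" versus "N",
# although Nov 02 is chronologically earlier than Nov 16
cmp_dates "Sat Nov 02 13:07:48.005 2019" "Nov 16 10:58"   # after
```

This is why the filter from the question silently selects the wrong lines for every format whose leading characters are not ordered from most to least significant.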

So you would have to convert your non-sortable formats into a sortable format, mainly using string manipulations. A template awk program that would achieve what you want is written down here:

# function to convert a string into a sortable format
function convert_date(str) {
    return sortable_date
}
# function to extract the date from the record
function extract_date(str) {
    return extracted_date
}
# convert the range
(FNR==1) { t1 = convert_date(begin); t2 = convert_date(end) }
# extract the date from the record
{ date_string = extract_date($0) }
# convert the date of the record
{ t = convert_date(date_string) }
# make the selection
(t1 <= t && t < t2) { print }

Most of the time, this program can be reduced considerably. If the above is stored in extract_date_range.awk, you could run it as:

$ awk -f extract_date_range.awk begin="date-in-known-format" end="date-in-known-format" logfile

Note: the above assumes single-line log entries. With a minor adaptation, you can process multi-line log entries.
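
One possible form of that multi-line adaptation is sketched below. It assumes that every log entry starts with a yyyy-MM-dd hh:mm timestamp and that continuation lines (stack traces, for example) do not. The wrapper name filter_multiline and the variable keep are hypothetical.

```shell
# filter_multiline is a hypothetical wrapper; begin/end use "yyyy-MM-dd hh:mm".
# usage: filter_multiline "2019-11-16 10:58" "2019-11-19 04:50" logfile
filter_multiline() {
    awk -v begin="$1" -v end="$2" '
    # A line starting with a timestamp opens a new log entry
    /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]/ {
        t = substr($0, 1, 16)              # "yyyy-MM-dd hh:mm", sortable as a string
        keep = (begin <= t && t < end)     # remember the decision for this entry
    }
    # Continuation lines carry no timestamp and inherit the last decision
    keep { print }
    ' "$3"
}
```

Lines that appear before the first timestamped line are dropped, because keep is still unset at that point.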


In the original problem, the following formats were presented:

EEE MMM dd yy HH:mm         # not sortable
EEE MMM dd HH:mm            # not sortable
yyyy-MM-dd hh:mm            # sortable
dd MMM yyyy HH:mm:ss        # not sortable

From the above, all but the second format can be easily converted to a sortable format. The second format lacks the year, so we would have to infer it with an elaborate check based on the day of the week. This is extremely difficult and never 100% bulletproof.
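
If the year is known from somewhere else (the file name or the rotation date, for example), the year-less format becomes convertible after all. The sketch below rests entirely on that assumption: the caller supplies the year, and the log does not span a New Year. The helper name convert_fmt2 is hypothetical.

```shell
# convert_fmt2 is a hypothetical helper: it turns "EEE MMM dd HH:mm" into
# YYYYMMDDhhmm00 using a year supplied by the caller (an assumption, since
# the log line itself does not carry one).
convert_fmt2() {
    awk -v str="$1" -v year="$2" 'BEGIN {
        split(str, a, " ")                  # a[1]=EEE a[2]=MMM a[3]=dd a[4]=HH:mm
        # position of the month name: 1, 4, 7, ... maps to 1, 2, 3, ...
        mm = (index("JanFebMarAprMayJunJulAugSepOctNovDec", a[2]) + 2) / 3
        t = a[4]; gsub(/[^0-9]/, "", t)     # "16:29" -> "1629"
        printf "%04d%02d%02d%s00\n", year, mm, a[3], t
    }'
}

convert_fmt2 "Sun Nov 10 16:29" 2019    # 20191110162900
```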

Excluding the second format, we can write the following functions:

BEGIN {
    # [A-Za-z] is used because [a-Z] is not a valid range expression
    datefmt1="^[A-Za-z][A-Za-z][A-Za-z] [A-Za-z][A-Za-z][A-Za-z] [0-9][0-9] [0-9][0-9] [0-9][0-9]:[0-9][0-9]"
    datefmt3="^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9][0-9]:[0-9][0-9]"
    datefmt4="^[0-9][0-9] [A-Za-z][A-Za-z][A-Za-z] [0-9][0-9][0-9][0-9] [0-9][0-9]:[0-9][0-9]:[0-9][0-9]"
}
# convert the range
(FNR==1) { t1 = convert_date(begin); t2 = convert_date(end) }
# extract the date from the record
{ date_string = extract_date($0) }
# skip if date string is empty
(date_string == "") { next }
# convert the date of the record
{ t = convert_date(date_string) }
# make the selection
(t1 <= t && t < t2) { print }

# function to extract the date from the record
# note: match() takes the string first and the regex second
function extract_date(str,    date_string) {
    date_string=""
    if (match(str,datefmt1)) { date_string=substr(str,RSTART,RLENGTH) }
    else if (match(str,datefmt3)) { date_string=substr(str,RSTART,RLENGTH) }
    else if (match(str,datefmt4)) { date_string=substr(str,RSTART,RLENGTH) }
    return date_string
}
# function to convert a string into a sortable format
# converts it in the format YYYYMMDDhhmmss
function convert_date(str, a,fmt, YYYY,MM,DD,T, sortable_date) {
    sortable_date=""
    if (match(str,datefmt1)) {
        split(str,a,"[ ]")
        # two-digit year: 00-69 -> 20xx, 70-99 -> 19xx
        YYYY=(a[4] < 70 ? "20" : "19")a[4]
        MM=get_month(a[2]); DD=a[3]
        # strip the colon from HH:mm and append the missing seconds
        T=a[5]; gsub(/[^0-9]/,"",T); T=T"00"
        sortable_date = YYYY MM DD T
    }
    else if (match(str,datefmt3)) {
        # keep only the matched part, append seconds, then strip separators
        sortable_date = substr(str,RSTART,RLENGTH)"00"
        gsub(/[^0-9]/,"",sortable_date)
    }
    else if (match(str,datefmt4)) {
        split(str,a,"[ ]")
        YYYY=a[3]
        MM=get_month(a[2]); DD=a[1]
        # HH:mm:ss already contains the seconds
        T=a[4]; gsub(/[^0-9]/,"",T)
        sortable_date = YYYY MM DD T
    }
    return sortable_date
}
# function to convert Jan->01, Feb->02, Mar->03 ... Dec->12
function get_month(str) {
   return sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",str)+2)/3)
}
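
The sample input in the question uses yet another layout, EEE MMM dd HH:mm:ss.SSS yyyy, which none of the patterns above cover. Because that layout keeps the timestamp in the first five whitespace-separated fields, a sortable key can be built from the fields directly. The following is only a sketch under that assumption; the wrapper name filter_sample and the YYYYMMDDhhmm form of the range arguments are choices made for this example, not part of the original answer.

```shell
# filter_sample is a hypothetical wrapper for the sample log layout
#   EEE MMM dd HH:mm:ss.SSS yyyy <message>
# begin and end are given as YYYYMMDDhhmm (an assumed input format).
# usage: filter_sample 201911161058 201911190450 error.log
filter_sample() {
    awk -v begin="$1" -v end="$2" '
    # only lines whose leading fields look like the sample timestamp
    $2 ~ /^[A-Z][a-z][a-z]$/ && $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ && $5 ~ /^[0-9][0-9][0-9][0-9]$/ {
        # position of the month name: 1, 4, 7, ... maps to 1, 2, 3, ...
        mm = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $2) + 2) / 3
        hhmm = substr($4, 1, 2) substr($4, 4, 2)
        t = sprintf("%04d%02d%02d%s", $5, mm, $3, hhmm)   # YYYYMMDDhhmm
        # fixed-width digit strings compare correctly as strings
        if ((begin "") <= t && t < (end "")) print
    }
    ' "$3"
}
```

Run against the sample input with the range from the question (Nov 16 10:58 to Nov 19 04:50, written as 201911161058 and 201911190450), this prints only the Sun Nov 17 line. Lines that do not start with such a timestamp are skipped.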




Answer 2:


While technically you can invoke date from awk, this approach will be of limited help:

  • Calling date (or another program) from awk is expensive, because a new process is started for each call. If the log files are large, processing will be slow.
  • It looks like you are after a 'one-liner' that can be executed on the remote server. Handling multiple formats will require more than a one-liner.
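
For reference, invoking date from awk could look like the sketch below. It depends on the -d option, which is a GNU extension and may be missing on other systems, and it sits outside the cat/awk/grep constraint from the question. The names to_epoch_lines, to_epoch and cache are hypothetical. Caching results per distinct date string reduces the process-start cost mentioned above but does not remove it.

```shell
# to_epoch_lines is a hypothetical helper: it reads one date per line and
# prints its Unix epoch. It relies on `date -d`, a GNU extension.
to_epoch_lines() {
    awk '
    function to_epoch(s,    cmd, e) {
        if (s in cache) return cache[s]       # avoid starting a process twice
        cmd = "date -u -d \"" s "\" +%s"      # assumes s contains no quotes
        e = ""
        cmd | getline e                       # read the single output line
        close(cmd)
        return cache[s] = e
    }
    { print to_epoch($0) }
    '
}

echo "Tue, 19 Nov 2019 10:05:15" | to_epoch_lines    # 1574157915 with GNU date
```

Since the date string is pasted into a shell command, this should only be used on input that is trusted.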

Consider lifting those constraints in one (or more) of the following ways:

  • Transfer the complete log file to a machine capable of running the filter locally, with support for the multiple date formats.
  • Send a more complex script to each remote server and perform the scanning there. This will require slightly more setup, but will eliminate the need to transfer complete log files over ssh.
  • Customize the log files. Catalina, Apache, etc. allow you to control the date format. Make all of them produce YYYY-MM-DD HH:MM, or similar.


Source: https://stackoverflow.com/questions/58925993/how-to-select-date-range-in-awk
