Moving SQL logic to backend - bash

Question

Part of our SQL logic is moving to the backend, and I need to generate a report using shell scripting. To make it easier to follow, I've simplified the problem as follows.

My input file - sales.txt (id, price, month)

101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10

The output should cover a 6-month window for each id, e.g. from t1=2020-07 to t2=2020-12:

101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12

For id 101, although there is no entry for 2020-07, the value should be taken from the most recent earlier month that is available in the sales file. So the price=50 from 2019-10 is used for 2020-07.

For 201, the first entry itself is from 2020-09, so 2020-08 and 2020-07 are not applicable. Wherever there are gaps the immediate previous month value should be propagated.

I'm trying to solve this with awk. I've created a reusable script, util.awk (shown below), to generate the missing values; the plan is to pipe its output to sort and then run util.awk again to produce the final output.

util.awk

function get_month(a,b,t1) { return strftime("%Y%m",mktime(a " " b t1)) } 
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss ; OFS="," ; x=1 } 
{ 
  tsc=get_month($3,$4,ts1);
  if ( NR>1 && $1==idp )
  {
  if( tsc == tsp) { print $1,$2,get_month($3,$4,ts1); x=0  }
  else { for(i=tsp; i < tsc; i=get_month(j1,j2,i) )  
         { 
       j1=substr(i,1,4); j2=substr(i,5,2); 
           print $1,tpr,i;
         }
       }
   }

  tsp=get_month($3,$4,ts2);  
  idp=$1;
  tpr=$2;
  if(x!=0) print $1,$2,tsc
  x=1;
  
}

But it runs forever when invoked like this:

awk -F"[,-]" -f util.awk sales.txt

Although I tried awk, I welcome other answers that work in a bash environment.

Answer 1:

General plan:

- assumption: sales.txt is already sorted (numerically) by the first column
- the user provides the min->max date range to be displayed (awk variables mindt and maxdt)
- for each distinct id value we load all prices and dates into an array (prices[])
- dates are used as the indices of the associative array that stores the prices (prices[YYYY-MM])
- once we've read all records for a given id:
  - sort the prices[] array by its indices (ie, sort by YYYY-MM)
  - find the price for the latest date earlier than mindt (save as prevprice)
  - for each date between mindt and maxdt (inclusive), if we have a price then display it (and save it as prevprice) else ...
  - if we don't have a price but we do have a prevprice then use this prevprice as the current date's price (ie, fill the gap with the previous price)
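The first assumption in the plan (input ordered by id) can be met by pre-sorting the file. The following is only a sketch: sort_sales is a hypothetical helper name, and the secondary sort on month is optional here because the plan sorts each id's months itself.

```shell
# Hypothetical helper (not part of the answer): order records by id
# numerically, then by month. YYYY-MM strings sort correctly as plain text.
sort_sales() {
    sort -t, -k1,1n -k3,3 "$@"
}

# Example with shuffled input on stdin
printf '%s\n' '201,350,2020-10' '101,80,2020-08' '201,100,2020-09' '101,50,2019-10' | sort_sales
# 101,50,2019-10
# 101,80,2020-08
# 201,100,2020-09
# 201,350,2020-10
```

The helper also accepts a file name, e.g. sort_sales sales.txt.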

One (GNU) awk idea:

mindate='2020-07'
maxdate='2020-12'

awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' ' 

# function to add "months" (number) to "indate" (YYYY-MM)

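function add_month(indate,months,    arr,dhms) {       # arr and dhms are local variables

    dhms="1 0 0 0"                                     # default day/hr/min/secs
    split(indate,arr,"-")

    # mktime() normalizes out-of-range months (eg, month 13 becomes January of the next year)
    return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}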

# function to print the list of prices for a given "id"

function print_id(id) {

    if ( length(prices) == 0 )                         # if prices array is empty then do nothing (ie, return)
       return

    PROCINFO["sorted_in"]="@ind_str_asc"               # sort prices[] array by index in ascending order

    for ( i in prices )                                # loop through indices (YYYY-MM)
        { if ( i < mindt )                             # as long as less than mindt
             prevprice=prices[i]                       # save the price
          else
             break                                     # no more pre-mindt indices to process
        }

    for ( i=mindt ; i<=maxdt ; i=add_month(i,1) )     # for our mindt - maxdt range
        { if ( !(i in prices) && prevprice != "" )    # if no entry in prices[], but we have a prevprice (even a price of 0), then ...
             prices[i]=prevprice                      # set prices[] to prevprice (ie, fill the gap)

          if ( i in prices )                          # if we have an entry in prices[] then ...
             { prevprice=prices[i]                    # update prevprice (for filling future gap) and ...
               print id,prices[i],i                   # print our data to stdout
             }
        }
}

BEGIN { split("",prices) }                             # pre-declare prices as an array

previd != $1 { print_id(previd)                        # when id changes print the prices[] array, then ...
               previd=$1                               # reset some variables for processing of the next id and ...
               prevprice=""
               delete prices                           # delete the prices[] array
             }

             { prices[$3]=$2 }                         # for the current record create an entry in prices[]

END   { print_id(previd) }                             # flush the last set of prices[] to stdout
' sales.txt

NOTE: This assumes sales.txt is sorted (numerically) by the first field; if this is not true then the last line should be changed to ' <(sort -n sales.txt)

This generates:

101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
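The add_month function above relies on GNU awk's mktime() and strftime(). Where only a POSIX awk is available, the same month arithmetic could be done with integers instead. A minimal sketch, assuming months are always given as YYYY-MM; add_months is a hypothetical helper name, not part of the answer:

```shell
# Hypothetical helper: add N months (N may be negative) to a YYYY-MM value
# using plain integer arithmetic, so no GNU-specific time functions are needed.
add_months() {
    awk -v indate="$1" -v months="$2" 'BEGIN {
        split(indate, parts, "-")
        total = parts[1] * 12 + (parts[2] - 1) + months   # months counted from year 0
        printf "%04d-%02d\n", int(total / 12), total % 12 + 1
    }'
}

add_months 2020-12 1    # prints 2021-01
add_months 2020-07 5    # prints 2020-12
```

Inside a larger awk program the same two lines of arithmetic could replace the mktime()/strftime() call.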

Answer 2:

I hope I understood your question correctly. The following awk should do the trick:

$ awk -v t1="2020-07" -v d="6" '
     function next_month(ym,a) {            # a is a local scratch array
         split(ym,a,"-")
         if (a[2]==12) { a[1]++; a[2]=1 } else a[2]++
         return sprintf("%04d-%02d",a[1],a[2])
     } 
     BEGIN{FS=OFS=",";t2=t1; for(i=1;i<=d;++i) t2=next_month(t2)}
     {k[$1]}
     ($3<t1){a[$1,t1]=$2}
     (t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }
     END{ for (key in k) {
            p=""; t=t1; 
            for(i=1;i<=d;++i) { 
               if(p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
               t=next_month(t)
            }
          }
     }' sales.txt

The function next_month computes the month that follows a given YYYY-MM value. In the BEGIN block we use it to compute the end of the reporting period from the duration of d months. The period of interest is t1 <= t < t2.

Every time we read a record/line, we record its key (the id) in the array k. This way we know which keys have been seen so far.

For all records dated before the period of interest, we store the value in the array a under the index (key,t1), so the most recent earlier price becomes the starting value for t1. For records inside the period, we store the value in the array a under the index (key,$3).
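To make the storage rule concrete, the sketch below loads a few records and dumps the resulting array. It assumes each id's records arrive in ascending month order (as in sales.txt), since a later line overwrites an earlier one. load_window is a hypothetical helper name, and the keys are joined with a comma rather than awk's SUBSEP purely so they print readably.

```shell
# Hypothetical helper: show what ends up in the array for a window [t1, t2).
# $1 = window start t1, $2 = window end t2 (exclusive); reads id,price,month on stdin.
load_window() {
    awk -F, -v t1="$1" -v t2="$2" '
        $3 < t1             { a[$1 "," t1] = $2 }   # older record: carried in as the value for t1
        t1 <= $3 && $3 < t2 { a[$1 "," $3] = $2 }   # in-window record: stored under its own month
        END { for (k in a) print k "=" a[k] }
    ' | sort
}

printf '%s\n' '101,50,2019-10' '101,80,2020-08' '201,100,2020-09' | load_window 2020-07 2021-01
# 101,2020-07=50
# 101,2020-08=80
# 201,2020-09=100
```

The 2019-10 price for id 101 shows up under 2020-07, which is how the gap at the start of the window gets filled.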

When the file is fully processed, we loop over all keys and print the output. A small amount of logic checks whether each month was listed in the original file; if it was not, the previous value p is printed instead.

Note: within each key the output is sorted by time, but the keys themselves will not necessarily appear in the same order as in the original file.
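If the original key order matters, one option is to remember the order in which keys are first seen and loop over that list instead of using for (key in k). A minimal sketch of that idea; ids_in_input_order, seen and order are hypothetical names, not part of the answer:

```shell
# Hypothetical helper: print each distinct id once, in first-seen order.
ids_in_input_order() {
    awk -F, '!($1 in seen) { seen[$1]; order[++n] = $1 }   # first time we meet this id
             END { for (i = 1; i <= n; i++) print order[i] }'
}

printf '%s\n' '201,100,2020-09' '101,50,2019-10' '201,350,2020-10' | ids_in_input_order
# 201
# 101
```

In the answer's script, the END block would then iterate over order[1..n] rather than over the keys of k.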

Source: https://stackoverflow.com/questions/65345478/moving-sql-logic-to-backend-bash

Tags

bash

shell

perl

awk