Question
Some SQL logic is being moved to the backend, and I need to generate a report using shell scripting. To make it easier to follow, I've simplified the problem as follows.
My input file is sales.txt (id, price, month):
101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10
The output should cover a 6-month window for each id, e.g. t1=2020-07 and t2=2020-12:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
For id 101, although there is no entry for 2020-07, the price should be taken from the most recent earlier month that is available in the sales file.
So the price=50 from 2019-10 is used for 2020-07.
For id 201, the first entry is from 2020-09, so 2020-07 and 2020-08 are not applicable.
Wherever there are gaps, the value from the immediately preceding month should be propagated.
I'm trying to solve this with awk. I'm creating a reusable script, util.awk (shown below), to generate the missing values, pipe them to the sort command, and then run util.awk again for the final output.
util.awk
function get_month(a,b,t1) { return strftime("%Y%m",mktime(a " " b t1)) }
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss ; OFS="," ; x=1 }
{
tsc=get_month($3,$4,ts1);
if ( NR>1 && $1==idp )
{
if( tsc == tsp) { print $1,$2,get_month($3,$4,ts1); x=0 }
else { for(i=tsp; i < tsc; i=get_month(j1,j2,i) )
{
j1=substr(i,1,4); j2=substr(i,5,2);
print $1,tpr,i;
}
}
}
tsp=get_month($3,$4,ts2);
idp=$1;
tpr=$2;
if(x!=0) print $1,$2,tsc
x=1;
}
But it runs forever when invoked as: awk -F"[,-]" -f util.awk sales.txt
Although I tried awk, I also welcome other answers that work in a bash environment.
Answer 1:
General plan:
- assumption: sales.txt is already sorted (numerically) by the first column
- user provides the min->max date range to be displayed (awk variables mindt and maxdt)
- for a distinct id value we'll load all prices and dates into an array (prices[])
- dates will be used as the indices of an associative array to store prices (prices[YYYY-MM])
- once we've read all records for a given id ...
  - sort the prices[] array by the indices (ie, sort by YYYY-MM)
  - find the price for the max date less than mindt (save as prevprice)
  - for each date between mindt and maxdt (inclusive), if we have a price then display it (and save as prevprice) else ...
  - if we don't have a price but we do have a prevprice then use this prevprice as the current date's price (ie, fill the gap with the previous price)
One (GNU) awk idea:
mindate='2020-07'
maxdate='2020-12'
awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' '
# function to add "months" (number) to "indate" (YYYY-MM)
function add_month(indate,months) {
    dhms="1 0 0 0"                              # default day/hr/min/secs
    split(indate,arr,"-")                       # arr[1]=year, arr[2]=month
    return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}
# function to print the list of prices for a given "id"
function print_id(id) {
    if ( length(prices) == 0 )                  # if prices array is empty then do nothing (ie, return)
        return
    PROCINFO["sorted_in"]="@ind_str_asc"        # sort prices[] array by index in ascending order
    for ( i in prices )                         # loop through indices (YYYY-MM)
        { if ( i < mindt )                      # as long as less than mindt
              prevprice=prices[i]               # save the price
          else
              break                             # no more pre-mindt indices to process
        }
    for ( i=mindt ; i<=maxdt ; i=add_month(i,1) )      # for our mindt - maxdt range
        { if ( !(i in prices) && prevprice != "" )     # if no entry in prices[], but we have a prevprice, then ...
              prices[i]=prevprice               # set prices[] to prevprice (ie, fill the gap)
          if ( i in prices )                    # if we have an entry in prices[] then ...
             { prevprice=prices[i]              # update prevprice (for filling future gap) and ...
               print id,prices[i],i             # print our data to stdout
             }
        }
}
BEGIN { split("",prices) }                      # pre-declare prices as an array
previd != $1 { print_id(previd)                 # when id changes print the prices[] array, then ...
               previd=$1                        # reset some variables for processing of the next id and ...
               prevprice=""
               delete prices                    # delete the prices[] array
             }
{ prices[$3]=$2 }                               # for the current record create an entry in prices[]
END { print_id(previd) }                        # flush the last set of prices[] to stdout
' sales.txt
NOTE: This assumes sales.txt is sorted (numerically) by the first field; if it is not, replace the last line of the command with: ' <(sort -n sales.txt)
This generates:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
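The add_month function above relies on gawk's mktime and strftime. Where only a POSIX awk is available, the same month stepping could probably be done with plain integer arithmetic instead. The following is a sketch of that alternative, not part of the original answer; the function name add_month_portable is made up for illustration.

```shell
months=$(awk '
# Add n months to a YYYY-MM string using integer arithmetic only.
# "parts" and "total" are extra parameters so that they stay local.
function add_month_portable(ym, n,    parts, total) {
    split(ym, parts, "-")
    total = parts[1] * 12 + (parts[2] - 1) + n     # months counted from year 0
    return sprintf("%04d-%02d", int(total / 12), total % 12 + 1)
}
BEGIN {
    m = "2020-11"
    for (k = 0; k < 3; k++) {                      # step across the year boundary
        printf "%s%s", (k ? " " : ""), m
        m = add_month_portable(m, 1)
    }
    print ""
}')
echo "$months"
```

Because the result is still a zero-padded YYYY-MM string, it could be dropped into the for loop in print_id without changing the string comparison against maxdt.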
Answer 2:
I hope I understood your question correctly. The following awk script should do the trick:
awk -v t1="2020-07" -v d="6" '
function next_month(d,a) {
    split(d,a,"-")
    if (a[2]==12) { a[1]++; a[2]=1 } else a[2]++   # roll over into January of the next year
    return sprintf("%0.4d-%0.2d",a[1],a[2])
}
BEGIN { FS=OFS=","; t2=t1; for(i=1;i<=d;++i) t2=next_month(t2) }
{ k[$1] }                                          # remember every id seen
($3<t1) { a[$1,t1]=$2 }                            # pre-window price seeds the first month
(t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }              # in-window price keeps its own month
END {
    for (key in k) {
        p=""; t=t1
        for(i=1;i<=d;++i) {
            if(p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
            t=next_month(t)
        }
    }
}' sales.txt
We implement a straightforward function, next_month, that computes the month following a given YYYY-MM value. In the BEGIN block, we use the duration of d months to compute the end of the time period to be shown. The time period of interest is t1 <= t < t2.
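To see the half-open window in isolation, here is a small sketch that computes t2 from t1 and d and then tests two months against the window. The helper in_window is made up for this illustration, and next_month is rewritten with an explicit if/else, so treat it as an approximation of the answer's logic rather than a copy.

```shell
window=$(awk -v t1="2020-07" -v d=6 '
# Same stepping idea as next_month in the answer, written with an explicit if/else.
function next_month(ym,    a) {
    split(ym, a, "-")
    if (a[2] == 12) { a[1]++; a[2] = 1 } else a[2]++
    return sprintf("%04d-%02d", a[1], a[2])
}
# Plain string comparison is enough because YYYY-MM sorts chronologically.
function in_window(ym) { return (t1 <= ym && ym < t2) ? "in" : "out" }
BEGIN {
    t2 = t1
    for (i = 1; i <= d; i++) t2 = next_month(t2)   # t2 is the exclusive end
    print "t2=" t2, "2020-12:" in_window("2020-12"), "2021-01:" in_window("2021-01")
}')
echo "$window"
```

With t1=2020-07 and d=6 the exclusive end is 2021-01, so 2020-12 is the last month reported.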
Every time we read a record, we note its key (the id in the first field) in the array k. This way we know which keys have been seen so far.
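For all dates before the time period of interest, we store the value in the array a under the index (key,t1), so each earlier record overwrites the previous one and the last one read wins (this relies on the records for a key being in chronological order). For all dates inside the period, we store the value in a under the index (key,$3).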
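The seeding step described above can be tried on its own. This is a minimal sketch with a reduced input; the 2019-05 row is invented for the example, and the upper bound t2 is left out to keep it short.

```shell
seed=$(printf '%s\n' '101,40,2019-05' '101,50,2019-10' '101,80,2020-08' |
awk -F, -v t1="2020-07" '
$3 <  t1 { a[$1, t1] = $2 }     # every pre-window record overwrites the same slot
$3 >= t1 { a[$1, $3] = $2 }     # in-window records keep their own month (no t2 check here)
END { print a["101", t1] }      # price that seeds the first month of the window
')
echo "$seed"
```

The value printed is the price of the latest pre-window record, here 50 from 2019-10, which only holds if the input lists each id's records oldest first.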
When the file is fully processed, we loop over all keys and print the output. A bit of logic checks whether each month was listed in the original file; if it was not, the last known price p is printed instead.
Note: within each key the output is sorted by time, but the keys will not necessarily appear in the same order as in the original file.
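If the original key order matters, one possible workaround is to record each id the first time it appears and loop over that list numerically instead of using for (key in k). This is a sketch of the idea only; the arrays seen and order are names chosen here, and the input rows are rearranged sample data.

```shell
ids=$(printf '%s\n' '201,100,2020-09' '101,50,2019-10' '201,350,2020-10' '101,80,2020-08' |
awk -F, '
!($1 in seen) { seen[$1]; order[++n] = $1 }    # remember each id the first time it appears
END {
    for (j = 1; j <= n; j++)                   # numeric loop, so ids come out in first-seen order
        printf "%s%s", (j > 1 ? " " : ""), order[j]
    print ""
}')
echo "$ids"
```

In the END block of the answer, the outer loop would then iterate over order[1] .. order[n] and use order[j] as the key.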
Source: https://stackoverflow.com/questions/65345478/moving-sql-logic-to-backend-bash