问题
One of the sql logic is moving to backend and I need to generate a report using shell scripting. For understanding, I'm making it simple as follows.
My input file - sales.txt (id, price, month)
101,50,2019-10
101,80,2020-08
101,80,2020-10
201,100,2020-09
201,350,2020-10
The output should be for 6 months window for each id e.g t1=2020-07 and t2=2020-12
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
For id 101
, though there is no entry for 2020-07, it should take from the immediate previous month value that is available in the sales file.
So the price=50 from 2019-10 is used for 2020-07.
For 201
, the first entry itself is from 2020-09, so 2020-08 and 2020-07 are not applicable.
Wherever there are gaps the immediate previous month value should be propagated.
I'm trying to use awk to solve this problem, I'm creating a reusable script util.awk like below to generate the missing values, pipe it to sort command and then again use the util.awk for final output.
util.awk
function get_month(a,b,t1) { return strftime("%Y%m",mktime(a " " b t1)) }
BEGIN { ss=" 0 0 0 "; ts1=" 1 " ss; ts2=" 35 " ss ; OFS="," ; x=1 }
{
tsc=get_month($3,$4,ts1);
if ( NR>1 && $1==idp )
{
if( tsc == tsp) { print $1,$2,get_month($3,$4,ts1); x=0 }
else { for(i=tsp; i < tsc; i=get_month(j1,j2,i) )
{
j1=substr(i,1,4); j2=substr(i,5,2);
print $1,tpr,i;
}
}
}
tsp=get_month($3,$4,ts2);
idp=$1;
tpr=$2;
if(x!=0) print $1,$2,tsc
x=1;
}
But it is running infinitely awk -F"[,-]" -f utils.awk sales.txt
Though I tried in awk, I welcome other answers as well that would work in bash environment.
回答1:
General plan:
- assumption:
sales.txt
is already sorted (numerically) by the first column - user provides the min->max date range to be displayed (
awk
variablesmindt
andmaxdt
) - for a distinct
id
value we'll load all prices and dates into an array (prices[]
) - dates will be used as the indices of an associative array to store prices (
prices[YYYY-MM]
) - once we've read all records for a given
id
... - sort the
prices[]
array by the indices (ie, sort byYYYY-MM
) - find the price for the max date less than
mindt
(save asprevprice
) - for each date between
mindt
andmaxdt
(inclusive), if we have a price then display it (and save asprevprice
) else ... - if we don't have a price but we do have a
prevprice
then use thisprevprice
as the current date'sprice
(ie, fill the gap with the previous price)
One (GNU) awk
idea:
mindate='2020-07'
maxdate='2020-12'
awk -v mindt="${mindate}" -v maxdt="${maxdate}" -v OFS=',' -F',' '
# function to add "months" (number) to "indate" (YYYY-MM)
function add_month(indate,months) {
dhms="1 0 0 0" # default day/hr/min/secs
split(indate,arr,"-")
yr=arr[1]
mn=arr[2]
return strftime("%Y-%m", mktime(arr[1]" "(arr[2]+months)" "dhms))
}
# function to print the list of prices for a given "id"
function print_id(id) {
if ( length(prices) == 0 ) # if prices array is empty then do nothing (ie, return)
return
PROCINFO["sorted_in"]="@ind_str_asc" # sort prices[] array by index in ascending order
for ( i in prices ) # loop through indices (YYYY-MM)
{ if ( i < mindt ) # as long as less than mindt
prevprice=prices[i] # save the price
else
break # no more pre-mindt indices to process
}
for ( i=mindt ; i<=maxdt ; i=add_month(i,1) ) # for our mindt - maxdt range
{ if ( !(i in prices) && prevprice ) # if no entry in prices[], but we have a prevprice, then ...
prices[i]=prevprice # set prices[] to prevprice (ie, fill the gap)
if ( i in prices ) # if we have an entry in prices[] then ...
{ prevprice=prices[i] # update prevprice (for filling future gap) and ...
print id,prices[i],i # print our data to stdout
}
}
}
BEGIN { split("",prices) } # pre-declare prices as an array
previd != $1 { print_id(previd) # when id changes print the prices[] array, then ...
previd=$1 # reset some variables for processing of the next id and ...
prevprice=""
delete prices # delete the prices[] array
}
{ prices[$3]=$2 } # for the current record create an entry in prices[]
END { print_id(previd) } # flush the last set of prices[] to stdout
' sales.txt
NOTE: This assumes sales.txt
is sorted (numerically) by the first field; if this is not true then the last line should be changed to ' <(sort -n sales.txt)
This generates:
101,50,2020-07
101,80,2020-08
101,80,2020-09
101,80,2020-10
101,80,2020-11
101,80,2020-12
201,100,2020-09
201,350,2020-10
201,350,2020-11
201,350,2020-12
回答2:
I hope I understood your question a bit. The following awk should do the trick
$ awk -v t1="2020-07" -v d="6" '
function next_month(d,a) {
split(d,a,"-"); a[2]==12?a[1]++ && a[2]=1 : a[2]++
return sprintf("%0.4d-%0.2d",a[1],a[2])
}
BEGIN{FS=OFS=",";t2=t1; for(i=1;i<=d;++i) t2=next_month(t2)}
{k[$1]}
($3<t1){a[$1,t1]=$2}
(t1 <= $3 && $3 < t2) { a[$1,$3]=$2 }
END{ for (key in k) {
p=""; t=t1;
for(i=1;i<=d;++i) {
if(p!="" || (key,t) in a) print key, ((key,t) in a ? p=a[key,t] : p), t
t=next_month(t)
}
}
}' input.txt
We implemented a straightforward function next_month
that computes the next month based on a format YYYY-MM
. Based on the duration of d
months, we compute the time-period that should be shown in the BEGIN
block. The time-period of interest is t1 <= t < t2
.
Every time we read a record/line, we keep track of the key that he's been processed and store it in the array k
. This way we know which key has been seen up to this point.
for all the times before the time-period of interest, we store the value in an array a
with index (key,t1)
, while for all other times, we store the value in the array a
with key (key,$3)
.
When the file is fully processed, we just cycle over all keys and print the output. We used a bit of logic, to check whether or not the month was listed in the original file.
Note: the output will be per key sorted in time, but the key will not appear in the same order as in the original file.
来源:https://stackoverflow.com/questions/65345478/moving-sql-logic-to-backend-bash