Hadoop Pig Ordered Analytical Functions

我的梦境 提交于 2019-12-11 19:23:29

问题


I am new in Pig and would like to use an ordered analytical function, similar to what is possible in SQL.

My data looks something like this:

(stock_symbol,date,stock_price_open,stock_price_close)
(TAC,2001-08-06,16.39,16.36)
(TAC,2001-08-07,16.3,16.54)
(TAC,2001-08-08,16.55,16.44)
(TAC,2001-08-09,16.45,16.48)
(TAC,2001-08-10,16.5,15.8)

What I want to do is find the change in opening stock price from day to day. So, my output would look something like this:

(stock_symbol,date,stock_price_open,stock_price_close,stock_price_change)
(TAC,2001-08-06,16.39,16.36,NULL)
(TAC,2001-08-07,16.3,16.54,-0.09)
(TAC,2001-08-08,16.55,16.44,0.25)
(TAC,2001-08-09,16.45,16.48,-0.1)
(TAC,2001-08-10,16.5,15.8,0.05)

I want Pig to be able to look at a row ahead or behind the current row. Is this possible, or does Pig not allow for this type of analysis?


回答1:


You can use the below script to get the output as expected, but might be some fine tunning is required.

A = load '/tmp/pig/test/test' using PigStorage (',');
B= foreach A generate $0 as stock_symbol, ToDate($1,'yyyy-mm-dd') as dt,(double)$2 as stock_price_open, (double)$3 as stock_price_close,'PT24H' as dthour;
C= foreach B generate $0 as stock_symbol, $1 as dt_curr, SubtractDuration($1,$4) as dt_old, $2 as stock_price_open, $3 as stock_price_close;
START = FILTER C BY ($1 == $1);
D = JOIN C by $0 , START by $0;
Filter_D = FILTER D by ((DaysBetween($1,$6)==1) and (DaysBetween($2,$7)==1));
E = foreach Filter_D generate $0 as stock_symbol, $1 as dt, $3 as stock_price_open, $4 as stock_price_close, $3-$8 as stock_price_change;

The Output as :

(TAC,2001-01-07T00:08:00.000-08:00,16.3,16.54,-0.08999999999999986)
(TAC,2001-01-08T00:08:00.000-08:00,16.55,16.44,0.25)
(TAC,2001-01-09T00:08:00.000-08:00,16.45,16.48,-0.10000000000000142)
(TAC,2001-01-10T00:08:00.000-08:00,16.5,15.8,0.05000000000000071)

As we required to calculate the One Day Older Opening Date so have taking variable "PT24H" which defined 24 Hours in Pig. The Same was printed in next action by using ToDate() & SubtractDuration(), follwed by a Join and DaysBetween() action to get the difference.

ToDate(),SubtractDuration(),DaysBetween() are inbilt Function in PIG UDF, u can write suitable UDF for fine tuning the same script, with more proper action.



来源:https://stackoverflow.com/questions/31730249/hadoop-pig-ordered-analytical-functions

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!