Time differences in Apache Pig?

Posted by 你说的曾经没有我的故事 on 2019-12-08 11:44:27

Question


In a Big Data context I have a time series S1=(t1, t2, t3 ...) sorted in ascending order. I would like to produce the series of time differences: S2=(t2-t1, t3-t2 ...)

  1. Is there a way to do this in Apache Pig? Short of a very inefficient self-join, I do not see one.

  2. If not, what would be a good way to do this that is suitable for large amounts of data?


Answer 1:


  1. S1 = Generate Id, Timestamp for t1...tn, where Id is the row's rank
  2. S2 = Generate Id, Timestamp for t2...tn, re-ranked so that Id starts at 1 again
  3. S3 = Join S1 by Id, S2 by Id
  4. S4 = Extract S1.Timestamp, S2.Timestamp, (S2.Timestamp - S1.Timestamp)

Edit

Sample Data

2014-02-19T01:03:37
2014-02-26T01:03:39
2014-02-28T01:03:45
2014-04-01T01:04:22
2014-05-11T01:06:02
2014-06-30T01:08:56

Script

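-- Load the timestamps and convert them to datetime values
s1 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s11 = foreach s1 generate ToDate(t) as t1;
-- RANK without a BY clause prepends a sequential number (field rank_s11)
s1_new = rank s11;

-- Load the same data a second time
s2 = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
s22 = foreach s2 generate ToDate(t) as t1;
s2_new = rank s22;

-- Drop the first row (rank 1) and re-rank, so that rank k in ss_new holds t(k+1)
ss = FILTER s2_new by (rank_s22 > 1);
ss_new = rank ss;

-- Join on rank to pair each t(k) with its successor t(k+1)
s3 = join s1_new by rank_s11,ss_new by rank_ss;
-- DaysBetween(later, earlier) returns the difference in whole days
s4 = foreach s3 generate DaysBetween(ss_new::t1,s1_new::t1) as time_diff;

DUMP s4;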
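
The same idea can probably be written with a single load and a single RANK, by joining the ranked relation against a copy of itself whose index is shifted by one. The following is only a sketch under assumptions: the aliases (ts, ranked, cur, nxt, idx, diff_seconds) are made up for illustration, SecondsBetween is used instead of DaysBetween to get a finer resolution, and the final ORDER is added because a join does not guarantee output order.

```pig
-- Sketch only: alias and field names below are illustrative, not from the original answer
raw    = LOAD 'test2.txt' USING PigStorage() AS (t:chararray);
ts     = FOREACH raw GENERATE ToDate(t) AS t1;

-- One sequential index per row (field rank_ts); assumes the input is already sorted
ranked = RANK ts;

-- cur keeps index k for t(k); nxt shifts the index down so t(k+1) also gets index k
cur    = FOREACH ranked GENERATE rank_ts AS idx, t1 AS t_cur;
nxt    = FOREACH ranked GENERATE rank_ts - 1 AS idx, t1 AS t_next;

-- Equi-join on the index pairs every timestamp with its successor
pairs  = JOIN cur BY idx, nxt BY idx;
diffs  = FOREACH pairs GENERATE cur::idx AS idx,
             SecondsBetween(nxt::t_next, cur::t_cur) AS diff_seconds;

-- Restore the original order before printing
sorted = ORDER diffs BY idx;
DUMP sorted;
```

Because the join key is an equality on the row index, this is an ordinary equi-join rather than a cross product, which may address the efficiency concern raised in the question. The first row of nxt (index 0) and the last row of cur have no partner and simply drop out of the inner join.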



Source: https://stackoverflow.com/questions/35006499/time-differences-in-apache-pig
