Spark: Group RDD SQL Query


Let's work through it step by step. First, let's construct the 2009 part:

event2009RDD.registerTempTable("base2009")
cellLookupRDD.registerTempTable("lookup")

trns2009 = ssc.sql("select eventtype, id, \
    min(case when l.cn = 'RPM' then r.date1 else null end) rpmmn, \
    max(case when l.cn = 'RPM' then r.date1 else null end) rpmmx, \
    min(case when l.cn = 'PPM' then r.date1 else null end) ppmmn, \
    max(case when l.cn = 'PPM' then r.date1 else null end) ppmmx \
    from base2009 r inner join lookup l on r.celltype = l.celltype \
    group by eventtype, id")

trns2009.registerTempTable("transformed2009")

Now you can do a full outer join with the 1001 data set and get the output.
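A rough sketch of that step could look like the following (the table name base1001, its date column, and the join key are assumptions on my part; adjust them to match how you registered the 1001 data set):

# Assuming the 1001 data set was registered in the same way, e.g.
# event1001RDD.registerTempTable("base1001"), and carries a date1 column.
joined = ssc.sql("select coalesce(a.id, b.id) id, \
    a.rpmmn, a.rpmmx, a.ppmmn, a.ppmmx, b.date1 dt1001 \
    from transformed2009 a full outer join base1001 b on a.id = b.id")
joined.registerTempTable("final")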

Note: you should not have

4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL

Instead, you should have

4929101,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39,NULL,NULL

This is because, if a 2009 event occurred at least once, it should have both a first and a last date. NULL should represent an event that never occurred, as for id=4929101, celltype=PPM.
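If you want to confirm that behavior, here is a small check you could run against the 2009 table (assuming id 4929101 is present there, as in your sample output):

# With a single matching row in a group, min(date1) and max(date1)
# both return that row's date1 rather than NULL; NULL only appears
# when the case expression never matches (e.g. no PPM rows for the id).
check = ssc.sql("select id, min(date1) mn, max(date1) mx \
    from base2009 where id = 4929101 group by id")
check.collect()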

Please let me know if this works (or not). I do not have access to Spark right this moment, but I should be able to debug it tonight if needed.
