Spark 1.6 SQL or Dataframe or Windows


Question


I have a data dump of work orders, shown below. I need to identify the orders that have both an 'In Progress' status and a 'Finished' status.

More precisely, an order should be displayed only when it has an 'In Progress' status together with a 'Finished' or 'Not Valid' status; all rows of a qualifying order are included. What is the best approach for this in Spark? The input and expected output are shown below.

Input

Work_Req_Id,Assigned to,Date,Status
R1,John,3/4/15,In Progress
R1,George,3/5/15,In Progress
R2,Peter,3/6/15,In Progress
R3,Alaxender,3/7/15,Finished
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R4,Patrick,3/10/15,Not Valid
R5,Peter,3/11/15,Finished
R6,,3/12/15,Not Valid
R7,George,3/13/15,Not Valid
R7,George,3/14/15,In Progress
R8,John,3/15/15,Finished
R8,John,3/16/15,Failed
R9,Alaxender,3/17/15,Finished
R9,John,3/18/15,Removed
R10,Patrick,3/19/15,In Progress
R10,Patrick,3/20/15,Finished
R10,Peter,3/21/15,Hold

Output

Work_Req_Id,Assigned to,Date,Status
R3,Alaxender,3/7/15,Finished
R3,Alaxender,3/8/15,In Progress
R7,George,3/13/15,Not Valid
R7,George,3/14/15,In Progress
R10,Patrick,3/19/15,In Progress
R10,Patrick,3/20/15,Finished
R10,Peter,3/21/15,Hold

Answer 1:


You can use groupBy with collect_list to collect the list of statuses per Work_Req_Id, then apply a UDF that filters for the wanted status combinations. The grouped DataFrame is joined back to the original DataFrame to recover all rows of the qualifying orders.

Window functions are not used here because Spark 1.6 does not appear to support collect_list/collect_set in window operations.
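
For reference, here is a minimal sketch of what such a window-based variant could look like on Spark 2.0+, where collect_set appears to be usable over a window (an assumption about later versions; it does not apply to 1.6):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Attach each order's distinct statuses to all of its rows, then keep
// the rows whose order has the wanted status combination.
val byReq = Window.partitionBy($"Work_Req_Id")
df.withColumn("Statuses", collect_set($"Status").over(byReq)).
  where(array_contains($"Statuses", "In Progress") &&
        (array_contains($"Statuses", "Finished") ||
         array_contains($"Statuses", "Not Valid"))).
  drop("Statuses").
  show

For Spark 1.6 itself, the groupBy-and-join approach: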

// Note: in Spark 1.6, collect_list is backed by a Hive UDAF and
// typically requires a HiveContext rather than a plain SQLContext.
import org.apache.spark.sql.functions._ // collect_list, udf
import sqlContext.implicits._           // $"col" syntax and toDF on Seq

val df = Seq(
  ("R1", "John", "3/4/15", "In Progress"),
  ("R1", "George", "3/5/15", "In Progress"),
  ("R2", "Peter", "3/6/15", "In Progress"),
  ("R3", "Alaxender", "3/7/15", "Finished"),
  ("R3", "Alaxender", "3/8/15", "In Progress"),
  ("R4", "Patrick", "3/9/15", "Finished"),
  ("R4", "Patrick", "3/10/15", "Not Valid"),
  ("R5", "Peter", "3/11/15", "Finished"),
  ("R6", "", "3/12/15", "Not Valid"),
  ("R7", "George", "3/13/15", "Not Valid"),
  ("R7", "George", "3/14/15", "In Progress"),
  ("R8", "John", "3/15/15", "Finished"),
  ("R8", "John", "3/16/15", "Failed"),
  ("R9", "Alaxender", "3/17/15", "Finished"),
  ("R9", "John", "3/18/15", "Removed"),
  ("R10", "Patrick", "3/19/15", "In Progress"),
  ("R10", "Patrick", "3/20/15", "Finished"),
  ("R10", "Patrick", "3/21/15", "Hold")
).toDF("Work_Req_Id", "Assigned_To", "Date", "Status")

// An order qualifies when its statuses include 'In Progress'
// together with 'Finished' or 'Not Valid'.
def wanted = udf(
  (statuses: Seq[String]) => statuses.contains("In Progress") &&
    (statuses.contains("Finished") || statuses.contains("Not Valid"))
)

// Orders whose collected status list satisfies the wanted combination.
// Note: drop takes the column name as a String in 1.6.
val df2 = df.groupBy($"Work_Req_Id").agg(collect_list($"Status").as("Statuses")).
  where( wanted($"Statuses") ).
  drop("Statuses")

// Join back to the original DataFrame to recover all rows of those orders.
df.join(df2, Seq("Work_Req_Id")).show

// +-----------+-----------+-------+-----------+
// |Work_Req_Id|Assigned_To|   Date|     Status|
// +-----------+-----------+-------+-----------+
// |         R3|  Alaxender| 3/7/15|   Finished|
// |         R3|  Alaxender| 3/8/15|In Progress|
// |         R7|     George|3/13/15|  Not Valid|
// |         R7|     George|3/14/15|In Progress|
// |        R10|    Patrick|3/19/15|In Progress|
// |        R10|    Patrick|3/20/15|   Finished|
// |        R10|      Peter|3/21/15|       Hold|
// +-----------+-----------+-------+-----------+
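
If you prefer plain SQL, here is a minimal sketch of an equivalent query using conditional aggregation instead of collect_list (the temporary table name orders is a hypothetical choice; untested, but it uses only standard SQL constructs):

// Hypothetical table name; register the DataFrame so SQL can see it.
df.registerTempTable("orders")

sqlContext.sql("""
  SELECT o.*
  FROM orders o
  JOIN (
    SELECT Work_Req_Id
    FROM orders
    GROUP BY Work_Req_Id
    HAVING MAX(CASE WHEN Status = 'In Progress' THEN 1 ELSE 0 END) = 1
       AND MAX(CASE WHEN Status IN ('Finished', 'Not Valid') THEN 1 ELSE 0 END) = 1
  ) q ON o.Work_Req_Id = q.Work_Req_Id
""").show

Since this avoids both the UDF and collect_list, it should also work with a plain SQLContext.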


Source: https://stackoverflow.com/questions/49472718/spark-1-6-sql-or-dataframe-or-windows
