Spark DataFrame filtering: retain element belonging to a list

Posted by 狂风中的少年 on 2019-12-22 08:37:37

Question


I am using Spark 1.5.1 with Scala in a Zeppelin notebook.

  • I have a DataFrame with a column called userID of type Long.
  • In total I have about 4 million rows and 200,000 unique userID values.
  • I also have a list of 50,000 userID values to exclude.
  • I can easily build the list of userID values to retain.

What is the best way to delete all the rows that belong to the users to exclude?

Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?

I saw this post and applied its solution (see the code below), but execution is slow. I am running Spark 1.5.1 on my local machine, which has a decent 16 GB of RAM, and the initial DataFrame fits in memory.

Here is the code that I am applying:

import org.apache.spark.sql.functions.lit
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)):_*))

In the code above:

  • initialDataFrame has 3885068 rows; each row has 5 columns, one of which is called userID and contains Long values.
  • listOfUsersToKeep is an Array[Long] containing 150,000 userID values.

I wonder if there is a more efficient solution than the one I am using.

Thanks


Answer 1:


You can either use join:

val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")

or a broadcast variable and a UDF:

import org.apache.spark.sql.functions.udf

val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
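Since the question also asks about working from the list of users to exclude, the same broadcast idea can be inverted. The following is only a sketch: the helper name dropExcludedUsers and the parameter usersToExclude are hypothetical and do not come from the original post, and it uses col("userID") instead of $"userID" so that it does not depend on the notebook's implicits.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper (not part of the original answer): keep only the rows
// whose userID is NOT in usersToExclude.
def dropExcludedUsers(initialDataFrame: DataFrame, usersToExclude: Seq[Long]): DataFrame = {
  // Reuse the SparkContext that already backs the DataFrame.
  val sc = initialDataFrame.sqlContext.sparkContext

  // Ship the (smaller) exclusion set to every executor once.
  val usersToExcludeBD = sc.broadcast(usersToExclude.toSet)

  // The UDF returns true for rows that should be retained.
  val isRetained = udf((id: Long) => !usersToExcludeBD.value.contains(id))

  initialDataFrame.where(isRetained(col("userID")))
}
```

With the numbers from the question, this would broadcast 50,000 IDs instead of 150,000; whether that makes a measurable difference is something to check on your own data.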

It should also be possible to broadcast a DataFrame:

import org.apache.spark.sql.functions.broadcast

initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")
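The joins above keep the matching rows. If you would rather start from the list of users to exclude, one possible approach is a left outer join followed by a null check on the joined key. This is a sketch under assumptions, not something from the original answer: the helper name excludeUsers and the parameter usersToExclude are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper (not part of the original answer): remove every row
// whose userID appears in usersToExclude.
def excludeUsers(initialDataFrame: DataFrame, usersToExclude: Seq[Long]): DataFrame = {
  val sqlContext = initialDataFrame.sqlContext

  // One-column DataFrame holding the IDs to drop.
  val excluded = sqlContext
    .createDataFrame(usersToExclude.map(Tuple1(_)))
    .toDF("userID_")

  initialDataFrame
    // Every row of initialDataFrame survives the left outer join ...
    .join(excluded, col("userID") === col("userID_"), "left_outer")
    // ... and rows without a match on the exclusion side have a null userID_.
    .where(col("userID_").isNull)
    .drop("userID_")
}
```

Calling excludeUsers(initialDataFrame, listOfUsersToExclude) should return the same rows as filtering on the list of users to keep, assuming the two lists together cover every userID in the data.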


Source: https://stackoverflow.com/questions/33824933/spark-dataframe-filtering-retain-element-belonging-to-a-list
