Question
I am using Spark 1.5.1 with Scala in a Zeppelin notebook.
- I have a DataFrame with a column called userID of type Long.
- In total I have about 4 million rows and 200,000 unique userIDs.
- I also have a list of 50,000 userIDs to exclude.
- I can easily build the list of userIDs to retain.
What is the best way to delete all the rows that belong to the users to exclude?
Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain?
I saw this post and applied its solution (see the code below), but the execution is slow, even though I am running Spark 1.5.1 on my local machine with a decent 16 GB of RAM, and the initial DataFrame fits in memory.
Here is the code that I am applying:
import org.apache.spark.sql.functions.lit

// keep only the rows whose userID appears in listOfUsersToKeep
val finalDataFrame = initialDataFrame.where($"userID".in(listOfUsersToKeep.map(lit(_)): _*))
In the code above:
- initialDataFrame has 3,885,068 rows and 5 columns, one of which is userID, containing Long values.
- listOfUsersToKeep is an Array[Long] containing 150,000 userIDs.
I wonder if there is a more efficient solution than the one I am using.
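For reference, Spark 1.5 also provides Column.isin (the replacement for the now-deprecated in), which skips the manual lit mapping; a minimal equivalent of my filter, assuming the same names as above:

// same filter via isin; it still builds one In expression over 150,000 literals,
// so I would not expect it to be noticeably faster
val finalDataFrame = initialDataFrame.where($"userID".isin(listOfUsersToKeep: _*))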
Thanks
Answer 1:
You can either use a join:
val usersToKeep = sc.parallelize(
  listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")

// inner join keeps only the rows whose userID matches a retained ID
val finalDataFrame = usersToKeep
  .join(initialDataFrame, $"userID" === $"userID_")
  .drop("userID_")
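Note that in this form the join shuffles both sides on the join key; since usersToKeep only holds 150,000 IDs, the broadcast variant at the end of this answer should avoid shuffling the large initialDataFrame.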
or a broadcast variable and a UDF:
import org.apache.spark.sql.functions.udf

// ship the set of retained IDs to the executors once, then test membership per row
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))

val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
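The same broadcast-set idea also covers the exclusion phrasing of the question directly, and a Set lookup per row should be cheap compared to evaluating a 150,000-literal In expression. A minimal sketch, assuming listOfUsersToExclude is the 50,000-ID Array[Long] mentioned in the question:

// listOfUsersToExclude is the hypothetical 50,000-ID exclusion list from the question
val usersToExcludeBD = sc.broadcast(listOfUsersToExclude.toSet)
val keepRow = udf((id: Long) => !usersToExcludeBD.value.contains(id))

// retain every row whose userID is NOT in the excluded set
val finalDataFrame = initialDataFrame.where(keepRow($"userID"))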
It should also be possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast

// hint the planner to broadcast the small DataFrame instead of shuffling both sides
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_").drop("userID_")
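For the exclusion variant, Spark 1.5 has no left anti join (that join type arrived in later versions), but a broadcast left outer join plus a null check expresses the same thing. A minimal sketch, assuming a hypothetical usersToExclude DataFrame built from the 50,000-ID list the same way as usersToKeep:

// hypothetical DataFrame of the IDs to drop, built like usersToKeep above
val usersToExclude = sc.parallelize(
  listOfUsersToExclude.map(Tuple1(_))).toDF("userID_")

// rows with no match on the excluded side come back with a null userID_
val finalDataFrame = initialDataFrame
  .join(broadcast(usersToExclude), $"userID" === $"userID_", "left_outer")
  .where($"userID_".isNull)
  .drop("userID_")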
Source: https://stackoverflow.com/questions/33824933/spark-dataframe-filtering-retain-element-belonging-to-a-list