Spark DataSet filter performance
I have been experimenting with different ways to filter a typed Dataset, and it turns out the performance can differ considerably.

The Dataset was created from a 1.6 GB CSV file with 33 columns and 4,226,047 rows. It is built by loading the CSV data and mapping it to a case class:

val df = spark.read.csv(csvFile).as[FireIncident]

A filter on UnitID = 'B02' should return 47,980 rows. I tested three ways, shown below:

1) Use a typed column (~500 ms on localhost):

df.where($"UnitID" === "B02").count()

2) Use a temp table and a SQL query (about the same as option 1):

df.createOrReplaceTempView("FireIncidentsSF")
spark