I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession.
I like to create a SparkSessionTestWrapper trait that can be mixed into test classes. Shankar's approach works, but it's prohibitively slow for test suites with multiple files; the wrapper below calls getOrCreate(), so every test file reuses the same session instead of spinning up a new one.
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  // lazy val + getOrCreate() means one SparkSession is shared across the whole test run
  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark session")
      .getOrCreate()
  }

}
The trait can be used as follows:
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper {

  import spark.implicits._

  describe("#count") {

    it("returns a count of all the rows in a DataFrame") {

      val sourceDF = Seq(
        ("jets"),
        ("barcelona")
      ).toDF("team")

      assert(sourceDF.count === 2)

    }

  }

}
Check the spark-spec project for a real-life example that uses the SparkSessionTestWrapper approach.
Update
The spark-testing-base library automatically adds the SparkSession when certain traits are mixed into the test class (e.g. when DataFrameSuiteBase is mixed in, you'll have access to the SparkSession via the spark variable).
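Here's a minimal sketch of what that looks like (this assumes a spark-testing-base version that exposes the SparkSession as spark; the class name and test body are just for illustration):

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.FunSuite

class ProvidedSessionSpec extends FunSuite with DataFrameSuiteBase {

  test("uses the spark variable the trait provides") {
    // spark is supplied by DataFrameSuiteBase, not by our own wrapper trait
    import spark.implicits._
    val df = Seq("jets", "barcelona").toDF("team")
    assert(df.count === 2)
  }

}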
I created a separate testing library called spark-fast-tests to give users full control of the SparkSession when running their tests. I don't think a test helper library should manage the SparkSession for you. Users should be able to start and stop their SparkSession as they see fit (I like to create one SparkSession and use it throughout the test suite run).
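If you do want to tear the session down after a suite finishes, ScalaTest's BeforeAndAfterAll makes that explicit. A minimal sketch (the class name is hypothetical):

import org.scalatest.{BeforeAndAfterAll, FunSpec}

class TeardownSpec extends FunSpec with SparkSessionTestWrapper with BeforeAndAfterAll {

  override def afterAll(): Unit = {
    // stop the shared session once every test in this suite has run
    spark.stop()
    super.afterAll()
  }

  it("still has a live session inside tests") {
    import spark.implicits._
    assert(Seq("a", "b").toDF("letter").count === 2)
  }

}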
Here's an example of the spark-fast-tests assertSmallDatasetEquality method in action:
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  it("aliases a DataFrame") {

    val sourceDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("name")

    val actualDF = sourceDF.select(col("name").alias("student"))

    val expectedDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("student")

    assertSmallDatasetEquality(actualDF, expectedDF)

  }

}
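As the name suggests, assertSmallDatasetEquality is meant for small Datasets: it compares the contents on the driver, so keep the test data tiny to keep the suite fast.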