Workaround for importing spark implicits everywhere


Question


I'm new to Spark 2.0 and using datasets in our code base. I'm kinda noticing that I need to import spark.implicits._ everywhere in our code. For example:

File A
class A {
    def job(spark: SparkSession) = {
        import spark.implicits._
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds, spark)
    }
    private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    def doSomething(ds: Dataset[Foo]) = {
        import spark.implicits._
        ds.map(e => "SomeString")
    }
}

What I wanted to ask is if there's a cleaner way to be able to do

ds.map(e => "SomeString")

without importing implicits in every function where I do the map? If I don't import it, I get the following error:

Error:(53, 13) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
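
For context, the error happens because Dataset.map takes an implicit Encoder for the result type, and spark.implicits._ is what brings those encoders into scope. A minimal sketch (reusing the question's Foo type; the helper name is hypothetical) that sidesteps the import by passing an encoder explicitly via org.apache.spark.sql.Encoders:

import org.apache.spark.sql.{Dataset, Encoders}

// Hypothetical helper, not from the question: no spark.implicits._ in scope,
// the Encoder[String] that map needs is supplied explicitly instead.
def doSomethingExplicit(ds: Dataset[Foo]): Dataset[String] =
  ds.map(e => "SomeString")(Encoders.STRING)

Passing encoders by hand gets verbose quickly, which is why the answer below focuses on reducing the number of imports instead.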


Answer 1:


Something that would help a bit would be to do the import inside the class or object instead of in each function. For your "File A" and "File B" examples:

File A
class A {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    def job() = {
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
    }

    private def doSomething(ds: Dataset[Foo]) = {
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    import spark.implicits._

    def doSomething(ds: Dataset[Foo]) = {    
        ds.map(e => "SomeString")
    }
}

In this way, you get a manageable number of imports.

Unfortunately, to my knowledge there is no other way to reduce the number of imports further. This is because the import needs the actual SparkSession object to be in scope at the point where it is done. Hence, this is the best that can be done.
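
To make that constraint concrete, here is a minimal sketch (the object name Example is just for illustration): implicits is a member of the SparkSession instance rather than its companion object, so the import is path-dependent and needs a concrete session value in whatever scope does the mapping.

import org.apache.spark.sql.SparkSession

object Example {
  def run(spark: SparkSession): Unit = {
    // `implicits` lives on the SparkSession instance, so this import is
    // path-dependent; there is no static `import SparkSession.implicits._`.
    import spark.implicits._
    val ds = spark.createDataset(Seq(1, 2, 3))
    ds.map(_ + 1).show()
  }
}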


Update:

An even more convenient approach is to create a Scala trait and pair it with an empty object. This lets you import the implicits at the top of each file, while classes or objects that need the SparkSession itself can simply extend the trait.

Example:

import org.apache.spark.sql.SparkSession

trait SparkJob {
  val spark: SparkSession = SparkSession.builder
    .master(...)
    .config(..., ...) // Any settings to be applied
    .getOrCreate()
}

object SparkJob extends SparkJob {}

With this we can do the following for File A and B:

File A:

import SparkJob.spark.implicits._
class A extends SparkJob {
  spark.sql(...) // Allows for usage of the SparkSession inside the class
  ...
}

File B:

import SparkJob.spark.implicits._
class B extends SparkJob {
  ...    
}

Note that it's only necessary to extend SparkJob for the classes or objects that use the spark object itself; anything that only needs the implicits can just keep the import, as in the sketch below.
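
For example, a class that only maps over Datasets and never touches the session directly (the class name C is hypothetical) just needs the import at the top of its file:

import org.apache.spark.sql.Dataset
import SparkJob.spark.implicits._

// Does not extend SparkJob: it never uses `spark` itself, only the implicits.
class C {
  def toUpper(ds: Dataset[String]): Dataset[String] = ds.map(_.toUpperCase)
}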



Source: https://stackoverflow.com/questions/45724290/workaround-for-importing-spark-implicits-everywhere
