Spark SQL UDF with complex input parameter

Happy的楠姐 2020-12-19 13:13

I'm trying to use a UDF whose input type is an array of structs. I have the following data structure (this is only the relevant part of a bigger structure):

|--investment         


        
2 answers
  • 2020-12-19 13:42

    I created a simple library which derives the necessary encoders for complex Product types based on the input type parameters.

    https://github.com/lesbroot/typedudf

    import typedudf.TypedUdf
    import typedudf.ParamEncoder._
    
    case class Foo(x: Int, y: String)
    val fooUdf = TypedUdf((foo: Foo) => foo.x + foo.y.length)
    df.withColumn("sum", fooUdf($"foo"))
    
  • 2020-12-19 14:05

    The error you see should be pretty much self-explanatory. There is a strict mapping between Catalyst / SQL types and Scala types which can be found in the relevant section of the Spark SQL, DataFrames and Datasets Guide.

    In particular, struct types are converted to org.apache.spark.sql.Row (in your case the data will be exposed as Seq[Row]).
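    To make the mapping concrete, here is a minimal sketch that walks a Catalyst type and reports the Scala type a UDF would receive. The helper scalaTypeFor and the sample schema are hypothetical, written only for illustration; the question's full schema is not shown, so the nesting below is assumed from the field path used later in this answer. Only a handful of types are covered.

```scala
import org.apache.spark.sql.types._

object TypeMappingSketch {
  // Hypothetical helper: names the Scala type a UDF sees for a few Catalyst types.
  def scalaTypeFor(dt: DataType): String = dt match {
    case LongType              => "Long"
    case StringType            => "String"
    case ArrayType(element, _) => s"Seq[${scalaTypeFor(element)}]"
    case _: StructType         => "Row"
    case other                 => s"(not covered in this sketch: ${other.simpleString})"
  }

  // Assumed shape: investments is an array of structs holding funding_round.raised_amount.
  val schema: StructType = StructType(Seq(
    StructField("investments", ArrayType(StructType(Seq(
      StructField("funding_round", StructType(Seq(
        StructField("raised_amount", LongType)
      )))
    ))))
  ))

  def main(args: Array[String]): Unit = {
    // Look up the column by name and describe its UDF-side type.
    println(scalaTypeFor(schema("investments").dataType))
  }
}
```

    An array of structs therefore has to be declared as Seq[Row] in the UDF signature, not as a Seq of a case class.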

    There are different methods which can be used to expose data as specific types:

    • Defining a UDT (user-defined type), an API which has been removed in 2.0.0 and has no replacement for now.
    • Converting DataFrame to Dataset[T] where T is a desired local type.

    Of these, only the former approach could be applicable in this particular scenario.

    If you want to access investments.funding_round.raised_amount using a UDF, you'll need something like this:

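    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.udf
    // Each array element arrives as a Row; Try turns any failure (e.g. a null struct) into a null result
    val getRaisedAmount = udf((investments: Seq[Row]) => scala.util.Try(
      investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount"))
    ).toOption)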
    

    but a simple select should be much safer and cleaner:

    df.select($"investments.funding_round.raised_amount")
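
    To compare the two approaches side by side, the following is a minimal local sketch. The case classes, the sample rows and the object name are assumptions made up for illustration (the question's real schema is only partially shown); only the UDF body and the select expression come from the answer above.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

// Hypothetical case classes, used only to build sample data with the assumed nesting.
case class FundingRound(raised_amount: Long)
case class Investment(funding_round: FundingRound)
case class Company(name: String, investments: Seq[Investment])

object RaisedAmountSketch {
  // Returns the raised amounts per row, first via the UDF and then via plain select.
  def compare(spark: SparkSession): (Seq[Seq[Long]], Seq[Seq[Long]]) = {
    import spark.implicits._

    val df = Seq(
      Company("a", Seq(Investment(FundingRound(100L)), Investment(FundingRound(250L)))),
      Company("b", Seq.empty)
    ).toDF()

    // UDF variant: the array of structs is passed in as Seq[Row].
    val getRaisedAmount = udf((investments: Seq[Row]) => scala.util.Try(
      investments.map(_.getAs[Row]("funding_round").getAs[Long]("raised_amount"))
    ).toOption)

    val viaUdf = df.select(getRaisedAmount($"investments").as("raised"))
    // Select variant: a nested field path over an array yields an array of that field.
    val viaSelect = df.select($"investments.funding_round.raised_amount".as("raised"))

    (viaUdf.collect().map(_.getSeq[Long](0)).toSeq,
     viaSelect.collect().map(_.getSeq[Long](0)).toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("raised-amount-sketch")
      .getOrCreate()
    try {
      val (fromUdf, fromSelect) = compare(spark)
      println(s"udf:    $fromUdf")
      println(s"select: $fromSelect")
    } finally spark.stop()
  }
}
```

    Both variants should return the same arrays on this sample. The select form has the further advantage that the expression stays visible to the Catalyst optimizer, whereas a UDF is opaque to it.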
    