Creating a User-Defined Function in Spark SQL

挽巷 2020-12-29 06:15

I am new to Spark and Spark SQL, and I was trying to query some data using Spark SQL.

I need to fetch the month from a date which is given as a string.

I think

3 Answers
  • 2020-12-29 07:17

    In Spark 2.0, you can do this:

    // define the UDF
    def convert2Years(date: String) = date.substring(7, 11)
    // register it with the session
    sparkSession.udf.register("convert2Years", convert2Years(_: String))
    val moviesDf = getMoviesDf // create the DataFrame the usual way
    moviesDf.createOrReplaceTempView("movies") // "movies" is the table name used in the SQL below
    val years = sparkSession.sql("select convert2Years(releaseDate) from movies")
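
    If you prefer the DataFrame API to writing SQL text, the same function can also be wrapped with functions.udf and applied to a column directly. This is a minimal sketch, reusing the moviesDf and releaseDate names from above:

    import org.apache.spark.sql.functions.{col, udf}

    // wrap the plain Scala function as a column-level UDF
    val convert2YearsUdf = udf(convert2Years _)
    // apply it to the releaseDate column without going through SQL text
    val yearsDf = moviesDf.select(convert2YearsUdf(col("releaseDate")).as("releaseYear"))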
    
  • 2020-12-29 07:20

    You can do this, at least for filtering, if you're willing to use a language-integrated query.

    For a data file dates.txt containing:

    one,2014-06-01
    two,2014-07-01
    three,2014-08-01
    four,2014-08-15
    five,2014-09-15
    

    You can pack as much Scala date magic into your UDF as you want, but I'll keep it simple:

    def myDateFilter(date: String) = date contains "-08-"
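
    For example, if you wanted real date parsing rather than substring matching, something along these lines would do (an illustrative sketch using plain Java date classes, not part of the original answer):

    import java.text.SimpleDateFormat
    import java.util.Calendar

    // parse the yyyy-MM-dd string and test the month explicitly
    def isAugust(date: String): Boolean = {
      val cal = Calendar.getInstance()
      cal.setTime(new SimpleDateFormat("yyyy-MM-dd").parse(date))
      cal.get(Calendar.MONTH) == Calendar.AUGUST
    }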
    

    Set it all up as follows -- a lot of this is from the Programming guide.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    
    // case class for your records
    case class Entry(name: String, when: String)
    
    // read and parse the data
    val entries = sc.textFile("dates.txt").map(_.split(",")).map(e => Entry(e(0),e(1)))
    

    You can use the UDF as part of your WHERE clause:

    val augustEntries = entries.where('when)(myDateFilter).select('name, 'when)
    

    and see the results:

    augustEntries.map(r => r(0)).collect().foreach(println)
    

    Notice the version of the where method I've used, declared as follows in the doc:

    def where[T1](arg1: Symbol)(udf: (T1) ⇒ Boolean): SchemaRDD
    

    So, the UDF can only take one argument, but you can compose several .where() calls to filter on multiple columns.
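
    For example, two single-column filters can be chained, reusing entries and myDateFilter from above (a sketch, not from the original answer):

    // keep the August entries whose name starts with "f"
    val augustFs = entries
      .where('when)(myDateFilter)
      .where('name)((name: String) => name.startsWith("f"))
      .select('name, 'when)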

    Edit for Spark 1.2.0 (and really 1.1.0 too)

    While it's not really documented, Spark now supports registering a UDF so it can be queried from SQL.

    The above UDF could be registered using:

    sqlContext.registerFunction("myDateFilter", myDateFilter)
    

    and if the RDD was registered as a table

    sqlContext.registerRDDAsTable(entries, "entries")
    

    it could be queried using

    sqlContext.sql("SELECT * FROM entries WHERE myDateFilter(when)")
    

    For more details see this example.
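
    To tie this back to the original question, a month-extracting function could be registered the same way. This is a sketch that assumes the same entries table and yyyy-MM-dd date strings:

    // register a UDF that pulls the month out of a yyyy-MM-dd string
    sqlContext.registerFunction("getMonth", (d: String) => d.split("-")(1))
    val months = sqlContext.sql("SELECT name, getMonth(when) FROM entries")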

  • 2020-12-29 07:20

    In PySpark 1.5 and above, you can achieve this easily with built-in functions.

    Here is an example:

    # sample data: (sold timestamp, material, revenue)
    raw_data = [("2016-02-27 23:59:59", "Gold", 97450.56),
                ("2016-02-28 23:00:00", "Silver", 7894.23),
                ("2016-02-29 22:59:58", "Titanium", 234589.66)]

    Time_Material_revenue_df = sqlContext.createDataFrame(raw_data, ["Sold_time", "Material", "Revenue"])

    from pyspark.sql.functions import to_date

    # to_date() truncates the timestamp string to its date part
    Day_Material_reveneu_df = Time_Material_revenue_df.select(to_date("Sold_time").alias("Sold_day"), "Material", "Revenue")
    