first_value windowing function in pyspark


I am using PySpark 1.5, getting my data from Hive tables, and trying to use windowing functions.

According to this, there exists an analytic function called first_value.

1 Answer
  • 2020-12-19 22:43

    Spark >= 2.0:

    first takes an optional ignorenulls argument which can mimic the behavior of first_value (using the same df, w, and imports as in the Spark < 2.0 example below):

    df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv"))
    
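    For completeness, a minimal end-to-end sketch assuming a Spark 2.x session; the SparkSession setup and the printed output are illustrative additions, not part of the original answer:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, first

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("a", None), ("a", 1), ("a", -1), ("b", 3)], ["k", "v"]
    )
    w = Window.partitionBy("k").orderBy("v")

    # With the default running frame (UNBOUNDED PRECEDING to CURRENT ROW),
    # the row ("a", None) still gets a null, because its frame contains
    # nothing but the null; every later row in partition a gets -1.
    df.select(col("k"), first("v", ignorenulls=True).over(w).alias("fv")).show()

    # Expected (row order may vary):
    # +---+----+
    # |  k|  fv|
    # +---+----+
    # |  a|null|
    # |  a|  -1|
    # |  a|  -1|
    # |  b|   3|
    # +---+----+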

    Spark < 2.0:

    The available function is called first and can be used as follows:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, first

    df = sc.parallelize([
        ("a", None), ("a", 1), ("a", -1), ("b", 3)
    ]).toDF(["k", "v"])

    w = Window.partitionBy("k").orderBy("v")

    df.select(col("k"), first("v").over(w).alias("fv"))
    
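    Since Spark sorts nulls first in ascending order, the leading null in partition a is the first value seen by every row's frame, so fv comes back null for the whole partition. A quick illustrative check (row order in the output may vary):

    df.select(col("k"), first("v").over(w).alias("fv")).show()

    # +---+----+
    # |  k|  fv|
    # +---+----+
    # |  a|null|
    # |  a|null|
    # |  a|null|
    # |  b|   3|
    # +---+----+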

    but if you want to ignore nulls, you'll have to use Hive UDFs directly:

    df.registerTempTable("df")
    
    sqlContext.sql("""
        SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
        FROM df""")
    
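    Note that window functions in Spark < 2.0 are only available through a HiveContext, so the snippet above assumes sqlContext is one (as it typically is in the 1.x PySpark shell). A self-contained sketch of the SQL route; the fv alias is added here for readability and is not in the original answer:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    df = sc.parallelize([
        ("a", None), ("a", 1), ("a", -1), ("b", 3)
    ]).toDF(["k", "v"])
    df.registerTempTable("df")

    # first_value(v, TRUE) is Hive's skip-nulls variant of first_value
    sqlContext.sql("""
        SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v) AS fv
        FROM df""").show()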