TypeError: 'Column' object is not callable using WithColumn

广开言路 2020-12-17 23:15

I would like to append a new column to the DataFrame "df" using the function get_distance:

def get_distance(x, y):
         


        
2 answers
  • 2020-12-17 23:53
    • You cannot apply an ordinary Python function to Column objects directly, unless it is written to operate on Column objects / expressions. Otherwise you need a udf:

      @udf
      def get_distance(x, y):
          ...
      
    • However, you cannot use a SQLContext inside a udf (or any mapper in general), because the context exists only on the driver.

    • Use a join instead:

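      from pyspark.sql.functions import first
      # Keep one column3 value per (column1, column2) pair, then join it onto df.
      tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3"))
      df.join(tab, ["column1", "column2"])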
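
The join suggestion above can be sketched end to end. This is a minimal sketch, assuming `tab` has already been loaded as a DataFrame with columns column1, column2 and column3. The helper name add_distance, the output column name distance, and the left join (to keep rows of df that have no match) are illustrative choices, not part of the original answer:

```python
def add_distance(df, tab):
    """Attach one column3 value per (column1, column2) pair to df as 'distance'.

    df and tab are expected to be pyspark.sql.DataFrame objects.
    """
    # Imported lazily so this module can be loaded where Spark is not installed.
    from pyspark.sql import functions as F

    # One row per key; first() picks an arbitrary value within each group,
    # which mirrors the "limit 1" lookup from the question.
    lookup = (
        tab.groupBy("column1", "column2")
           .agg(F.first("column3").alias("distance"))
    )

    # Assumption: a left join, so rows of df without a match get a null distance.
    return df.join(lookup, on=["column1", "column2"], how="left")
```

Calling `add_distance(df, hiveContext.table("tab"))` would then replace the per-row query entirely, so no context is needed on the executors.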
      
  • 2020-12-17 23:58

    Spark needs to know that the function you are using is a UDF, not an ordinary Python function.

    There are two ways to use a UDF on DataFrames.

    Method 1: With the @udf decorator

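    from pyspark.sql.functions import udf

    @udf
    def get_distance(x, y):
        # Caveat: hiveContext exists only on the driver, so this query cannot
        # actually run inside a UDF on the executors (see the first answer).
        dfDistPerc = hiveContext.sql("select column3 as column3 \
                                      from tab \
                                      where column1 = '" + str(x) + "' \
                                      and column2 = " + str(y) + " \
                                      limit 1")

        result = dfDistPerc.select("column3").take(1)
        # take(1) returns a list of Row objects; unwrap the single value.
        return result[0]["column3"] if result else None

    # The UDF call already returns a Column, so wrapping it in lit() is unnecessary.
    df = df.withColumn(
        "distance",
        get_distance(df["column1"], df["column2"])
    )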
    

    Method 2: Registering the UDF with pyspark.sql.functions.udf

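    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def get_distance(x, y):
        # Caveat: hiveContext exists only on the driver, so this query cannot
        # actually run inside a UDF on the executors (see the first answer).
        dfDistPerc = hiveContext.sql("select column3 as column3 \
                                      from tab \
                                      where column1 = '" + str(x) + "' \
                                      and column2 = " + str(y) + " \
                                      limit 1")

        result = dfDistPerc.select("column3").take(1)
        # take(1) returns a list of Row objects; unwrap the single value.
        return result[0]["column3"] if result else None

    # Wrap the plain function and declare the return type explicitly.
    calculate_distance_udf = udf(get_distance, IntegerType())

    # The UDF call already returns a Column, so wrapping it in lit() is unnecessary.
    df = df.withColumn(
        "distance",
        calculate_distance_udf(df["column1"], df["column2"])
    )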
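
For contrast, a UDF does work when its body is plain Python that touches no Spark context. The following is a minimal sketch, assuming a hypothetical distance rule (the absolute difference of two numeric columns) that stands in for the original get_distance body, which the question does not show. The helper name with_distance is also made up for illustration:

```python
def get_distance(x, y):
    """Hypothetical stand-in: absolute difference of two numbers, None-safe."""
    # Spark passes SQL NULL values to a Python UDF as None.
    if x is None or y is None:
        return None
    return float(abs(x - y))


def with_distance(df):
    """Return df (a pyspark.sql.DataFrame) with an extra 'distance' column."""
    # Imported lazily so get_distance can be exercised without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Wrap the plain function; the declared type must match what it returns.
    distance_udf = udf(get_distance, DoubleType())

    # Calling the UDF on Column objects yields a new Column expression.
    return df.withColumn("distance", distance_udf(df["column1"], df["column2"]))
```

Given a DataFrame with numeric column1 and column2, `with_distance(df)` returns it with the extra column. When the value has to come from another table, as in the question, the join from the first answer is likely the better fit.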
    