pyspark Column is not iterable

Backend · Open · 2 answers · 765 views

长情又很酷 2020-12-10 12:08

With this dataframe, I get "Column is not iterable" when I try to groupBy and take the max:

    linesWithSparkDF
    +---+-----+
    | id|cycle|
    +---+-----+
    | 31|  ...|
    +---+-----+
2 Answers
  • 2020-12-10 12:52

    The general technique for avoiding this problem -- an unfortunate namespace collision between some Spark SQL function names and Python built-in function names -- is to import the Spark SQL functions module like this:

    from pyspark.sql import functions as F 
    # USAGE: F.col(), F.max(), ...
    

    Then, using the OP's example, you'd simply apply F like this:

    linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                        .agg(F.max(F.col("cycle")))
    

    This is the standard way to avoid the issue, and it is the convention you will find in most PySpark code in practice.
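    The qualified-import technique is not Spark-specific; a minimal pure-Python sketch of the same idea uses the standard math module, whose pow would collide with the built-in pow just as F.max collides with max:

```python
import math as m  # analogous to: from pyspark.sql import functions as F

# Qualified access keeps the two same-named functions cleanly separated:
print(m.pow(2, 3))  # math.pow, always returns a float: 8.0
print(pow(2, 3))    # built-in pow, untouched by the import: 8
```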

  • 2020-12-10 13:00

    It's because you've shadowed the max function provided by Apache Spark with Python's built-in max. It was easy to spot, because the built-in max expects an iterable.
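    You can reproduce what the built-in max does with a single non-iterable argument without Spark at all; here is a minimal sketch using a hypothetical stand-in for pyspark.sql.Column:

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""
    def __init__(self, name):
        self.name = name

try:
    max(Column("cycle"))  # built-in max tries to iterate its argument
except TypeError as err:
    print(err)  # 'Column' object is not iterable
```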

    To fix this, you can use a different syntax that avoids the name clash:

    linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})
    

    or alternatively

    from pyspark.sql.functions import max as sparkMax
    
    linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
    
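    The import-aliasing trick generalizes beyond Spark. The same pattern with the standard math module (whose pow would otherwise shadow the built-in pow on a bare `from math import pow`) looks like this:

```python
from math import pow as fpow  # aliased import, so the built-in pow survives

print(fpow(2, 3))    # math.pow -> 8.0 (float)
print(pow(2, 3))     # built-in pow -> 8 (int)
print(pow(2, 3, 5))  # three-argument built-in form: 2**3 % 5 -> 3
```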