PySpark DataFrame: Access to a Column (TypeError: Column is not iterable)

我寻月下人不归 2020-12-07 05:52

I am struggling with some PySpark code. In particular, I'd like to call a function on a col object, which is not iterable.

from pyspark.sql.functio         


        
1 Answer
  • 2020-12-07 06:35

    PySpark is the Python API for Apache Spark. If you want to apply a custom Python function to the values of a column, you have to wrap it in a user-defined function (UDF).

    Keep your clean_text() function as is (with the translate line commented out) and try the following:

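    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # Plain Python function: it receives one row's value at a time, not a Column.
    def translate(c):
      return translator.translate(c, dest='en', src='auto')

    # Wrap the function as a UDF whose result is a string column.
    translateUDF = udf(translate, StringType())

    # clean_text() only uses built-in column functions, so it can take col(...) directly.
    clean_text_df = uncleanedText.select(
      translateUDF(clean_text(col("unCleanedCol"))).alias("sentence")
    )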
    

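    The other functions in your original clean_text (lower and regexp_replace) are built-in Spark functions and operate on a pyspark.sql.Column, so they can be applied to col(...) directly. A plain Python function that is handed the Column object itself, rather than a row's value, can fail with errors such as TypeError: Column is not iterable.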
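
    Since a UDF is called once per row with a plain Python value (and a SQL NULL arrives as None), it can help to keep the per-value logic in an ordinary function that can be checked without Spark. The following is only a sketch under assumptions: make_translate_fn and add_translation are hypothetical helper names, and translator is assumed to be the object from the question, exposing translate(text, dest=..., src=...) as in the snippet above.

```python
def make_translate_fn(translator):
    """Build a plain per-value function that can later be wrapped with udf().

    `translator` is assumed to expose translate(text, dest=..., src=...),
    as in the snippet above; this helper name is hypothetical.
    """
    def translate(text):
        # A UDF sees one row's value at a time; SQL NULL arrives as None.
        if text is None or not text.strip():
            return text
        return translator.translate(text, dest="en", src="auto")
    return translate


def add_translation(df, translator, source_col="unCleanedCol"):
    """Return `df` with a "sentence" column produced by the translation UDF."""
    # Imported here so make_translate_fn can be tried out without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    translate_udf = udf(make_translate_fn(translator), StringType())
    return df.withColumn("sentence", translate_udf(col(source_col)))
```

    With a DataFrame like the one in the question, this would be called as add_translation(uncleanedText, translator). Skipping None and blank strings before calling the translator is a design choice of this sketch, not something the original code does.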

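    Be aware that a Python UDF carries a performance cost compared with the built-in functions, because each row's value has to be passed between the JVM and a Python worker. See: Spark functions vs UDF performance?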
