How to resolve the AnalysisException: resolved attribute(s) in Spark

故里飘歌 2020-12-14 07:03

    val rdd = sc.parallelize(Seq(("vskp", Array(2.0, 1.0, 2.1, 5.4)), ("hyd", Array(1.5, 0.5, 0.9, 3.7)), ("hyd", Array(1.5, 0.5, 0.9, 3.2)), ("tvm", Array(8.0, 2.9, ...

12 Answers
  • 2020-12-14 07:31

    Thanks to Tomer's answer.

    For Scala: the issue came up when I tried to reuse a column in a self-join condition. To fix it, rename the join columns on one side before joining, using the helpers below:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.lit
    
    // `and` together all of the column conditions
    def andAll(cols: Iterable[Column]): Column =
       if (cols.isEmpty) lit(true)
       else cols.tail.foldLeft(cols.head) { case (soFar, curr) => soFar.and(curr) }
    
    // Rename the join columns on the right side, then join on the renamed columns
    def renameColAndJoin(leftDf: DataFrame, joinCols: Seq[String], joinType: String = "inner")(rightDf: DataFrame): DataFrame = {
    
       val renamedCols: Seq[String]          = joinCols.map(colName => s"${colName}_renamed")
       val zippedCols: Seq[(String, String)] = joinCols.zip(renamedCols)
    
       val renamedRightDf: DataFrame = zippedCols.foldLeft(rightDf) {
         case (df, (origColName, renamedColName)) => df.withColumnRenamed(origColName, renamedColName)
       }
    
       // Compare the left side's original columns with the right side's renamed
       // columns; referencing rightDf here would reintroduce the ambiguity
       val joinExpr: Column = andAll(zippedCols.map {
         case (origCol, renamedCol) => leftDf(origCol).equalTo(renamedRightDf(renamedCol))
       })
    
       leftDf.join(renamedRightDf, joinExpr, joinType)
    }
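    
    A minimal usage sketch (the DataFrame names leftDf/rightDf and the join column "device_id" are assumptions for illustration, not from the original post):
    
    // Hypothetical call site: "device_id" is renamed to "device_id_renamed"
    // on the right side internally, so the join plan has no duplicate names
    val joined: DataFrame = renameColAndJoin(leftDf, Seq("device_id"))(rightDf)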
    
  • 2020-12-14 07:33

    If you have df1, and df2 derived from df1, try renaming all shared columns in df2 so that no two columns have the same name after the join. So instead of df1.join(df2, ...), do:

    # Step 1 rename shared column names in df2.
    df2_renamed = df2.withColumnRenamed('columna', 'column_a_renamed').withColumnRenamed('columnb', 'column_b_renamed')
    
    # Step 2 do the join on the renamed df2 such that no two columns have same name.
    df1.join(df2_renamed)
    
  • 2020-12-14 07:34

    As mentioned in my comment, it is related to https://issues.apache.org/jira/browse/SPARK-10925 and, more specifically, https://issues.apache.org/jira/browse/SPARK-14948. Reusing the reference creates ambiguity in naming, so you will have to clone the DataFrame - see the last comment in https://issues.apache.org/jira/browse/SPARK-14948 for an example.
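
    A minimal sketch of the clone workaround (df, spark, and the join column "id" are assumptions; the technique mirrors the one shown in a later answer here):

    // Rebuilding the DataFrame from its RDD and schema assigns fresh
    // attribute IDs, so the clone can safely be joined with the original
    val dfClone = spark.createDataFrame(df.rdd, df.schema)
    df.join(dfClone, df("id") === dfClone("id"))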

  • 2020-12-14 07:35

    This issue cost me a lot of time, and I finally found an easy solution for it.

    In PySpark, for the problematic column, say colA, we can simply use

    import pyspark.sql.functions as F
    
    df = df.select(F.col("colA").alias("colA"))
    

    prior to using df in the join.

    I think this should work for Scala/Java Spark too.
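
    A hedged sketch of the Scala analogue (extrapolated from the PySpark fix above; the original answer did not test it):

    import org.apache.spark.sql.functions.col

    // Re-selecting the column under its own name gives the projection a
    // fresh attribute ID, which can break the ambiguous lineage
    val dfFresh = df.select(col("colA").alias("colA"))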

  • 2020-12-14 07:38

    In my case this error appeared during a self-join of the same table. I was facing the issue below with Spark SQL, not the DataFrame API:

    org.apache.spark.sql.AnalysisException: Resolved attribute(s) originator#3084,program_duration#3086,originator_locale#3085 missing from program_duration#1525,guid#400,originator_locale#1524,EFFECTIVE_DATETIME_UTC#3157L,device_timezone#2366,content_rpd_id#734L,originator_sublocale#2355,program_air_datetime_utc#3155L,originator#1523,master_campaign#735,device_provider_id#2352 in operator !Deduplicate [guid#400, program_duration#3086, device_timezone#2366, originator_locale#3085, originator_sublocale#2355, master_campaign#735, EFFECTIVE_DATETIME_UTC#3157L, device_provider_id#2352, originator#3084, program_air_datetime_utc#3155L, content_rpd_id#734L]. Attribute(s) with the same name appear in the operation: originator,program_duration,originator_locale. Please check if the right attribute(s) are used.;;
    

    Earlier I was using the query below:

    SELECT * FROM DataTable AS aext
    INNER JOIN AnotherDataTable LAO
      ON aext.device_provider_id = LAO.device_provider_id
    

    Selecting only required columns before joining solved the issue for me.

    SELECT * FROM (
      SELECT DISTINCT EFFECTIVE_DATE, system, mso_Name, EFFECTIVE_DATETIME_UTC, content_rpd_id, device_provider_id
      FROM DataTable
    ) AS aext
    INNER JOIN AnotherDataTable LAO
      ON aext.device_provider_id = LAO.device_provider_id
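
    For reference, a minimal sketch of the same fix through the DataFrame API (the DataFrame names dataTable and lao are assumptions standing in for the two tables above):

    // Project only the columns the join needs, and de-duplicate, before
    // joining; the narrow projection sidesteps the ambiguous attributes
    val aext = dataTable
      .select("EFFECTIVE_DATE", "system", "mso_Name", "EFFECTIVE_DATETIME_UTC", "content_rpd_id", "device_provider_id")
      .distinct()
    aext.join(lao, aext("device_provider_id") === lao("device_provider_id"))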
  • 2020-12-14 07:39

    I got the same issue when trying to use one DataFrame in two consecutive joins.

    Here is the problem: DataFrame A has 2 columns (let's call them x and y) and DataFrame B has 2 columns as well (let's call them w and z). I need to join A with B on x=z and then join the result with B again on y=z.

    (A join B on A.x=B.z) as C join B on C.y=B.z
    

    I was getting exactly this error: in the second join it complained "resolved attribute(s) B.z#1234 ...".

    Following the links @Erik provided, plus some other blogs and questions, I gathered that I needed a clone of B.

    Here is what I did:

    val aDF = ...
    val bDF = ...
    // Rebuilding B from its RDD and schema produces a DataFrame with fresh
    // attribute IDs, so the second join no longer sees ambiguous references
    val bCloned = spark.createDataFrame(bDF.rdd, bDF.schema)
    aDF.join(bDF, aDF("x") === bDF("z")).join(bCloned, aDF("y") === bCloned("z"))
    