How to resolve the AnalysisException: resolved attribute(s) in Spark

后端未结

关注

 12  910

val rdd = sc.parallelize(Seq((\"vskp\", Array(2.0, 1.0, 2.1, 5.4)),(\"hyd\",Array(1.5, 0.5, 0.9, 3.7)),(\"hyd\", Array(1.5, 0.5, 0.9, 3.2)),(\"tvm\", Array(8.0, 2.9,


                      
              相关标签:


      
      
        
          12条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  醉酒成梦        
                
              
                            
                2020-12-14 07:31
              
            
            
                                                                       
Thanks to Tomer's Answer
For scala - The issue came up when I tried to use the column in the self-join clause, to fix it use the method
// To `and` all the column conditions
def andAll(cols: Iterable[Column]): Column =
   if (cols.isEmpty) lit(true)
   else cols.tail.foldLeft(cols.head) { case (soFar, curr) => soFar.and(curr) }

// To perform join different col name
def renameColAndJoin(leftDf: DataFrame, joinCols: Seq[String], joinType: String = "inner")(rightDf: DataFrame): DataFrame = {

   val renamedCols: Seq[String]          = joinCols.map(colName => s"${colName}_renamed")
   val zippedCols: Seq[(String, String)] = joinCols.zip(renamedCols)

   val renamedRightDf: DataFrame = zippedCols.foldLeft(rightDf) {
     case (df, (origColName, renamedColName)) => df.withColumnRenamed(origColName, renamedColName)
   }

   val joinExpr: Column = andAll(zippedCols.map {
     case (origCol, renamedCol) => renamedRightDf(renamedCol).equalTo(rightDf(origCol))
   })

   leftDf.join(renamedRightDf, joinExpr, joinType)

}

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  独厮守ぢ        
                
              
                            
                2020-12-14 07:33
              
            
            
                                                                       
If you have df1, and df2 derived from df1, try renaming all columns in df2 such that no two columns have identical name after join.  So before the join:

so instead of df1.join(df2...

do

# Step 1 rename shared column names in df2.
df2_renamed = df2.withColumnRenamed('columna', 'column_a_renamed').withColumnRenamed('columnb', 'column_b_renamed')

# Step 2 do the join on the renamed df2 such that no two columns have same name.
df1.join(df2_renamed)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2020-12-14 07:34
              
            
            
                                                                       
As mentioned in my comment, it is related to https://issues.apache.org/jira/browse/SPARK-10925 and, more specifically https://issues.apache.org/jira/browse/SPARK-14948. Reuse of the reference will create ambiguity in naming, so you will have to clone the df - see the last comment in https://issues.apache.org/jira/browse/SPARK-14948 for an example.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  [愿得一人]        
                
              
                            
                2020-12-14 07:35
              
            
            
                                                                       
This issue really killed a lot of my time and I finally got an easy solution for it.
In PySpark, for the problematic column, say colA, we could simply use
import pyspark.sql.functions as F

df = df.select(F.col("colA").alias("colA"))

prior to using df in the join.
I think this should work for Scala/Java Spark too.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一整个雨季        
                
              
                            
                2020-12-14 07:38
              
            
            
                                                                       
In my case this error appeared during self join of same table.
I was facing the below issue with Spark SQL and not the dataframe API: 

org.apache.spark.sql.AnalysisException: Resolved attribute(s) originator#3084,program_duration#3086,originator_locale#3085 missing from program_duration#1525,guid#400,originator_locale#1524,EFFECTIVE_DATETIME_UTC#3157L,device_timezone#2366,content_rpd_id#734L,originator_sublocale#2355,program_air_datetime_utc#3155L,originator#1523,master_campaign#735,device_provider_id#2352 in operator !Deduplicate [guid#400, program_duration#3086, device_timezone#2366, originator_locale#3085, originator_sublocale#2355, master_campaign#735, EFFECTIVE_DATETIME_UTC#3157L, device_provider_id#2352, originator#3084, program_air_datetime_utc#3155L, content_rpd_id#734L]. Attribute(s) with the same name appear in the operation: originator,program_duration,originator_locale. Please check if the right attribute(s) are used.;;


Earlier I was using below query,

    SELECT * FROM DataTable as aext
             INNER JOIN AnotherDataTable LAO 
ON aext.device_provider_id = LAO.device_provider_id 


Selecting only required columns before joining solved the issue for me.

      SELECT * FROM (
    select distinct EFFECTIVE_DATE,system,mso_Name,EFFECTIVE_DATETIME_UTC,content_rpd_id,device_provider_id 
from DataTable 
) as aext
         INNER JOIN AnotherDataTable LAO ON aext.device_provider_id = LAO.device_provider_id 

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘了有多久        
                
              
                            
                2020-12-14 07:39
              
            
            
                                                                       
I got the same issue when trying to use one DataFrame in two consecutive joins.

Here is the problem: DataFrame A has 2 columns (let's call them x and y) and DataFrame B has 2 columns as well (let's call them w and z). I need to join A with B on x=z and then join them together on y=z.

(A join B on A.x=B.z) as C join B on C.y=B.z


I was getting the exact error that in the second join it was complaining "resolved attribute(s) B.z#1234 ...".

Following the links @Erik provided and some other blogs and questions, I gathered I need a clone of B.

Here is what I did:



val aDF = ...
val bDF = ...
val bCloned = spark.createDataFrame(bDF.rdd, bDF.schema)
aDF.join(bDF, aDF("x") === bDF("z")).join(bCloned, aDF("y") === bCloned("z"))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复