DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23
I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RD

      
      
        
[愿得一人] (original poster) 2020-11-27 04:54

Here is my proposal. Its advantages are:

- It does not involve any serialization/deserialization^[1] of our DataFrame's InternalRows.
- Its logic is minimal, relying only on RDD.zipWithIndex.

Its major downsides are:

- It cannot be used directly from non-JVM APIs (PySpark, SparkR).
- It has to live under the package org.apache.spark.sql, because it calls Dataset.ofRows, which is private[sql].

Imports (DataFrame and Dataset need no import because the code lives in package org.apache.spark.sql):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.functions.lit


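/**
  * Optimized Spark SQL equivalent of RDD.zipWithIndex.
  *
  * @param df the DataFrame to index
  * @param indexColName name of the index column to add; an existing column
  *                     with this name is dropped first
  * @return `df` with a column named `indexColName` of consecutive unique ids.
  */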
def zipWithIndex(df: DataFrame, indexColName: String = "index"): DataFrame = {
  import df.sparkSession.implicits._

  val dfWithIndexCol: DataFrame = df
    .drop(indexColName)
    .select(lit(0L).as(indexColName), $"*")

  val internalRows: RDD[InternalRow] = dfWithIndexCol
    .queryExecution
    .toRdd
    .zipWithIndex()
    .map {
      case (internalRow: InternalRow, index: Long) =>
        // Overwrite the placeholder 0L in column 0 with the real index, in place.
        internalRow.setLong(0, index)
        internalRow
    }

  Dataset.ofRows(
    df.sparkSession,
    LogicalRDD(dfWithIndexCol.schema.toAttributes, internalRows)(df.sparkSession)
  )
}

^[1]: That is, conversion between InternalRow's underlying byte array and GenericRow's underlying collection of JVM objects, Array[Any].
    
             
                                                        
            
            
              
                