How to “negative select” columns in Spark's DataFrame

野的像风 2020-12-15 05:35

I can't figure it out, but I guess it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns I want to exclude; how do I select only the remaining columns?

9 Answers
  •  臣服心动
    2020-12-15 05:56

    OK, it's ugly, but this quick spark-shell session shows something that works:

    scala> val myRDD = sc.parallelize(List.range(1,10))
    myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
    
    scala> val myDF = myRDD.toDF("a")
    myDF: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> val myOtherRDD = sc.parallelize(List.range(1,10))
    myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
    
    scala> val myotherDF = myOtherRDD.toDF("b")
    myotherDF: org.apache.spark.sql.DataFrame = [b: int]
    
    scala> myDF.unionAll(myotherDF)
    res2: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> myDF.join(myotherDF)
    res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val twocol = myDF.join(myotherDF)
    twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val cols = Array("a", "b")
    cols: Array[String] = Array(a, b)
    
    scala> val selectedCols = cols.filter(_!="b")
    selectedCols: Array[String] = Array(a)
    
    scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
    res4: org.apache.spark.sql.DataFrame = [a: int]
    

    Providing varargs to a function that requires them is covered in other SO questions. The signature of select is designed to ensure that your list of selected columns is never empty, which makes the conversion from the list of selected columns to varargs (selectedCols.head, selectedCols.tail: _*) a bit more complex.
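    To see the core of the trick without spinning up a Spark session: the negative selection itself is plain Scala collection work on the column names, and only the final select call needs Spark. A minimal runnable sketch (the column names here are illustrative, not from the original question):

```scala
object NegativeSelect extends App {
  // All column names of the (hypothetical) DataFrame.
  val allCols  = Array("a", "b", "c")
  // Columns we want to exclude.
  val dropCols = Set("b")

  // Keep every column whose name is not in dropCols.
  val keepCols = allCols.filterNot(dropCols.contains)
  println(keepCols.mkString(","))   // prints "a,c"

  // Against a real DataFrame you would then expand this to varargs:
  //   df.select(keepCols.head, keepCols.tail: _*)
}
```

    Note that since Spark 1.4 a DataFrame also has a drop method, so df.drop("b") returns a new DataFrame without that column, which avoids the head/tail varargs dance entirely.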
