How to “negative select” columns in Spark's DataFrame

野的像风 2020-12-15 05:35

I can't figure it out, but I guess it's simple. I have a Spark DataFrame df. This df has columns "A", "B" and "C". Now let's say I have an Array containing the names of the columns I want to exclude; how do I select only the remaining columns?

9 Answers
  •  臣服心动
    2020-12-15 05:56

    OK, it's ugly, but this quick spark-shell session shows something that works:

    scala> val myRDD = sc.parallelize(List.range(1,10))
    myRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at parallelize at <console>:21
    
    scala> val myDF = myRDD.toDF("a")
    myDF: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> val myOtherRDD = sc.parallelize(List.range(1,10))
    myOtherRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <console>:21
    
    scala> val myotherDF = myOtherRDD.toDF("b")
    myotherDF: org.apache.spark.sql.DataFrame = [b: int]
    
    scala> myDF.unionAll(myotherDF)
    res2: org.apache.spark.sql.DataFrame = [a: int]
    
    scala> myDF.join(myotherDF)
    res3: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val twocol = myDF.join(myotherDF)
    twocol: org.apache.spark.sql.DataFrame = [a: int, b: int]
    
    scala> val cols = Array("a", "b")
    cols: Array[String] = Array(a, b)
    
    scala> val selectedCols = cols.filter(_!="b")
    selectedCols: Array[String] = Array(a)
    
    scala> twocol.select(selectedCols.head, selectedCols.tail: _*)
    res4: org.apache.spark.sql.DataFrame = [a: int]
    

    Providing varargs to a function that requires them is covered in other SO questions. The signature of select is designed to ensure that your list of selected columns is never empty, which makes the conversion from the list of selected columns to varargs (selectedCols.head, selectedCols.tail: _*) a bit more complex.
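    To see the core of the trick without spinning up a Spark session: the negative selection itself is plain Scala collection work on the column names, and only the final select call needs Spark. A minimal runnable sketch (the column names here are illustrative, not from the original question):

```scala
object NegativeSelect extends App {
  // All column names of the (hypothetical) DataFrame.
  val allCols  = Array("a", "b", "c")
  // Columns we want to exclude.
  val dropCols = Set("b")

  // Keep every column whose name is not in dropCols.
  val keepCols = allCols.filterNot(dropCols.contains)
  println(keepCols.mkString(","))   // prints "a,c"

  // Against a real DataFrame you would then expand this to varargs:
  //   df.select(keepCols.head, keepCols.tail: _*)
}
```

    Note that since Spark 1.4 a DataFrame also has a drop method, so df.drop("b") returns a new DataFrame without that column, which avoids the head/tail varargs dance entirely.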
