How to “negative select” columns in Spark's DataFrame

野的像风 2020-12-15 05:35

I can't figure it out, but I guess it's simple. I have a Spark DataFrame df with columns "A", "B" and "C", and an Array containing the names of its columns. How can I select every column except some of them, e.g. everything but "B"?

9 Answers
  • 2020-12-15 05:37

    // selectWithout lets you specify which columns to omit
    // (note: selectWithout is not part of the core Spark DataFrame API):

    df.selectWithout("B")
    
  • 2020-12-15 05:40

    You were almost there: just map the filtered array to col and unpack the sequence with : _*:

    import org.apache.spark.sql.functions.col

    df.select(column_names.filter(_ != "B").map(col): _*)
    
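    The same filter-then-select idea carries over to PySpark. The column-list manipulation is ordinary Python, so it can be sketched on a plain list (the names below are illustrative stand-ins for df.columns):

```python
# Filter-then-select sketched on a plain list of column names.
# With a real DataFrame you would then write: df.select(*keep)
column_names = ["A", "B", "C"]  # stand-in for df.columns
keep = [c for c in column_names if c != "B"]
print(keep)  # ['A', 'C']
```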
  • 2020-12-15 05:46

    Since Spark 1.4 you can use the drop method:

    Scala:

    case class Point(x: Int, y: Int)
    val df = sqlContext.createDataFrame(Point(0, 0) :: Point(1, 2) :: Nil)
    df.drop("y")
    

    Python:

    df = sc.parallelize([(0, 0), (1, 2)]).toDF(["x", "y"])
    df.drop("y")
    ## DataFrame[x: bigint]
    
  • 2020-12-15 05:46

    For Spark v1.4 and higher, use drop(*cols):

        Returns a new DataFrame without the specified column(s).

    Example:

    df.drop('age').collect()
    

    For Spark v2.3 and higher, you can also use colRegex(colName):

    Selects column based on the column name specified as a regex and returns it as Column.

    Example:

    df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], ["Col1", "Col2"])
    df.select(df.colRegex("`(Col1)?+.+`")).show()
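    The `(Col1)?+.+` pattern works via a possessive quantifier (`?+`): once `Col1` is consumed it is never given back, so `.+` has nothing left to match for the name Col1 itself, and that column is excluded. The same exclusion effect can be illustrated in plain Python with a negative lookahead (a different regex technique from the possessive form Spark's example uses):

```python
import re

# Negative-lookahead regex that keeps every name except exactly "Col1".
# This only illustrates the exclusion idea; Spark's example achieves it
# with the possessive form `(Col1)?+.+` (Java regex syntax).
pattern = re.compile(r"(?!Col1$).+")
cols = ["Col1", "Col2", "Col12"]
selected = [c for c in cols if pattern.fullmatch(c)]
print(selected)  # ['Col2', 'Col12']
```

    Note that, like Spark's possessive pattern, this keeps names such as Col12 that merely start with Col1.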
    

    Reference: colRegex, drop


    For older versions of Spark, take the list of columns in the DataFrame, remove the columns you want to drop (for example with set operations), and then use select with the resulting list.

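    The pre-1.4 route described above amounts to ordinary list arithmetic; a minimal sketch (column names are illustrative, and with a real DataFrame you would finish with df.select(*keep_cols)):

```python
# Remove unwanted names from the column list, preserving column order,
# then select the remainder.
all_cols = ["col1", "col7", "col12", "col121"]  # stand-in for df.columns
drop_cols = {"col7", "col121"}                  # columns to omit
keep_cols = [c for c in all_cols if c not in drop_cols]
print(keep_cols)  # ['col1', 'col12']
```

    Using a set for drop_cols keeps the membership test O(1) per column while the list comprehension preserves the DataFrame's column order.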
  • 2020-12-15 05:48

    This will become possible via [SPARK-12139] REGEX Column Specification for Hive Queries:

    https://issues.apache.org/jira/browse/SPARK-12139

  • 2020-12-15 05:49

    I had the same problem and solved it this way (oaffdf is a DataFrame):

    val dropColNames = Seq("col7","col121")
    val featColNames = oaffdf.columns.diff(dropColNames)
    val featCols = featColNames.map(cn => org.apache.spark.sql.functions.col(cn))
    val featsdf = oaffdf.select(featCols: _*)
    

    https://forums.databricks.com/questions/2808/select-dataframe-columns-from-a-sequence-of-string.html
