Replace empty strings with None/null values in DataFrame

野趣味 2020-12-13 10:01

I have a Spark 1.5.0 DataFrame with a mix of nulls and empty strings in the same column. I want to convert all empty strings, in all columns, to null (None, in Python).
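For reference, a minimal DataFrame like the one the answers below operate on could be built as follows (a sketch; the data is inferred from the sample output in the answers, and the name testDF matches the one used there):

# Spark 1.x: sqlContext is the usual entry point; on Spark 2.x+ use spark.createDataFrame instead.
testDF = sqlContext.createDataFrame(
    [("foo", 1), ("", 2), (None, None)],  # col1 mixes a value, an empty string and a null
    ["col1", "col2"]
)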

5 Answers
  • 2020-12-13 10:11

    This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, length, lit, when}
    import org.apache.spark.sql.types.DataTypes

    def emptyStringsToNone(df: DataFrame): DataFrame = {
      // Fold over the schema: for every StringType column, rewrite zero-length
      // values to null; leave all other columns untouched.
      df.schema.foldLeft(df)(
        (current, field) =>
          field.dataType match {
            case DataTypes.StringType =>
              current.withColumn(
                field.name,
                when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
              )
            case _ => current
          }
      )
    }
    
  • 2020-12-13 10:18

    My solution is better than the others I've seen so far because it handles as many fields as you want; see the small function below:

      import org.apache.spark.sql.DataFrame
      import org.apache.spark.sql.functions.{col, length, lit, when}
      import org.apache.spark.sql.types.StringType

      // Replace empty Strings with null values
      private def setEmptyToNull(df: DataFrame): DataFrame = {
        val exprs = df.schema.map { f =>
          f.dataType match {
            // String columns: map zero-length values to null, keep everything else as-is
            case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
            case _ => col(f.name)
          }
        }

        df.select(exprs: _*)
      }
    

    You can easily rewrite the function above in Python; a rough sketch follows below.

    I learned this trick from @liancheng
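
    For instance, a Python equivalent might look like this (a sketch; the helper name set_empty_to_null is arbitrary and the pyspark imports are assumptions, not part of the original answer):

    from pyspark.sql.functions import col, when
    from pyspark.sql.types import StringType

    def set_empty_to_null(df):
        # For string columns, keep the value only when it is non-empty; when()
        # with no otherwise() yields null, so empty strings (and nulls) map to null.
        # Non-string columns are passed through unchanged.
        exprs = [
            when(col(f.name) != "", col(f.name)).alias(f.name)
            if isinstance(f.dataType, StringType)
            else col(f.name)
            for f in df.schema.fields
        ]
        return df.select(*exprs)

    # usage: cleaned = set_empty_to_null(testDF)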

  • 2020-12-13 10:20

    It is as simple as this:

    from pyspark.sql.functions import col, when
    
    def blank_as_null(x):
        # Keep the value only where it is a non-empty string; otherwise
        # (empty string or existing null) produce null.
        return when(col(x) != "", col(x)).otherwise(None)
    
    dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
    
    dfWithEmptyReplaced.show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |null|   2|
    ## |null|null|
    ## +----+----+
    
    dfWithEmptyReplaced.na.drop().show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## +----+----+
    

    If you want to fill multiple columns you can for example reduce:

    from functools import reduce  # reduce is no longer a builtin on Python 3

    to_convert = set([...]) # Some set of columns

    reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
    

    or use comprehension:

    exprs = [
        blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]
    
    testDF.select(*exprs)
    

    If you want to specifically operate on string fields please check the answer by robin-loxley.

  • Simply building on zero323's and soulmachine's answers: to convert all StringType fields, first collect their names, then apply the conversion (see the sketch after the snippet).

    from pyspark.sql.types import StringType

    # Collect the names of every StringType column in the schema.
    string_fields = []
    for f in test_df.schema.fields:
        if isinstance(f.dataType, StringType):
            string_fields.append(f.name)
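
    One way to then convert just those fields is to reuse blank_as_null from zero323's answer (a sketch; converted_df is an arbitrary name and blank_as_null is assumed to be defined as shown earlier):

    from functools import reduce  # reduce is no longer a builtin on Python 3

    # Rewrite each collected string column in turn.
    converted_df = reduce(
        lambda df, name: df.withColumn(name, blank_as_null(name)),
        string_fields,
        test_df,
    )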
    
  • 2020-12-13 10:35

    UDFs are not terribly efficient. The correct way to do this using a built-in method is:

    from pyspark.sql.functions import col, when

    df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
    