pyspark split a column into multiple columns without pandas

时光取名叫无心 2021-01-02 22:40

My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to change 'df_test' by splitting its date column (values like '14-Jul-15') into separate day, month, and year columns.

2 Answers
  • 2021-01-02 22:56

    Spark >= 2.2

    You can skip unix_timestamp and cast, and use to_date or to_timestamp instead:

    from pyspark.sql.functions import to_date, to_timestamp
    
    df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
    ## +---+----------+
    ## | id|      date|
    ## +---+----------+
    ## |  1|2015-07-14|
    ## |  2|2015-06-14|
    ## |  3|2015-10-11|
    ## +---+----------+
    
    
    df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
    ## +---+-------------------+
    ## | id|               date|
    ## +---+-------------------+
    ## |  1|2015-07-14 00:00:00|
    ## |  2|2015-06-14 00:00:00|
    ## |  3|2015-10-11 00:00:00|
    ## +---+-------------------+
    

    and then apply other datetime functions shown below.
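
    For example, a quick sketch chaining to_date with dayofmonth, date_format, and year (the same built-ins used in the Spark < 2.2 section below):

    from pyspark.sql.functions import to_date, dayofmonth, date_format, year
    
    (df_test
        .withColumn("date", to_date("date", "dd-MMM-yy"))
        .withColumn("day", dayofmonth("date"))
        .withColumn("month", date_format("date", "MMM"))
        .withColumn("year", year("date"))
        .show())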

    Spark < 2.2

    It is not possible to derive multiple top-level columns in a single access. You can use structs or collection types with a UDF like this:

    from pyspark.sql.types import StringType, StructType, StructField
    from pyspark.sql.functions import udf, col
    
    # The UDF returns a single struct column with three string fields
    schema = StructType([
      StructField("day", StringType(), True),
      StructField("month", StringType(), True),
      StructField("year", StringType(), True)
    ])
    
    def split_date_(s):
        try:
            d, m, y = s.split("-")
            return d, m, y
        except (AttributeError, ValueError):
            # Malformed or missing dates become a null struct
            return None
    
    split_date = udf(split_date_, schema)
    
    transformed = df_test.withColumn("date", split_date(col("date")))
    transformed.printSchema()
    
    ## root
    ##  |-- id: long (nullable = true)
    ##  |-- date: struct (nullable = true)
    ##  |    |-- day: string (nullable = true)
    ##  |    |-- month: string (nullable = true)
    ##  |    |-- year: string (nullable = true)
    

    but this is not only quite verbose in PySpark, it is also expensive: a Python UDF forces rows to be serialized between the JVM and the Python workers.
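
    If you do need separate top-level columns from that struct, a minimal sketch is to expand it afterwards with select:

    # Expand the struct fields into top-level day, month, year columns
    transformed.select("id", "date.*").show()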

    For date-based transformations you can simply use built-in functions:

    from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format
    
    transformed = (df_test
        .withColumn("ts",
            unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
        .withColumn("day", dayofmonth(col("ts")).cast("string"))
        .withColumn("month", date_format(col("ts"), "MMM"))
        .withColumn("year", year(col("ts")).cast("string"))
        .drop("ts"))
    

    Similarly, you could use regexp_extract to split the date string.
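
    For instance, a sketch using regexp_extract, with a pattern assumed to match the dd-MMM-yy strings:

    from pyspark.sql.functions import regexp_extract
    
    pattern = r"^(\d{2})-([A-Za-z]{3})-(\d{2})$"
    transformed = (df_test
        .withColumn("day", regexp_extract("date", pattern, 1))
        .withColumn("month", regexp_extract("date", pattern, 2))
        .withColumn("year", regexp_extract("date", pattern, 3)))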

    See also Derive multiple columns from a single column in a Spark DataFrame

    Note:

    If you use a version not patched against SPARK-11724, this will require a correction after unix_timestamp(...) and before cast("timestamp").

  • 2021-01-02 23:00

    The solution here is to use the pyspark.sql.functions.split() function.

    from pyspark.sql.functions import split
    
    df = sqlContext.createDataFrame([
        (1, '14-Jul-15'),
        (2, '14-Jun-15'),
        (3, '11-Oct-15'),
    ], ('id', 'date'))
    
    # split() takes a regex pattern; '-' breaks 'dd-MMM-yy' into three parts
    split_col = split(df['date'], '-')
    df = df.withColumn('day', split_col.getItem(0))
    df = df.withColumn('month', split_col.getItem(1))
    df = df.withColumn('year', split_col.getItem(2))
    df = df.drop('date')
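
    A quick check of the result (output derived from the three sample rows above):

    df.show()
    ## +---+---+-----+----+
    ## | id|day|month|year|
    ## +---+---+-----+----+
    ## |  1| 14|  Jul|  15|
    ## |  2| 14|  Jun|  15|
    ## |  3| 11|  Oct|  15|
    ## +---+---+-----+----+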
    