remove last few characters in PySpark dataframe column

一向 2020-12-06 12:03

I have a PySpark DataFrame. How can I chop off/remove the last 5 characters from the name column below?

from pyspark.sql.functions import substring  # note: there is no `subs` function; `substring` is the correct name


        
4 Answers
  • 2020-12-06 12:30

    You can use the expr function:

    >>> from pyspark.sql.functions import expr
    >>> df = df.withColumn("flower",expr("substring(name, 1, length(name)-5)"))
    >>> df.show()
    +--------------+----+---------+
    |          name|year|   flower|
    +--------------+----+---------+
    |     rose_2012|2012|     rose|
    |  jasmine_2013|2013|  jasmine|
    |     lily_2014|2014|     lily|
    | daffodil_2017|2017| daffodil|
    |sunflower_2016|2016|sunflower|
    +--------------+----+---------+
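
    As a quick sanity check of the index arithmetic (SQL substring is 1-based and its third argument is a length), the same trim can be reproduced in plain Python, no Spark session needed:

    ```python
    # substring(name, 1, length(name) - 5) keeps all but the last 5
    # characters, i.e. it drops the "_YYYY" suffix.
    names = ["rose_2012", "jasmine_2013", "sunflower_2016"]
    flowers = [s[: len(s) - 5] for s in names]  # equivalent to s[:-5]
    print(flowers)  # ['rose', 'jasmine', 'sunflower']
    ```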
    
  • 2020-12-06 12:36

    Since we want to extract the alphabetical characters in this case, a regex will also work:

    from pyspark.sql.functions import regexp_extract 
    df = df.withColumn("flower",regexp_extract(df['name'], '[a-zA-Z]+',0))
    df.show()
    +--------------+----+---------+
    |          name|year|   flower|
    +--------------+----+---------+
    |     rose_2012|2012|     rose|
    |  jasmine_2013|2013|  jasmine|
    |     lily_2014|2014|     lily|
    | daffodil_2017|2017| daffodil|
    |sunflower_2016|2016|sunflower|
    +--------------+----+---------+
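
    The pattern [a-zA-Z]+ grabs the first run of letters, so it stops at the underscore; the behaviour is easy to verify with Python's own re module (a plain-Python sketch, no Spark needed):

    ```python
    import re

    names = ["rose_2012", "daffodil_2017"]
    # group 0 of regexp_extract corresponds to the whole first match
    flowers = [re.search(r"[a-zA-Z]+", s).group(0) for s in names]
    print(flowers)  # ['rose', 'daffodil']
    ```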
    
  • 2020-12-06 12:49

    You can use the split function. This code does what you want:

    import pyspark.sql.functions as f
    
    newDF = df.withColumn("year", f.split(df['name'], '_')[1]).\
               withColumn("flower", f.split(df['name'], '_')[0])
    
    newDF.show()
    
    +--------------+----+---------+
    |          name|year|   flower|
    +--------------+----+---------+
    |     rose_2012|2012|     rose|
    |  jasmine_2013|2013|  jasmine|
    |     lily_2014|2014|     lily|
    | daffodil_2017|2017| daffodil|
    |sunflower_2016|2016|sunflower|
    +--------------+----+---------+
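
    The underscore is not a regex metacharacter, so escaping it is unnecessary. In plain Python the same split-and-index logic looks like this (a sketch only; Spark's split takes a regex pattern, while str.split takes a literal separator):

    ```python
    name = "lily_2014"
    parts = name.split("_")    # index 0 -> flower, index 1 -> year
    print(parts[0], parts[1])  # lily 2014
    ```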
    
  • 2020-12-06 12:52

    A little tweak to avoid hard-coding the length: identify it dynamically from the position of the underscore ('_') using the instr function.

    from pyspark.sql.functions import expr

    df = spark.createDataFrame([('rose_2012',),('jasmine_2013',),('lily_2014',),('daffodil_2017',),('sunflower_2016',)],['name'])
    
    
    df.withColumn("flower",expr("substr(name, 1, (instr(name,'_')-1) )")).\
            withColumn("year",expr("substr(name, (instr(name,'_')+1),length(name))")).show()
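
    instr is 1-based while Python's str.find is 0-based, so the index arithmetic above can be double-checked in plain Python (no Spark needed):

    ```python
    name = "daffodil_2017"
    pos = name.find("_") + 1  # instr(name, '_'): 1-based position of '_'
    flower = name[: pos - 1]  # substr(name, 1, instr(name,'_') - 1)
    year = name[pos:]         # substr(name, instr(name,'_') + 1, length(name))
    print(flower, year)       # daffodil 2017
    ```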
    