PySpark Split Columns

死守一世寂寞 2020-12-19 13:26
from pyspark.sql import Row, functions as F
row = Row("UK_1", "UK_2", "Date", "Cat", 'Combined')
agg = ''
agg = 'Cat'
tdf = (sc.parallelize
    ([
                 


        
1 answer
  • 2020-12-19 13:38

    The pattern argument of split is a regular expression, and ^ is a regex anchor that matches the beginning of the string. To match a literal ^, you need to escape it:

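    # Escape ^ so it is matched as a literal caret, not as the start-of-string anchor
    cols = F.split(tdf['Combined'], r'\^')
    # The parts before and after the caret become the new columns
    tdf = tdf.withColumn('column1', cols.getItem(0))
    tdf = tdf.withColumn('column2', cols.getItem(1))
    tdf.show(truncate=False)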
    
    +----+----+------------+---+-------------+-------+-------+
    |UK_1|UK_2|Date        |Cat|Combined     |column1|column2|
    +----+----+------------+---+-------------+-------+-------+
    |1   |1   |12/10/2016  |A  |Water^World  |Water  |World  |
    |1   |2   |null        |A  |Sea^Born     |Sea    |Born   |
    |2   |1   |14/10/2016  |B  |Germ^Any     |Germ   |Any    |
    |3   |3   |!~2016/2/276|B  |Fin^Land     |Fin    |Land   |
    |null|1   |26/09/2016  |A  |South^Korea  |South  |Korea  |
    |1   |1   |12/10/2016  |A  |North^America|North  |America|
    |1   |2   |null        |A  |South^America|South  |America|
    |2   |1   |14/10/2016  |B  |New^Zealand  |New    |Zealand|
    |null|null|!~2016/2/276|B  |South^Africa |South  |Africa |
    |null|1   |26/09/2016  |A  |Saudi^Arabia |Saudi  |Arabia |
    +----+----+------------+---+-------------+-------+-------+
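
As a rough illustration of the point above, the anchor-versus-literal distinction can be reproduced with Python's standard re module. This is only an analogy: Spark evaluates the pattern with Java's regex engine, and the sketch assumes the two engines read ^ and \^ the same way, which holds for this simple case. The sample value is taken from the first row of the table; the variable names are made up for the example.

```python
import re

# Sample value from the first row of the Combined column.
value = "Water^World"

# Unescaped, ^ is an anchor: it matches the empty string at position 0,
# not the caret character in the middle of the value.
anchor_match = re.search(r"^", value)
print(anchor_match.span())   # (0, 0)

# Escaped, \^ matches the literal caret character at index 5.
literal_match = re.search(r"\^", value)
print(literal_match.span())  # (5, 6)

# Splitting on the escaped pattern yields the two halves.
parts = re.split(r"\^", value)
print(parts)                 # ['Water', 'World']

# re.escape builds the escaped pattern when the delimiter is held in a variable.
escaped = re.escape("^")
print(escaped)               # \^
```

When the delimiter is not known in advance, escaping it programmatically before passing it to a regex-based split avoids this class of bug for other metacharacters such as | or . as well.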
    