Question
I have a PySpark DataFrame and I would like to join 3 columns.
id | column_1 | column_2 | column_3
-----------------------------------
 1 |       12 |       34 |       67
 2 |       45 |       78 |       90
 3 |       23 |       93 |       56
I want to join the 3 columns column_1, column_2 and column_3 into a single column, adding "-" between their values.
Expected result:
id | column_1 | column_2 | column_3 | column_join
--------------------------------------------------
 1 |       12 |       34 |       67 | 12-34-67
 2 |       45 |       78 |       90 | 45-78-90
 3 |       23 |       93 |       56 | 23-93-56
How can I do this in PySpark? Thank you.
Answer 1:
It's pretty simple:
from pyspark.sql.functions import col, concat, lit
df = df.withColumn("column_join", concat(col("column_1"), lit("-"), col("column_2"), lit("-"), col("column_3")))
Use concat to concatenate all the columns with the "-" separator, for which you will need to use lit.
If it doesn't work directly, you can use cast to change the column types to string, e.g. col("column_1").cast("string").
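For example, a minimal sketch of the cast variant, assuming the same df and column names as above:

from pyspark.sql.functions import col, concat, lit

# Cast each numeric column to string explicitly before concatenating
df = df.withColumn(
    "column_join",
    concat(
        col("column_1").cast("string"), lit("-"),
        col("column_2").cast("string"), lit("-"),
        col("column_3").cast("string"),
    ),
)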
UPDATE:
Or you can use a more dynamic approach with the built-in function concat_ws:
pyspark.sql.functions.concat_ws(sep, *cols)
Concatenates multiple input string columns together into a single string column, using the given separator.
>>> df = spark.createDataFrame([('abcd','123')], ['s', 'd'])
>>> df.select(concat_ws('-', df.s, df.d).alias('s')).collect()
[Row(s=u'abcd-123')]
Code:
from pyspark.sql.functions import col, concat_ws
concat_columns = ["column_1", "column_2", "column_3"]
df = df.withColumn("column_join", concat_ws("-", *[F.col(x) for x in concat_columns]))
Answer 2:
Here is a generic/dynamic way of doing this, instead of concatenating the columns manually. All we need to do is specify the columns that we want to concatenate.
# Importing requisite functions.
from pyspark.sql.functions import col, udf
# Creating the DataFrame
df = spark.createDataFrame([(1,12,34,67),(2,45,78,90),(3,23,93,56)],['id','column_1','column_2','column_3'])
Now, specify the list of columns we want to concatenate, separated by "-".
list_of_columns_to_join = ['column_1','column_2','column_3']
Finally, create a UDF. Note that UDF-based solutions are implicitly slower than the built-in functions.
# Join the stringified values of all the passed-in columns with "-"
def concat_cols(*list_cols):
    return '-'.join(str(i) for i in list_cols)

# Wrap the Python function as a Spark UDF (the default return type is string)
concat_cols = udf(concat_cols)

df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))
df.show()
+---+--------+--------+--------+-----------+
| id|column_1|column_2|column_3|column_join|
+---+--------+--------+--------+-----------+
| 1| 12| 34| 67| 12-34-67|
| 2| 45| 78| 90| 45-78-90|
| 3| 23| 93| 56| 23-93-56|
+---+--------+--------+--------+-----------+
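If you do go the UDF route, one optional refinement (a sketch, not part of the original answer) is to declare the return type explicitly, which makes the schema of the new column clear:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same logic as above, but with an explicit StringType return type
concat_cols = udf(lambda *cols: '-'.join(str(c) for c in cols), StringType())
df = df.withColumn('column_join', concat_cols(*list_of_columns_to_join))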
Source: https://stackoverflow.com/questions/59032577/how-to-concatenate-multiple-columns-in-pyspark-with-a-separator