Join two data frames, select all columns from one and some columns from the other

Pablo Estevez

Not sure if the most efficient way, but this worked for me:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.' + xx) for xx in df1.columns] + [col('b.other1'), col('b.other2')])

The trick is in:

[col('a.' + xx) for xx in df1.columns] : all columns of df1 (aliased as 'a')

[col('b.other1'), col('b.other2')] : a chosen subset of columns from df2 (aliased as 'b')
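
Here is a minimal, self-contained sketch of this pattern; the column names (id, name, other1, other2) and the sample rows are hypothetical, and a local SparkSession is assumed:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; 'id' is the join key.
df1 = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'name'])
df2 = spark.createDataFrame([(1, 'p', 'q'), (2, 'r', 's')], ['id', 'other1', 'other2'])

# All of df1's columns plus other1 and other2 from df2.
result = (df1.alias('a')
          .join(df2.alias('b'), col('b.id') == col('a.id'))
          .select([col('a.' + xx) for xx in df1.columns]
                  + [col('b.other1'), col('b.other2')]))
result.show()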

The asterisk (*) also works with an alias. For example:

df1 = df.alias('df1')
df2 = df.alias('df2')

# Self-join on 'id'; 'df1.*' expands to every column on the df1 side.
df1.join(df2, df1.id == df2.id).select('df1.*')
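
The same string syntax can also pull specific columns from the other side; a sketch, assuming df has a hypothetical column named other:

df1.join(df2, df1.id == df2.id).select('df1.*', 'df2.other')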

Without using an alias:

df1.join(df2, df1.id == df2.id).select(df1["*"], df2["other"])

Here is a solution that does not require a SQL context and preserves the DataFrame metadata.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])

c = a.join(b, a.a_id == b.b_id)

Then, c.show() yields:

+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
|   a|  foo|   p1|   a|
|   b|  hem|   p2|   b|
|   c|  haw|   p3|   c|
+----+-----+-----+----+

To drop the duplicate b_id column:

c = a.join(b, a.a_id == b.b_id).drop(b.b_id)
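
To match the question's goal of keeping all of a's columns plus some of b's, the same join can be combined with a select; a short sketch reusing the a and b DataFrames defined above:

# All columns of a plus just 'other' from b;
# the result has columns a_id, extra, other.
d = a.join(b, a.a_id == b.b_id).select(a['*'], b['other'])
d.show()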

I got an error, 'a not found', when I wrote a.columns instead of df1.columns in the suggested code:

from pyspark.sql.functions import col

df1.alias('a').join(df2.alias('b'), col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'), col('b.other2')])

Changing a.columns to df1.columns fixed it: the alias 'a' exists only inside Spark's analyzer, so there is no Python variable named a to read .columns from.
