val rdd = sc.parallelize(Seq((\"vskp\", Array(2.0, 1.0, 2.1, 5.4)),(\"hyd\",Array(1.5, 0.5, 0.9, 3.7)),(\"hyd\", Array(1.5, 0.5, 0.9, 3.2)),(\"tvm\", Array(8.0, 2.9,
It will work if you do the below.
suppose you have a dataframe. df1 and if you want to cross join the same dataframe, you can use the below
df1.toDF("ColA","ColB").as("f_df").join(df1.toDF("ColA","ColB").as("t_df"),
$"f_df.pcmdty_id" ===
$"t_df.assctd_pcmdty_id").select($"f_df.pcmdty_id",$"f_df.assctd_pcmdty_id")
[TLDR]
Break the AttributeReference shared between columns in parent DataFrame and derived DataFrame by writing the intermediate DataFrame to file system and reading it again.
Ex:
val df1 = spark.read.parquet("file1")
df1.createOrReplaceTempView("df1")
val df2 = spark.read.parquet("file2")
df2.createOrReplaceTempView("df2")
val df12 = spark.sql("""SELECT * FROM df1 as d1 JOIN df2 as d2 ON d1.a = d2.b""")
df12.createOrReplaceTempView("df12")
val df12_ = spark.sql(""" -- some transformation -- """)
df12_.createOrReplaceTempView("df12_")
val df3 = spark.read.parquet("file3")
df3.createOrReplaceTempView("df3")
val df123 = spark.sql("""SELECT * FROM df12_ as d12_ JOIN df3 as d3 ON d12_.a = d3.c""")
df123.createOrReplaceTempView("df123")
Now joining with top level DataFrame will lead to "unresolved attribute error"
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Solution: d123.a and d1.a share same AttributeReference break it by writing intermediate table df123 to file system and reading again. now df123write.a and d1.a does not share AttributeReference
val df123 = spark.sql("""SELECT * FROM df12 as d12 JOIN df3 as d3 ON d12.a = d3.c""")
df123.createOrReplaceTempView("df123")
df123.write.parquet("df123.par")
val df123write = spark.read.parquet("df123.par")
spark.catalog.dropTempView("df123")
df123write.createOrReplaceTempView("df123")
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Long story:
We had complex ETLs with transformation and self joins of DataFrames, performed at multiple levels. We faced "unresolved attribute" error frequently and we solved it by selecting required attribute and performing join on the top level table instead of directly joining with the top level table this solved the issue temporarily but when we applied some more transformation on these DataFrame and joined with any top level DataFrames, "unresolved attribute" error raised its ugly head again.
This was happening because DataFrames in bottom level were sharing the same AttributeReference with top level DataFrames from which they were derived [more details]
So we broke this reference sharing by writing just 1 intermediate transformed DataFrame and reading it again and continuing with our ETL. This broke sharing AttributeReference between bottom DataFrames and Top DataFrames and we never again faced "unresolved attribute" error.
This worked for us because as we moved from top level DataFrame to bottom performing transformation and join our data shrank than initial DataFrames that we started, it also improved our performance as data size was less and spark didn't have to traverse back the DAG all the way to the last persisted DataFrame.
For java developpers, try to call this method:
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
List<Column> filterColumns = new ArrayList<>();
List<String> filterColumnsNames = new ArrayList<>();
scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
while (it.hasNext()) {
String columnName = it.next().name();
filterColumns.add(ds.col(columnName));
filterColumnsNames.add(columnName);
}
ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq()).toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
return ds;
}
on both datasets just before the joining, it clone the datasets into new ones:
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
Dataset<Row> join = df1.join(df2, col("column_name"));
// if it didn't work try this
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
In my case, Checkpointing the original dataframe fixed the issue.
From my experience, we have 2 solutions 1) clone DF 2) rename columns that have ambiguity before joining tables. (don't forget to drop duplicated join key)
Personally I prefer the second method, because cloning DF in the first method takes time, especially if data size is big.
just rename your columns and put the same name. in pyspark: for i in df.columns: df = df.withColumnRenamed(i,i)