This issue cost me a lot of time, and I finally found an easy solution for it. In PySpark, for the problematic column, say colA, simply re-select it under its own name prior to using df in the join:

import pyspark.sql.functions as F

df = df.select(F.col("colA").alias("colA"))

The alias gives colA a fresh internal attribute reference, which removes the ambiguity that breaks the join.
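For context, here is a minimal, self-contained sketch of the pattern (the data and column names are made up for illustration; whether the unpatched join actually errors depends on your Spark version and query lineage):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A parent DataFrame and a DataFrame derived from it -- self-joins
# like this are a common way to hit the ambiguous/missing attribute error.
df = spark.createDataFrame(
    [("vskp", 2.0), ("hyd", 1.5), ("hyd", 0.9)],
    ["city", "score"],
)
avg = df.groupBy("city").agg(F.avg("score").alias("avg_score"))

# Re-selecting the join column under the same name gives it a fresh
# internal attribute ID, so the planner no longer sees two copies of
# the original reference.
df_fixed = df.select(F.col("city").alias("city"), "score")

df_fixed.join(avg, on="city").show()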
I think this should work for Scala/Java Spark too.