Question
I am trying to follow the example given here for combining two dataframes that have no shared join key (i.e., combining by "index" as you would in a database table or pandas dataframe, a concept PySpark does not have):
My Code
from pyspark.sql.types import StructType

left_df = left_df.repartition(right_df.rdd.getNumPartitions())  # FWIW, num of partitions = 303
joined_schema = StructType(left_df.schema.fields + right_df.schema.fields)
# zip the two RDDs row by row and concatenate each pair of Rows
interim_rdd = left_df.rdd.zip(right_df.rdd).map(lambda x: x[0] + x[1])
full_data = spark.createDataFrame(interim_rdd, joined_schema)
This all seems to work fine. I am testing it out in Databricks, and I can run the "cell" above with no problem. But when I then try to save the result, I cannot, because Spark complains that the partitions do not match. I have confirmed that the number of partitions does match, and you can see above that I explicitly make sure of it. My save command:
full_data.write.parquet(my_data_path, mode="overwrite")
Error
I receive the following error:
Caused by: org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition
My Guess
I suspect the problem is that, even though I have matched the number of partitions, I do not have the same number of rows in each partition. But I do not know how to fix that: I only know how to specify the number of partitions, not how the rows are distributed across them.
Or, more specifically, I do not know how to control the partitioning when there is no column to partition by. Remember, the two dataframes have no shared column.
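A quick way to check that suspicion (just a diagnostic sketch, reusing the left_df / right_df from above) is to count how many rows land in each partition on each side; zip() needs these two lists to match element-wise, not just in length:
# rows per partition on each side
print(left_df.rdd.glom().map(len).collect())
print(right_df.rdd.glom().map(len).collect())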
How do I know that I can combine them this way, with no shared join key? In this case it is because I am trying to join model predictions back onto their input data, but I run into this situation more generally, beyond just model data + predictions.
My Questions
- Specifically in the case above, how can I properly set up the partitioning so that it works?
- How should I join two dataframes by row index?
- (I know the standard response is "you shouldn't... partitioning makes indices nonsensical", but until Spark provides ML libraries that do not force the kind of data loss I described in the link above, this will always be an issue.)
Answer 1:
You can temporarily switch to RDDs and add an index with zipWithIndex. This index can then be used as the join criterion:
# create RDDs with an additional index;
# as zipWithIndex adds the index as the second element, we have to swap
# the first and second elements
left = left_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))
right = right_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))

# join both RDDs on the index
joined = left.fullOuterJoin(right)

# restore the original columns
result = spark.createDataFrame(joined).select("_2._1.*", "_2._2.*")
The Javadoc of zipWithIndex states that
Some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition.
Depending on the nature of the original datasets, this code might not produce deterministic results.
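For completeness, here is a hedged sketch of how that recipe could be wired back into the original code from the question (left_df, right_df, and my_data_path are the names used there). Since the rows are now combined by a join on the index rather than a zip, the write no longer depends on matching per-partition element counts:
left = left_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))
right = right_df.rdd.zipWithIndex().map(lambda a: (a[1], a[0]))
# an inner join is enough when both sides are known to have the same row count
joined = left.join(right)
full_data = spark.createDataFrame(joined).select("_2._1.*", "_2._2.*")
full_data.write.parquet(my_data_path, mode="overwrite")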
Answer 2:
RDDs are a bit old hat, but here is the error explained from that perspective.
From La Trobe University (http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#zip) comes the following description of zip:
Joins two RDDs by combining the i-th of either partition with each other. The resulting RDD will consist of two-component tuples which are interpreted as key-value pairs by the methods provided by the PairRDDFunctions extension.
Note the word pair.
This means you must have the same partitioner, with the same number of partitions and the same number of key-value pairs per partition; otherwise the definition above does not hold.
This works best when reading in from files, as repartition(n) may not give the same distribution across both RDDs.
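As a minimal PySpark illustration of that contract (the RDDs here are made-up demo data, and a SparkContext sc is assumed), sc.parallelize splits both sequences into identical per-partition counts, which is exactly what zip needs:
# 6 elements into 3 partitions on both sides -> 2 per partition each, so zip is legal
rdd_a = sc.parallelize([1, 2, 3, 4, 5, 6], 3)
rdd_b = sc.parallelize(["a", "b", "c", "d", "e", "f"], 3)
print(rdd_a.zip(rdd_b).collect())  # [(1, 'a'), (2, 'b'), ...]
If either side is later repartitioned on its own, the per-partition counts can drift apart and the same zip fails with the error from the question.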
A little trick to get around that is to use zipWithIndex to supply the k of the (k, v) pair, like so (in Scala, as this is not a PySpark-specific aspect):
val rddA = sc.parallelize(Seq(
  ("ICCH 1", 10.0), ("ICCH 2", 10.0), ("ICCH 4", 100.0), ("ICCH 5", 100.0)
))
val rddAA = rddA.zipWithIndex().map(x => (x._2, x._1)).repartition(5)

// rddB has the same number of elements as rddA; zip requires matching counts per partition
val rddB = sc.parallelize(Seq(
  (10.0, "A"), (64.0, "B"), (39.0, "A"), (9.0, "C")
))
val rddBB = rddB.zipWithIndex().map(x => (x._2, x._1)).repartition(5)

val zippedRDD = (rddAA zip rddBB).map{ case ((id, x), (y, c)) => (id, x, y, c) }
zippedRDD.collect
The repartition(n) then seems to work, since the key k is of the same type in both RDDs.
But you must have the same number of elements per partition. It is what it is, but it makes sense.
Source: https://stackoverflow.com/questions/63727512/unable-to-write-pyspark-dataframe-created-from-two-zipped-dataframes