SPARK 2.2.2 - Joining multiple RDDs giving out of memory exception. Resulting RDD has 124 columns. What should be the optimal joining method?


Question


I have a file which has multiple values for each phone number. For example:

phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 attribute3 
123445   delhi  airtel   1.0        info1      info2      info3      1.1        info4      info5      info6
987654   bhopal idea     1.1        info1      info2      info3      1.4        info4      info5      info6
123445   delhi  airtel   1.3        info1      info2      info3      1.0        info4      info5      info6

My expected output is: for each phone number, select the minimum P1 and its corresponding attribute values (and likewise for each other priority).

As the example above shows, for phone number 123445 the P1 in line 1 is lower than the P1 in line 3 (1.0 < 1.3), so I want to select attributes 1, 2 and 3 from line 1; and since the P2 in line 3 is lower than the P2 in line 1 (1.0 < 1.1), I want to select the P2 attribute values from line 3.

Here is what I want in a tabular format:

phone_no circle operator priority1 attribute1 attribute2 attribute3 priority2 attribute1 attribute2 attribute3 
123445   delhi  airtel   1.0        info1      info2      info3      1.0        info4      info5      info6
987654   bhopal idea     1.1        info1      info2      info3      1.4        info4      info5      info6

I have 25 different priority values, and each priority value has 4 different attributes, so I have around 125 columns in total.

What I have tried so far:

  1. Create a DataFrame keyed by phone number containing the minimum of each priority value.
  2. Create another DataFrame with min(Priority1) along with its corresponding attributes for each phone number.
  3. Create another DataFrame with min(Priority2) along with its corresponding attributes for each phone number.
  4. Join these DataFrames on phone number to get the complete information and save the result to disk (a rough sketch of this flow is shown below).
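
A rough sketch of those four steps for just two priorities (the file path, the output path and the attribute4..attribute6 column names are placeholders, not my real schema):

import pyspark.sql.functions as F

# Placeholder load; the real file has ~125 columns as described above.
df = spark.read.csv('phones.tsv', sep='\t', header=True, inferSchema=True)

# Step 1: minimum of each priority per phone number.
mins = df.groupBy('phone_no').agg(
    F.min('priority1').alias('priority1'),
    F.min('priority2').alias('priority2'),
)

# Steps 2-3: join back to pick up the attributes belonging to each minimum.
p1 = (df.join(mins.select('phone_no', 'priority1'), ['phone_no', 'priority1'])
        .select('phone_no', 'circle', 'operator',
                'priority1', 'attribute1', 'attribute2', 'attribute3'))
p2 = (df.join(mins.select('phone_no', 'priority2'), ['phone_no', 'priority2'])
        .select('phone_no', 'priority2', 'attribute4', 'attribute5', 'attribute6'))

# Step 4: join the per-priority results on phone number and save.
result = p1.join(p2, 'phone_no')
result.write.mode('overwrite').parquet('/tmp/output')  # placeholder path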

The problem with my approach is that it does not scale to the number of columns I have; the joins throw an out-of-memory exception. Please suggest a better approach for this problem.

EDIT 1: Here is the pastebin link of what I have done: https://pastebin.com/ps4f1KSh


Answer 1:


I would probably use window functions:

from pyspark.sql.window import Window
import pyspark.sql.functions as spf

df = spark.createDataFrame([
    (123, 1, 'a', 2, 'c'),
    (123, 2, 'b', 1, 'd'),
    (456, 3, 'e', 4, 'f')
], ['phone', 'priority1', 'attribute1', 'priority2', 'attribute2'])

# For each priority column, take the attribute from the row with the smallest
# priority within the phone's partition.
w = Window.partitionBy('phone')
df2 = (
    df
    .select(
        'phone',
        spf.first('attribute1').over(w.orderBy('priority1')).alias('attribute1'),
        spf.first('attribute2').over(w.orderBy('priority2')).alias('attribute2'),
    )
)

# Collapse to one row per phone number.
(
    df2
    .groupby('phone')
    .agg(*[spf.first(c).alias(c) for c in df2.columns if c != 'phone'])
    .toPandas()
)

Gives:

   phone attribute1 attribute2
0    123          a          d
1    456          e          f

It's an exercise for the reader to template this out (e.g. using list comprehensions) to generalize to all attributes and priorities.
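
For instance, a rough sketch of that templating, assuming the wide DataFrame is called df and its columns are named priority1..priority25 with per-priority attribute columns like p1_attribute1..p1_attribute3 (adjust to the real schema):

from pyspark.sql.window import Window
import pyspark.sql.functions as spf

# df is the full ~125-column DataFrame loaded from the input file.
n_priorities = 25
n_attributes = 3  # per priority; the question mentions up to 4

w = Window.partitionBy('phone_no')

exprs = []
for i in range(1, n_priorities + 1):
    ordered = w.orderBy('priority{}'.format(i))
    # Minimum priority per phone, plus the attributes taken from that same row.
    exprs.append(spf.first('priority{}'.format(i)).over(ordered)
                 .alias('priority{}'.format(i)))
    for j in range(1, n_attributes + 1):
        c = 'p{}_attribute{}'.format(i, j)
        exprs.append(spf.first(c).over(ordered).alias(c))

df2 = df.select('phone_no', 'circle', 'operator', *exprs)

result = df2.groupby('phone_no').agg(
    *[spf.first(c).alias(c) for c in df2.columns if c != 'phone_no']
)

This computes every per-priority minimum in a single pass over the one wide DataFrame, instead of building and joining 25 intermediate DataFrames.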



Source: https://stackoverflow.com/questions/53120335/spark-2-2-2-joining-multiple-rdds-giving-out-of-memory-excepton-resulting-rdd
