Is there a built-in function to compare RDDs on specific criteria, or is it better to write a UDF?

Question

How do I count the occurrences of the elements of a child RDD in a parent RDD?

Say,

I have two RDDs

Parent RDD -

['2 3 5']
['4 5 7']
['5 4 2 3']

Child RDD

['2 3','5 3','4 7','5 7','5 3','2 3']

I need something like -

[['2 3',2],['5 3',2],['4 7',1],['5 7',1],['5 3',2] ...]

It's actually finding the frequent itemset candidates from the parent set.

The child RDD can initially contain either string elements or lists, i.e.

['1 2','2 3'] or [[1,2],[2,3]]

since I can choose whichever data structure fits best.

Question -

Are there built-in functions or transformations that could do something similar to what I am trying to achieve with these two RDDs?
Or do I need to write a UDF that parses each element of the child RDD and compares it to the parent? I have a lot of data, so I doubt this would be efficient.
- If I end up writing a UDF, should I use the RDD's foreach function?
Or is the RDD API a poor fit for a custom operation like this, and would DataFrames work better here?

I am trying to do this in PySpark. Help or guidance is greatly appreciated!

Answer 1:

This is easy enough if you use sets. The tricky part is grouping, because sets are not hashable and so cannot be used as keys. The workaround used here is to sort each set's elements and join them into a string that serves as the key:

from pyspark import SparkContext

# In the pyspark shell `sc` already exists; getOrCreate() reuses it.
sc = SparkContext.getOrCreate()

# Split each line into items and turn it into a set
rdd = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])\
      .map(lambda l: l.split())\
      .map(set)
childRdd = sc.parallelize(['2 3','5 3','4 7','5 7','5 3','2 3'])\
      .map(lambda l: l.split())\
      .map(set)

# A small utility function to make strings from sets.
# Sets are unordered, so the elements are sorted first
# to make equal sets produce the same grouping key.
def setToString(theset):
    return ''.join(sorted(theset))
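One caveat with `setToString`: joining without a separator works for the single-character items in this example, but different sets of multi-character items can collapse to the same key. The plain-Python sketch below illustrates the collision and a variant that joins with a space. The name `set_to_key` and its `sep` parameter are hypothetical, not part of the original answer.

```python
def set_to_key(items, sep=" "):
    """Build a canonical string key from a collection of string items.

    Hypothetical variant of setToString: sorting makes equal sets produce
    equal keys, and the separator keeps multi-character items apart.
    """
    return sep.join(sorted(items))


# Without a separator, two different itemsets yield the same key.
key_a = "".join(sorted({"1", "23"}))
key_b = "".join(sorted({"12", "3"}))
print(key_a == key_b)  # True: both are "123"

# With a separator, the keys stay distinct.
print(set_to_key({"1", "23"}))  # 1 23
print(set_to_key({"12", "3"}))  # 12 3
```

In plain Python a `frozenset` is hashable and can be used as a dictionary key directly, which avoids building a string at all. Whether that is convenient as a key in a particular Spark job is something to verify rather than assume.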

Now find the pairs where the child is a subset of the parent, and count them per key:

childRdd.cartesian(rdd)\
    .filter(lambda pair: pair[0].issubset(pair[1]))\
    .map(lambda pair: (setToString(pair[0]), pair[1]))\
    .countByKey()

For the above example, the last line returns:

defaultdict(int, {'23': 4, '35': 4, '47': 1, '57': 1})
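Note that '23' and '35' come out as 4 rather than the 2 asked for in the question. Each of them appears twice in the child RDD, and `cartesian` pairs every duplicate with every matching parent. The sketch below reproduces the logic in plain Python on the sample data, but counts each distinct child only once. It is a way to check the expected numbers locally, not a Spark implementation. The function name `support_counts` is hypothetical.

```python
from collections import Counter

parents = ["2 3 5", "4 5 7", "5 4 2 3"]
children = ["2 3", "5 3", "4 7", "5 7", "5 3", "2 3"]


def support_counts(parent_lines, child_lines):
    """Count, for each distinct child itemset, the parents that contain it."""
    parent_sets = [set(line.split()) for line in parent_lines]
    # frozenset is hashable, so duplicate children collapse into one candidate.
    candidates = {frozenset(line.split()) for line in child_lines}

    counts = Counter()
    for candidate in candidates:
        key = " ".join(sorted(candidate))
        for parent in parent_sets:
            if candidate <= parent:  # subset test
                counts[key] += 1
    return dict(counts)


result = support_counts(parents, children)
print(result)
```

In the Spark version, the analogous change would presumably be to deduplicate the child items in a hashable form (for example with `distinct()` on the sorted string keys) before the `cartesian` step. That is a suggestion to test, not something the original answer does.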

Source: https://stackoverflow.com/questions/49468082/is-there-a-inbuilt-function-to-compare-rdds-on-specific-criteria-or-better-to-wr

Tags

apache-spark

dataframe

pyspark