Question
How do I count the occurrences of the elements of a child RDD in a parent RDD?
Say I have two RDDs.
Parent RDD -
['2 3 5']
['4 5 7']
['5 4 2 3']
Child RDD -
['2 3','5 3','4 7','5 7','5 3','2 3']
I need something like -
[['2 3',2],['5 3',2],['4 7',1],['5 7',1],['5 3',2] ...]
It's actually finding the frequent item candidate set from the parent set.
Now, the child RDD can initially contain either string elements or lists, i.e.
['1 2','2 3']
or [[1,2],[2,3]]
since I will implement whichever data structure fits best.
Question -
- Are there built-in functions that could do something similar to what I am trying to achieve with these two RDDs? Any transformations?
- Or do I need to write a UDF that parses each element of the child and compares it to the parent? I have a lot of data, so I doubt this would be efficient.
- In case I end up writing a UDF, should I use the foreach function of the RDD?
- Or is the RDD framework not a good fit for a custom operation like this, and would dataframes work better here?
I am trying to do this in PySpark. Help or guidance is greatly appreciated!
Answer 1:
It's easy enough if you use sets. The trick is the grouping, because sets cannot be used as keys. The alternative used here is to sort each set's elements and join them into a string that serves as the corresponding key:
rdd = sc.parallelize(['2 3 5', '4 5 7', '5 4 2 3'])\
    .map(lambda l: l.split())\
    .map(set)
childRdd = sc.parallelize(['2 3','5 3','4 7','5 7','5 3','2 3'])\
    .map(lambda l: l.split())\
    .map(set)

# A small utility function to make strings from sets.
# The point is the ordering, so that grouping can match keys;
# that's needed because sets aren't ordered.
def setToString(theset):
    lst = list(theset)
    lst.sort()
    return ''.join(lst)
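One caveat worth noting: joining without a separator works for the single-character items in this example, but items with more than one character could collide. The plain-Python sketch below illustrates this and shows one possible alternative, a sorted tuple, which is hashable and keeps item boundaries intact. The names `set_to_string` and `set_to_key` are hypothetical helpers introduced only for this illustration; they are not part of the original answer.

```python
def set_to_string(items):
    # Same idea as setToString above: sort, then join with no separator.
    return ''.join(sorted(items))

def set_to_key(items):
    # Hypothetical alternative: a sorted tuple is hashable and
    # keeps the boundaries between items.
    return tuple(sorted(items))

a = {'1', '23'}
b = {'12', '3'}

# Both sets collapse to the same string, so they would be grouped together.
collide = set_to_string(a) == set_to_string(b)

# The tuple keys stay distinct.
distinct = set_to_key(a) != set_to_key(b)

print(collide, distinct)
```

If the items are always single characters, as in the sample data, the original string key is sufficient.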
Now find the pairs where the child is a subset of the parent:
childRdd.cartesian(rdd)\
    .filter(lambda l: set(l[0]).issubset(set(l[1])))\
    .map(lambda pair: (setToString(pair[0]), pair[1]))\
    .countByKey()
For the above example, the last line returns:
defaultdict(int, {'23': 4, '35': 4, '47': 1, '57': 1})
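Note that '23' and '35' come out as 4 rather than the 2 asked for in the question, because each of those pairs appears twice in the child RDD and every occurrence is matched against every parent. The sketch below reproduces the same logic in plain Python (no Spark needed) to make that visible, and shows that deduplicating the children first yields the counts from the question. It is only an illustration of the counting logic, assuming the sample data above; `set_to_string` is a hypothetical stand-in for setToString.

```python
from collections import Counter
from itertools import product

parents = [set(s.split()) for s in ['2 3 5', '4 5 7', '5 4 2 3']]
children = [set(s.split()) for s in ['2 3', '5 3', '4 7', '5 7', '5 3', '2 3']]

def set_to_string(items):
    # Stand-in for setToString: sorted items joined into one key.
    return ''.join(sorted(items))

# Mirrors cartesian + filter + countByKey: one hit per (child, parent)
# pair in which the child is a subset of the parent.
counts = Counter(
    set_to_string(c) for c, p in product(children, parents) if c <= p
)

# Deduplicate the children first, so each candidate is counted once per parent.
unique_children = {frozenset(c) for c in children}
unique_counts = Counter(
    set_to_string(c) for c, p in product(unique_children, parents) if c <= p
)

print(dict(counts))
print(dict(unique_counts))
```

On the Spark side, the equivalent would presumably be to deduplicate the child RDD on a hashable representation (for example with `distinct()`) before calling `cartesian`.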
Source: https://stackoverflow.com/questions/49468082/is-there-a-inbuilt-function-to-compare-rdds-on-specific-criteria-or-better-to-wr