Question
The original dataset is:
# (number_of_ratings, title, avg_rating)
newRDD = [(3, 'monster', 4), (4, 'minions 3D', 5), ....]
I want to select the N tuples with the highest avg_rating from newRDD. I tried the following code, but it raises an error:
selectnewRDD = (newRDD.map(x, key =lambda x: x[2]).sortBy(......))
TypeError: map() takes no keyword arguments
The expected result is:
# (number_of_ratings, title, avg_rating)
selectnewRDD = [(4, 'minions 3D', 5), (3, 'monster', 4), ....]
Answer 1:
You can use either top or takeOrdered with a key argument:
newRDD.top(2, key=lambda x: x[2])
or
newRDD.takeOrdered(2, key=lambda x: -x[2])
Note that top returns elements in descending order, while takeOrdered returns them in ascending order, so the key function has to differ between the two calls.
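To see why negating the key makes the two calls agree, here is a minimal plain-Python sketch that runs without Spark. It only imitates the ordering behaviour described above using the standard heapq module; it is an analogy, not Spark's API. The helper names local_top and local_take_ordered are hypothetical, and the third tuple in the sample data is made up for illustration.

```python
import heapq

# (number_of_ratings, title, avg_rating)
# The first two tuples come from the question; the third is invented.
ratings = [(3, 'monster', 4), (4, 'minions 3D', 5), (2, 'frozen', 3)]


def local_top(data, n, key):
    """Return the n largest elements by key, in descending order."""
    return heapq.nlargest(n, data, key=key)


def local_take_ordered(data, n, key):
    """Return the n smallest elements by key, in ascending order."""
    return heapq.nsmallest(n, data, key=key)


# Largest avg_rating first.
by_top = local_top(ratings, 2, key=lambda x: x[2])

# Smallest negated avg_rating first, which is the same ordering.
by_take_ordered = local_take_ordered(ratings, 2, key=lambda x: -x[2])

print(by_top)
print(by_take_ordered)
```

Both calls yield the two highest-rated tuples in the same order, which mirrors why the takeOrdered key carries the minus sign.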
Answer 2:
Have you tried using top? Since you want the highest average ratings, and avg_rating is the third item in each tuple, pass a lambda function as the key that selects that item.
# items = (number_of_ratings, title, avg_rating)
newRDD = sc.parallelize([(3, 'monster', 4), (4, 'minions 3D', 5)])
top_n = 10
>>> newRDD.top(top_n, key=lambda items: items[2])
[(4, 'minions 3D', 5), (3, 'monster', 4)]
Source: https://stackoverflow.com/questions/31882221/spark-select-top-values-in-rdd