How to determine if object is a valid key-value pair in PySpark

倖福魔咒の 提交于 2019-11-26 17:26:31

问题


  1. If I have a rdd, how do I understand the data is in key:value format? is there a way to find the same - something like type(object) tells me an object's type. I tried print type(rdd.take(1)), but it just says <type 'list'>.
  2. Let's say I have a data like (x,1),(x,2),(y,1),(y,3) and I use groupByKey and got (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key has to be a single value? I noted that if I use reduceByKey and sum function to get the data ((x,3),(y,4)) then it becomes much easier to define this data as a key-value pair

回答1:


Python is a dynamically typed language and PySpark doesn't use any special type for key, value pairs. The only requirement for an object being considered a valid data for PairRDD operations is that it can be unpacked as follows:

k, v = kv

Typically you would use a two element tuple due to its semantics (immutable object of fixed size) and similarity to Scala Product classes. But this is just a convention and nothing stops you from something like this:

key_value.py

class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v
    def __iter__(self):
       for x in [self.k, self.v]:
           yield x
from key_value import KeyValue

rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)]) 

rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]

and make an arbitrary class behave like a key-value. So once again if something can be correctly unpacked as a pair of objects then it is a valid key-value. Implementing __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.

Also type(rdd.take(1)) returns a list of length n so its type will be always the same.



来源:https://stackoverflow.com/questions/35703298/how-to-determine-if-object-is-a-valid-key-value-pair-in-pyspark

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!