I'm using a broadcast variable about 100 MB pickled in size, which I'm approximating with:
>>> data = list(range(int(10*1e6)))
Well, the devil is in the detail. To understand why this may happen we'll have to take a closer look at the PySpark serializers. First, let's create a SparkContext
with default settings:
from pyspark import SparkContext
sc = SparkContext("local", "foo")
and check what the default serializer is:
sc.serializer
## AutoBatchedSerializer(PickleSerializer())
sc.serializer.bestSize
## 65536
It tells us three different things:

- the serializer is an AutoBatchedSerializer
- it uses PickleSerializer to perform the actual job
- the bestSize of the serialized batch is 65536 bytes

A quick glance at the source code will show you that this serializer adjusts the number of records serialized at a time at runtime, and tries to keep the batch size below 10 * bestSize. The important point is that not all records in a single partition are serialized at the same time.
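To make that concrete, here is a minimal sketch of the adaptive batching logic (my own illustration, not the actual AutoBatchedSerializer code; the function name and the use of plain pickle.dumps are assumptions for the example):

from itertools import islice
import pickle

def dump_stream_sketch(records, best_size=65536):
    # Illustrative stand-in for the adaptive batching described above.
    records, batch = iter(records), 1
    while True:
        vs = list(islice(records, batch))
        if not vs:
            break
        data = pickle.dumps(vs)  # the whole batch is pickled together
        yield data
        # Grow the batch while the output stays small,
        # shrink it when it exceeds 10 * best_size.
        if len(data) < best_size:
            batch *= 2
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2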
We can check that experimentally as follows:
bd = sc.broadcast({})
rdd = sc.parallelize(range(10), 1).map(lambda _: bd.value)
rdd.map(id).distinct().count()
## 1
rdd.cache().count()
## 10
rdd.map(id).distinct().count()
## 2
As you can see, even in this simple example, after serialization-deserialization we get two distinct objects. You can observe similar behavior working directly with pickle
:
import pickle

v = {}
vs = [v, v, v, v]
v1, *_, v4 = pickle.loads(pickle.dumps(vs))
v1 is v4
## True
(v1_, v2_), (v3_, v4_) = (
pickle.loads(pickle.dumps(vs[:2])),
pickle.loads(pickle.dumps(vs[2:]))
)
v1_ is v4_
## False
v3_ is v4_
## True
Values serialized in the same batch reference, after unpickling, the same object. Values from different batches point to different objects.
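One way to see why is to disassemble the pickle stream with the standard library's pickletools (the exact opcodes depend on the pickle protocol in use):

import pickle
import pickletools

v = {}
pickletools.dis(pickle.dumps([v, v]))
# The first occurrence of v is stored in the pickle memo and the second one
# is emitted as a memo lookup (a GET/BINGET opcode), so unpickling a single
# dump reproduces one shared object. Separate dumps have separate memos.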
In practice Spark provides multiple serializers and different serialization strategies. You can, for example, use batches of unlimited size:
from pyspark.serializers import BatchedSerializer, PickleSerializer
rdd_ = (sc.parallelize(range(10), 1).map(lambda _: bd.value)
        ._reserialize(BatchedSerializer(PickleSerializer())))
rdd_.cache().count()
rdd_.map(id).distinct().count()
## 1
You can change the serializer by passing the serializer and/or batchSize parameters to the SparkContext constructor:
sc.stop()  # only one SparkContext can be active, so stop the previous one first
sc = SparkContext(
"local", "bar",
serializer=PickleSerializer(), # Default serializer
# Unlimited batch size -> BatchedSerializer instead of AutoBatchedSerializer
batchSize=-1
)
sc.serializer
## BatchedSerializer(PickleSerializer(), -1)
Choosing different serializers and batching strategies results in different trade-offs (speed, ability to serialize arbitrary objects, memory requirements, etc.).
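For instance, if your records consist only of simple types, you could trade generality for speed with MarshalSerializer (a sketch; the "baz" app name and the batch size are arbitrary, and it assumes no other SparkContext is active):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

sc = SparkContext(
    "local", "baz",
    serializer=MarshalSerializer(),  # faster, but supports fewer Python types
    batchSize=1024  # fixed batch size -> BatchedSerializer(MarshalSerializer(), 1024)
)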
You should also remember that broadcast variables in Spark are not shared between executor threads, so multiple deserialized copies can exist on the same worker at the same time.
Moreover, you'll see similar behavior if you execute a transformation that requires shuffling.