I have some questions about parameter servers and the gradient aggregation they perform. My main source is the Dive into Deep Learning book [1]. I assume the bulk synchronous parallel (BSP) model, i.e.