I am new to Scala and Spark and trying to understand a few basic things here.
Spark version used: 1.5.
Why does the value of sum not get updated by the foreach?
The way you're reasoning about the program is wrong. foreach is executed independently on each executor and modifies its own copy of sum; that copy is never sent back to the driver, so the driver-side sum stays untouched. There is no global shared state here. Just count the values directly:
df.select("column1").distinct.count
If you really want to handle this manually, you'll need some kind of reduce:
df.select("column1").distinct.rdd.map(_ => 1L).reduce(_ + _)