Iterating an RDD and updating a mutable collection returns an empty collection

Spark is a distributed computing engine. In addition to the "what is the code doing" of classic single-node computing, with Spark we also need to consider "where is the code running".

Let's inspect a simplified version of the expression above:

import scala.collection.mutable

val records: RDD[List[String]] = ??? // whatever data
val list: mutable.ListBuffer[String] = mutable.ListBuffer()
for { record <- records
      entry  <- record }
    { list += entry }

The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to the executors, where the inner operation is executed locally. We can rewrite the above like this:

records.foreach { record =>    // RDD.foreach => serializes the closure and executes it remotely
  record.foreach { entry =>    // record.foreach => local operation on the record collection
    list += entry              // this mutable list is updated on each executor but never sent back to the driver; all updates are lost
  }
}
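
To see the difference concretely, here is a minimal sketch with a hypothetical local SparkSession and toy data (the session name spark and the sample records are assumptions, not part of the original question). On a cluster the first println shows an empty buffer, because every task mutates its own deserialized copy of the list; in local mode the outcome is not even guaranteed, which is part of what makes the pattern dangerous:

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

// hypothetical local session and data, just to make the behaviour observable
val spark = SparkSession.builder().master("local[*]").appName("mutation-demo").getOrCreate()
val records = spark.sparkContext.parallelize(Seq(List("a", "b"), List("c")))

// anti-pattern: each task mutates its own copy of the list
val list = mutable.ListBuffer[String]()
records.foreach(record => record.foreach(entry => list += entry))
println(list)                                          // empty on a cluster: the driver's buffer was never touched

// transformation instead: flatten on the executors, then collect the result to the driver
println(records.flatMap(record => record).collect().toList)   // List(a, b, c)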

Mutable objects are, in general, a no-go in distributed computing. Imagine that one executor adds a record and another one removes it: what is the correct result? Or that each executor ends up with a different value: which one is right?
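
The one mechanism Spark does provide for pushing simple side information back to the driver is the accumulator, which is restricted to commutative and associative updates precisely so that the ordering questions above cannot change the result. A minimal sketch, reusing the records RDD and spark session from the sketch above:

// accumulators only accept order-insensitive updates (add), so races between executors cannot skew the result
val entryCount = spark.sparkContext.longAccumulator("entries")

records.foreach(record => entryCount.add(record.size))   // runs on the executors

println(entryCount.value)                                // 3, merged on the driver once the action finishes

Even so, accumulator updates are only guaranteed to be applied exactly once inside actions; a retried task inside a transformation can apply them again, so they are best kept for counters and diagnostics rather than for building the result itself.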

To implement the operation above, we need to transform the data into our desired result.

I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:

import org.apache.spark.sql.Row

// DiscrepancyData is the case class from the question
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
    val key = "year"
    val v1 = row1.getAs[Any](colName).toString // explicit type parameter: a bare getAs lets the compiler infer Nothing
    val v2 = row2.getAs[Any](colName).toString
    if (v1 != v2) {
        Some(DiscrepancyData(
            row1.getAs[Any](key).toString, // fieldKey
            colName,                       // fieldName
            v1,                            // table1Value
            v2,                            // table2Value
            v2))                           // expectedValue
    } else None
}
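
A quick, hypothetical check of compareFields, reusing the local spark session from the earlier sketch (the sample values are made up; getAs by field name needs schema-aware rows, so we build them through a tiny DataFrame):

import spark.implicits._

val Array(row1, row2) = Seq(("2019", "alice"), ("2019", "bob")).toDF("year", "name").collect()

compareFields("name", row1, row2)   // Some(DiscrepancyData("2019", "name", "alice", "bob", "bob"))
compareFields("year", row1, row2)   // None: the values match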

Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:

val discrepancies = table.flatMap { case (str, rows) =>
    val (row1, row2) = (rows.next(), rows.next()) // read the pair of rows once, not once per column
    compareCols.flatMap(col => compareFields(col, row1, row2))
}

We can also use the for-comprehension notation, now that we understand where things are running:

val discrepancies = for {
    (str, rows)  <- table
    (row1, row2) = (rows.next(), rows.next())
    col          <- compareCols
    dis          <- compareFields(col, row1, row2)
} yield dis

Note that discrepancies is of type RDD[DiscrepancyData]. If we want to bring the actual values back to the driver we need to:

val materializedDiscrepancies = discrepancies.collect()

Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.

Imagine this RDD being spread over 200 machines. How could these machines all update the same buffer? They cannot. Each JVM sees its own copy of discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.

To conclude, this is perfectly valid (though not idiomatic) Scala code, but it is not valid Spark code. If you replace the RDD with an Array it will work as expected.
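
For contrast, the same shape of code over a plain local collection behaves as expected, because there is only one JVM and one buffer (hypothetical toy data):

import scala.collection.mutable.ListBuffer

val localRecords: Array[List[String]] = Array(List("a", "b"), List("c"))
val buffer = ListBuffer[String]()

for (record <- localRecords; entry <- record) {
    buffer += entry   // same mutation, but nothing is serialized or shipped to executors
}

println(buffer)       // ListBuffer(a, b, c)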

Try to have a more functional implementation along these lines:

val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???) 