Question
Imagine I do some Spark operations on a file hosted in HDFS. Something like this:
val file = sc.textFile("hdfs://...")
val items = file.map(_.split('\t'))
...
Because in the Hadoop world the code should go where the data is, right?
So my question is: how do Spark workers know about the HDFS DataNodes? How does Spark decide on which DataNodes to execute the code?
Answer 1:
Spark reuses the Hadoop input classes: when you call textFile, it creates a TextInputFormat, whose getSplits method carves the input into splits (a split roughly corresponds to a partition, i.e. an HDFS block), and each InputSplit exposes getLocations and getLocationInfo methods that report the hosts storing that block. Spark's scheduler then uses those hosts as locality preferences when deciding where to launch each task.
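Here is a minimal sketch of that mechanism, using Hadoop's newer mapreduce API (Spark's textFile actually goes through the older mapred variant of TextInputFormat, but the calls are analogous); the HDFS URI, hostname, and file path are hypothetical placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.collection.JavaConverters._

val job = Job.getInstance(new Configuration())
// hypothetical input path; point this at a real HDFS file
FileInputFormat.addInputPath(job, new Path("hdfs://namenode:8020/data/input.tsv"))

// one InputSplit per HDFS block (roughly); each split knows where its replicas live
val splits = new TextInputFormat().getSplits(job).asScala
for (split <- splits) {
  // getLocations returns the hostnames of the DataNodes holding this block's replicas
  println(s"$split -> ${split.getLocations.mkString(", ")}")
}

On the Spark side, HadoopRDD surfaces those hostnames as each partition's preferred locations, and the scheduler tries to run the corresponding task on (or near) one of those nodes.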
Source: https://stackoverflow.com/questions/28481693/how-does-apache-spark-know-about-hdfs-data-nodes