Where does spark look for text files?

Submitted by 三世轮回 on 2021-02-06 10:18:28

Question


I thought that loading text files was done only from the workers / within the cluster (you just need to make sure all workers have access to the same path, either by having the text file available on all nodes, or by using some shared folder mapped to the same path).

e.g. spark-submit / spark-shell can be launched from anywhere and connect to a Spark master, and the machine where you launch spark-submit / spark-shell (which is also where the driver runs, unless you are in "cluster" deploy mode) has nothing to do with the cluster. Therefore any data loading should happen only on the workers, not on the driver machine, right? e.g. there should be no way that sc.textFile("file:///somePath") causes Spark to look for a file on the driver machine (again, the driver is external to the cluster, e.g. in "client" deploy mode / standalone mode), right?

Well, this is what I thought too...

Our cast

  • machine A: where the driver runs
  • machine B: where both spark master and one of the workers run

Act I - The Hope

When I start a spark-shell from machine B, pointing to the Spark master on B, I get this:

scala> sc.master
res3: String = spark://machineB:7077

scala> sc.textFile("/tmp/data/myfile.csv").count()
res4: Long = 976

Act II - The Conflict

But when I start a spark-shell from machine A, pointing to the Spark master on B, I get this:

scala> sc.master
res2: String = spark://machineB:7077

scala> sc.textFile("/tmp/data/myfile.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/data/myfile.csv

And indeed /tmp/data/myfile.csv does not exist on machine A, but machine A is not in the cluster; it's just where the driver runs.

Act III - The Amazement

What's even weirder is that if I make the file available on machine A, it no longer throws this error. (Instead it creates a job but no tasks, and just fails due to a timeout, which is another issue that deserves a separate question.)

Is there something about the way Spark behaves that I'm missing? I thought that a spark-shell connected to a remote master has nothing to do with the machine it runs on. So why does the error stop when I make that file available on machine A? Does it mean that sc.textFile resolves its path on the machine where spark-shell or spark-submit was launched (in my case, also where the driver runs)? This makes zero sense to me, but again, I'm open to learning new things.

Epilogue

tl;dr - sc.textFile("file:/somePath") is run from a driver on machine A against a cluster on machines B, C, D... (driver not part of the cluster)

It seems like it's also looking for the path file:/somePath on the driver. Is that true (or is it just me)? Is that known? Is that by design?

I have a feeling this is some weird network / VPN topology issue unique to my workplace network, but still, this is what happens to me, and I'm utterly confused about whether it is just me or known behavior. (Or I'm simply not getting how Spark works, which is always an option.)


Answer 1:


The really short version of the answer is: if you reference "file://...", the file should be accessible on all nodes in your cluster, including the driver program, since some bits of the work happen on the driver as well as on the workers (in particular, the driver computes the input splits, which is why the path is checked there first). Generally the way around this is not to use local files at all, and instead to use something like S3, HDFS, or another network filesystem. There is also the sc.addFile method, which distributes a file from the driver to all of the other nodes (you then use SparkFiles.get to resolve the download location).
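As a minimal sketch of the addFile / SparkFiles.get approach (the path here is taken from the question; everything else is illustrative, not the asker's actual code):

```scala
import org.apache.spark.SparkFiles

// The file only needs to exist on the driver machine; Spark ships it
// to every executor when tasks run.
sc.addFile("/tmp/data/myfile.csv")

// On each executor, SparkFiles.get resolves the path of the local copy.
val lineCount = sc.parallelize(Seq(1))
  .mapPartitions { _ =>
    val localPath = SparkFiles.get("myfile.csv") // name only, not the full path
    scala.io.Source.fromFile(localPath).getLines()
  }
  .count()
```

Note that SparkFiles.get takes just the file name, not the original path, and returns wherever Spark downloaded the file on that node.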




Answer 2:


Spark can look for files either locally or on HDFS.

If you'd like to read in a file using sc.textFile() and take advantage of its RDD format, then the file should sit on HDFS. If you just want to read in a file the normal, non-distributed way, you use the ordinary file-reading facilities of whichever API you're on (Scala, Java, Python).

If you submit a local file with your driver, then addFile() distributes the file to each node, and SparkFiles.get() resolves the path of the local temporary copy on that node.
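The local-vs-HDFS distinction above can be made explicit with URI schemes (the host name and paths below are hypothetical):

```scala
// "file://" paths are resolved on each machine that touches them, so the
// file must exist at that path on every node (and, per Answer 1, on the
// driver too, since the driver lists the path to compute splits).
val localRdd = sc.textFile("file:///tmp/data/myfile.csv")

// "hdfs://" paths are served by the cluster's shared storage, so any
// node can read them regardless of its local filesystem.
val hdfsRdd = sc.textFile("hdfs://namenode:8020/data/myfile.csv")
```

When no scheme is given, Spark falls back to the default filesystem configured in its Hadoop configuration, which is why a bare "/tmp/data/myfile.csv" may behave differently across setups.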



Source: https://stackoverflow.com/questions/32464894/where-does-spark-look-for-text-files
