I need to process multiple files scattered across various directories. I would like to load them all into a single RDD and then perform map/reduce on it. I see that SparkContext's textFile reads only one path at a time, so how can I combine several files into one RDD?
How about this phrasing instead?
sc.union([sc.textFile(basepath + "/" + f) for f in files])
In Scala, SparkContext.union() has two variants: one that takes vararg arguments and one that takes a list. Only the second one exists in Python, since Python does not have method overloading.
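For reference, a minimal, self-contained sketch of the union approach; the basepath, files values, and app name below are made up for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="union-example")

# Hypothetical inputs: a base directory plus file paths scattered under it.
basepath = "/data/logs"
files = ["2015/01/events.txt", "2015/02/events.txt"]

# Each textFile call yields an RDD of lines; union merges them into one RDD.
rdd = sc.union([sc.textFile(basepath + "/" + f) for f in files])
print(rdd.count())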
UPDATE
You can use a single textFile call to read multiple files.
sc.textFile(','.join(files))
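A sketch using the same hypothetical paths; the string passed to textFile is handed to Hadoop's input format, so comma-separated paths and glob patterns like dir/*.txt both work:

from pyspark import SparkContext

sc = SparkContext(appName="textfile-example")

# Hypothetical paths; any mix of files, directories, and globs is accepted.
files = ["/data/logs/2015/01/events.txt", "/data/logs/2015/02/events.txt"]

# One textFile call over a comma-joined path string produces a single RDD.
rdd = sc.textFile(','.join(files))

# Equivalently, a glob can pick up the same files without listing them.
rdd_glob = sc.textFile("/data/logs/*/events.txt")

print(rdd.count(), rdd_glob.count())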