PySpark: Loading multiple partitioned files in a single load

Submitted by 时光怂恿深爱的人放手 on 2019-12-02 07:49:15

Question


I am trying to load multiple files in a single load. They are all partitioned files. When I tried it with one file it worked, but when I listed 24 files it gave me the error below. I could not find any documentation of this limitation, or any workaround other than doing a union after the load. Are there any alternatives?

Code to reproduce the problem:

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',  
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]   

df = sqlContext.read.format('orc') \
               .options(header='true',inferschema='true',basePath=basePath)\
               .load(*paths)

Error received:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc')                .options(header='true', inferschema='true',basePath=basePath)                .load(*paths)
     38 

TypeError: load() takes at most 4 arguments (24 given)

Answer 1:


As explained in the official documentation, to read multiple files, you should pass a list:

path – optional string or a list of string for file-system backed data sources.

So in your case:

(sqlContext.read
    .format('orc') 
    .options(basePath=basePath)
    .load(path=paths))

Argument unpacking (*) would make sense only if load were defined with variadic arguments, for example:

def load(self, *paths):
    ...
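
The difference between the two call styles can be reproduced without Spark. The sketch below uses a hypothetical stand-in, load_stub, that only mirrors the parameter shape of DataFrameReader.load (path, format, schema, then keyword options). It is an illustration of the argument handling, not the library's implementation, and the four paths are a shortened version of the list from the question.

```python
def load_stub(path=None, format=None, schema=None, **options):
    """Hypothetical stand-in that mirrors the parameter shape of
    DataFrameReader.load; it does not read any data."""
    # Accept either a single string or a list of strings for `path`.
    if isinstance(path, str):
        return [path]
    return list(path or [])


paths = ['/file/df201601.orc', '/file/df201602.orc',
         '/file/df201603.orc', '/file/df201604.orc']

# Passing the whole list as one argument is accepted.
loaded = load_stub(path=paths)

# Unpacking spreads the entries over path, format and schema,
# and the fourth entry has no positional parameter left to bind to.
try:
    load_stub(*paths)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc

print(loaded)
print(type(unpack_error).__name__)
```

With 24 paths the same thing happens on a larger scale, which is why the error message in the question counts the arguments rather than complaining about the files themselves.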


Source: https://stackoverflow.com/questions/48344580/pyspark-loading-multiple-partitioned-files-in-a-single-load
