Improve parallelism in Spark SQL


Question


I have the code below. I am using PySpark 1.2.1 with Python 2.7 (CPython).

for colname in shuffle_columns:
    # Pull one column out of the staging table.
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    # zip_with_random_index is expensive
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    # Apply the schema and expose the result as a temp table.
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable(a_name))

The problem with this code is that it operates on only one column at a time. I have enough nodes in my cluster that I could be working on many columns at once. Is there a way to do this in Spark? What if I used threads or the like: could I kick off multiple registerTempTable calls (and the associated collection operations) in parallel that way?


Answer 1:


Update: the approach below does not work well in practice. It works in the sense that all of the individual iterations execute, but subsequent calls to the hive_context object fail with a NullPointerException.


This is possible with concurrent.futures:

from concurrent import futures

def make_col_temptable(colname):
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    # zip_with_random_index is expensive
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable(a_name))

# Submit one task per column and block until all of them have finished.
with futures.ThreadPoolExecutor(max_workers=20) as executor:
    futures.wait([executor.submit(make_col_temptable, colname)
                  for colname in shuffle_columns])
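
For reference, here is a minimal, self-contained sketch of the same ThreadPoolExecutor pattern that runs without Spark; process_column and the column names are hypothetical stand-ins for make_col_temptable and shuffle_columns. On Python 2.7, concurrent.futures is available through the futures backport package.

from concurrent import futures

def process_column(colname):
    # Stand-in for the expensive per-column Spark work; it just
    # returns a string so the example runs anywhere.
    return 'processed_%s' % colname

columns = ['col_a', 'col_b', 'col_c']

with futures.ThreadPoolExecutor(max_workers=3) as executor:
    done, not_done = futures.wait(
        [executor.submit(process_column, c) for c in columns])

# result() returns each task's value, or re-raises the exception
# that killed it, which is where a per-column failure would surface.
for f in done:
    print(f.result())

Note that futures.wait only blocks until the tasks complete; exceptions raised inside a task stay hidden until you call result() on the corresponding future, so it is worth iterating over the finished futures even when you do not need the return values.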


Source: https://stackoverflow.com/questions/32217242/improve-parallelism-in-spark-sql
