Improve parallelism in Spark SQL


Question


I have the code below. I am using PySpark 1.2.1 with Python 2.7 (CPython).

for colname in shuffle_columns:
    # Pull one column out of the staging table.
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    # zip_with_random_index is expensive
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    # Apply the schema and expose the result as a temp table.
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable(a_name))

The problem with this code is that it operates on only one column at a time. I have enough nodes in my cluster that I could be working on many columns at once. Is there a way to do this in Spark? What if I used threads or the like: could I kick off multiple registerTempTable calls (and the associated collection operations) in parallel that way?


Answer 1:


Update: the approach below does not work well in practice. It works in the sense that all of the individual iterations execute, but subsequent calls to the hive_context object fail with a NullPointerException.


This is possible with concurrent.futures:

from concurrent import futures

def make_col_temptable(colname):
    colrdd = hive_context.sql('select %s from %s' % (colname, temp_table))
    # zip_with_random_index is expensive
    colwidx = zip_with_random_index(colrdd).map(merge_index_on_row)
    (hive_context.applySchema(colwidx, a_schema)
        .registerTempTable(a_name))

# Submit one task per column and block until all of them have finished.
with futures.ThreadPoolExecutor(max_workers=20) as executor:
    futures.wait([executor.submit(make_col_temptable, colname)
                  for colname in shuffle_columns])
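
For reference, here is a minimal, self-contained sketch of the same ThreadPoolExecutor pattern that runs without Spark; process_column and the column names are hypothetical stand-ins for make_col_temptable and shuffle_columns. On Python 2.7, concurrent.futures is available through the futures backport package.

from concurrent import futures

def process_column(colname):
    # Stand-in for the expensive per-column Spark work; it just
    # returns a string so the example runs anywhere.
    return 'processed_%s' % colname

columns = ['col_a', 'col_b', 'col_c']

with futures.ThreadPoolExecutor(max_workers=3) as executor:
    done, not_done = futures.wait(
        [executor.submit(process_column, c) for c in columns])

# result() returns each task's value, or re-raises the exception
# that killed it, which is where a per-column failure would surface.
for f in done:
    print(f.result())

Note that futures.wait only blocks until the tasks complete; exceptions raised inside a task stay hidden until you call result() on the corresponding future, so it is worth iterating over the finished futures even when you do not need the return values.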


Source: https://stackoverflow.com/questions/32217242/improve-parallelism-in-spark-sql
