How to Distribute Multiprocessing Pool to Spark Workers

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-31 02:34:06

Question


I am trying to use multiprocessing to read 100 CSV files in parallel (and then process each of them in parallel as well). Here is my code, running in a Jupyter notebook hosted on my EMR master node in AWS. (Eventually it will be 100k CSV files, hence the need for distributed reading.)

import findspark
import boto3
from multiprocessing.pool import ThreadPool
import logging
import sys

findspark.init()
from pyspark import SparkContext, SparkConf, sql

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)

conf = SparkConf().setMaster("local[*]")
conf.set('spark.scheduler.mode', 'FAIR')
sc = SparkContext.getOrCreate(conf)
spark = sql.SparkSession.builder.master("local[*]").appName("ETL").getOrCreate()

s3 = boto3.resource(...)
bucket = ''
bucketObj = s3.Bucket(bucket)
numNodes = 64

def processTest(key):
    logger.info(key + ' ---- Start\n')
    # buildS3Path, renameColumns, NAME_MAP, and validate are helpers
    # defined elsewhere in the notebook
    fLog = spark.read.option("header", "true") \
                         .option("inferSchema", "true") \
                         .csv(buildS3Path(bucket) + key)
    logger.info(key + ' ---- Finish Read\n')
    fLog = renameColumns(NAME_MAP, fLog)
    logger.info(key + ' ---- Finish Rename\n')
    (landLog, flags) = validate(fLog)
    logger.info(key + ' ---- Finish Validation\n')

files = list(bucketObj.objects.filter(Prefix=subfolder))
keys = list(map(lambda obj: obj.key, files))
# files = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
p = ThreadPool(numNodes)
p.map(processTest, keys)  # was p.map(process, keys): processTest is the function defined above

It runs fine except that it only uses the master node.
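This behavior is expected from a `ThreadPool` on its own: it only spawns threads inside the single driver process, so with a `local[*]` master every task stays on the machine running the notebook. A minimal, Spark-free sketch (with hypothetical keys) demonstrates that every pool task shares the driver's process ID:

```python
import os
from multiprocessing.pool import ThreadPool

def worker(key):
    # each thread reports the PID of the process it runs in
    return (key, os.getpid())

with ThreadPool(4) as pool:
    results = pool.map(worker, ["a.csv", "b.csv", "c.csv", "d.csv"])

# all tasks share a single PID: the pool never leaves the driver process
pids = {pid for _, pid in results}
print(len(pids))  # 1
```

Distribution across machines has to come from Spark itself, which is why the master URL matters below.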

The blue line is the CPU usage on my master node. All the logs show that I'm running on one machine:

 INFO:pyspark:172.31.29.33

How do I make spark distribute the pool to the workers?


Answer 1:


On a closer read of the SparkSession.Builder API docs, the string passed to SparkSession.builder.master('xxxx') is the host used to connect to the master node via spark://xxxx:7077. As user8371915 said, I needed to stop running against a standalone local master. Instead, this fix worked like a charm:

SparkSession.builder.master('yarn')

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#master-java.lang.String-



Source: https://stackoverflow.com/questions/50729345/how-to-distribute-multiprocessing-pool-to-spark-workers
