How to Distribute Multiprocessing Pool to Spark Workers

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-31 02:34:06

Question


I am trying to use multiprocessing to read 100 CSV files in parallel (and then process each of them in parallel as well). Here is my code, running in a Jupyter notebook hosted on my EMR master node in AWS. (Eventually it will be 100k CSV files, hence the need for distributed reading.)

import findspark
import boto3
from multiprocessing.pool import ThreadPool
import logging
import sys

findspark.init()
from pyspark import SparkContext, SparkConf, sql

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logger = logging.getLogger(__name__)

conf = SparkConf().setMaster("local[*]")
conf.set('spark.scheduler.mode', 'FAIR')
sc = SparkContext.getOrCreate(conf)
spark = sql.SparkSession.builder.master("local[*]").appName("ETL").getOrCreate()

s3 = boto3.resource(...)
bucket = ''
bucketObj = s3.Bucket(bucket)
numNodes = 64

def processTest(key):
    logger.info(key + ' ---- Start\n')
    # buildS3Path, renameColumns, NAME_MAP, and validate are helpers
    # defined elsewhere in the notebook
    fLog = spark.read.option("header", "true") \
                         .option("inferSchema", "true") \
                         .csv(buildS3Path(bucket) + key)
    logger.info(key + ' ---- Finish Read\n')
    fLog = renameColumns(NAME_MAP, fLog)
    logger.info(key + ' ---- Finish Rename\n')
    (landLog, flags) = validate(fLog)
    logger.info(key + ' ---- Finish Validation\n')

files = list(bucketObj.objects.filter(Prefix=subfolder))
keys = list(map(lambda obj: obj.key, files))
# files = s3C.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
p = ThreadPool(numNodes)
p.map(processTest, keys)  # was p.map(process, keys): processTest is the function defined above

It runs fine except that it only uses the master node.
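This behavior is expected from a `ThreadPool` on its own: it only spawns threads inside the single driver process, so with a `local[*]` master every task stays on the machine running the notebook. A minimal, Spark-free sketch (with hypothetical keys) demonstrates that every pool task shares the driver's process ID:

```python
import os
from multiprocessing.pool import ThreadPool

def worker(key):
    # each thread reports the PID of the process it runs in
    return (key, os.getpid())

with ThreadPool(4) as pool:
    results = pool.map(worker, ["a.csv", "b.csv", "c.csv", "d.csv"])

# all tasks share a single PID: the pool never leaves the driver process
pids = {pid for _, pid in results}
print(len(pids))  # 1
```

Distribution across machines has to come from Spark itself, which is why the master URL matters below.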

The blue line is the CPU usage on my master node. All the logs show that I'm running on one machine:

 INFO:pyspark:172.31.29.33

How do I make spark distribute the pool to the workers?


Answer 1:


On a closer read of the SparkSession.Builder API docs, the string passed to SparkSession.builder.master('xxxx') is the host used to connect to the master node via spark://xxxx:7077. As user8371915 said, I needed to stop running against a standalone local master. Instead, this fix worked like a charm:

SparkSession.builder.master('yarn')

https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/SparkSession.Builder.html#master-java.lang.String-



Source: https://stackoverflow.com/questions/50729345/how-to-distribute-multiprocessing-pool-to-spark-workers
