Spark task runs on only one executor


Question


Hello everyone. First and foremost, I'm aware of the existence of this thread: Task is running on only one executor in spark.
However, this is not my case, as I'm using repartition(n) on my DataFrame.
Basically, I'm loading a DataFrame by fetching data from an Elasticsearch index through Spark as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()

es_reader = (spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.read.field.include", includeFieldsString)
        .option("es.query", q)
        .option("es.nodes", elasticClusterIP)
        .option("es.port", port)
        .option("es.resource", indexNameTable)
        .option("es.nodes.wan.only", "true")
        .option("es.net.ssl", "true")
        .option("es.net.ssl.cert.allow.self.signed", "true")
        .option("es.net.http.auth.user", elkUser)
        .option("es.net.http.auth.pass", elkPassword)
        .option("es.read.metadata", "false")
        .option("es.read.field.as.array.include", "system_auth_hostname")
        #.option("es.mapping.exclude", "index")
        #.option("es.mapping.id", "_id")
        #.option("es.read.metadata._id", "_id")
        #.option("delimiter", ",")
        #.option("inferSchema", "true")
        #.option("first_row_is_header", "true")
        )

df = es_reader.load()

YARN correctly assigns 2 executors to my application by default, as I did not specify otherwise. The DataFrame is not partitioned when loading data from Elasticsearch, so I ran the following to check the executor behavior:

df = df.repartition(2)
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
>> Number of partitions: 2
df.count()

I expected to see, in the Spark UI, both executors working on the count() task. However, I get a strange behavior: three tasks are completed (without repartitioning, an action takes two tasks), and the first, longest one is run by only one executor, as you can see in the following image: Count()-twoExecutor-twoPartitions.
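
As a sanity check, a minimal sketch like the following (my addition, using the same df as above) can count the rows per partition to confirm that repartition(2) really produced two non-empty partitions:

# Count the rows in each partition; two similar, non-zero numbers would
# confirm that repartition(2) actually distributed the data, so the
# imbalance would be in the execution rather than in the partitioning.
partition_sizes = df.rdd.glom().map(len).collect()
print(partition_sizes)  # e.g. [12345, 12346] (illustrative values)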

Things work as expected if I save to and load from a Hive table that is already partitioned in two (OS: linux/windows):

df.write.mode('overwrite').format("parquet").partitionBy('OS').saveAsTable('test_executors')
df2 = spark.read.load("path")
df2.count()
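
For comparison, a hedged sanity check (not in the original question) on the reloaded DataFrame:

# Confirm how many read partitions Spark created for the reloaded table.
# Note that this depends on file sizes and spark.sql.files.maxPartitionBytes,
# so it is not guaranteed to equal the number of 'OS' partition directories.
print('Number of partitions: {}'.format(df2.rdd.getNumPartitions()))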

In this case I get the following: Count()-twoExecutor-twoPartitions_loadFromHiveTable

Here I get the desired behavior, with both executors concurrently working on the count() task.
The problem seems to lie in repartition(n): I believe it correctly partitions the DataFrame (I checked with df.rdd.getNumPartitions()), but it doesn't parallelize the work among the executors.
If necessary, I can provide the details of the tasks in the attached images. Thanks in advance!
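
To narrow this down, a possible diagnostic sketch (again my addition, assuming the same df as above) is to record which worker host processes each partition:

import socket

# Record the hostname of the worker that processes each partition.
# Two distinct hostnames would mean both executors received work; a single
# repeated hostname would suggest one executor is doing everything.
# Caveat: this distinguishes nodes, not executors, so it is only conclusive
# when the two executors run on different hosts.
hosts = df.rdd.mapPartitions(lambda _: [socket.gethostname()]).collect()
print(hosts)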

Source: https://stackoverflow.com/questions/64972467/spark-task-runs-on-only-one-executor
