Spark: get number of cluster cores programmatically

庸人自扰 2020-12-09 05:35

I run my Spark application on a YARN cluster. In my code I use the number of cores available to the queue to create partitions on my dataset:

Dataset ds = ...
ds.coalesce(...); // want to pass the number of available cores here


        
4 Answers
  •  旧巷少年郎
    2020-12-09 06:00

    You could run jobs on every machine and ask it for the number of cores, but that's not necessarily what's available for Spark (as pointed out by @tribbloid in a comment on another answer):

    import spark.implicits._
    import scala.collection.JavaConverters._
    import sys.process._
    // Distribute 1000 rows across the cluster; for each row, record the executor's
    // hostname and the JVM's available processor count, then collect the pairs into
    // a Map (one entry per host).
    val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
    val nCpus = procs.values.sum
    

    Running it in the shell (on a tiny test cluster with two workers) gives:

    scala> :paste
    // Entering paste mode (ctrl-D to finish)
    
        import spark.implicits._
        import scala.collection.JavaConverters._
        import sys.process._
        val procs = (1 to 1000).toDF.map(_ => "hostname".!!.trim -> java.lang.Runtime.getRuntime.availableProcessors).collectAsList().asScala.toMap
        val nCpus = procs.values.sum
    
    // Exiting paste mode, now interpreting.
    
    import spark.implicits._                                                        
    import scala.collection.JavaConverters._
    import sys.process._
    procs: scala.collection.immutable.Map[String,Int] = Map(ip-172-31-76-201.ec2.internal -> 2, ip-172-31-74-242.ec2.internal -> 2)
    nCpus: Int = 4
    

    Add zeros to the range if you typically have lots of machines in your cluster, so that at least one row lands on every host. Even on my two-machine cluster, 10000 completes in a couple of seconds.
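
    Tying this back to the question, here is a minimal sketch of how the collected total could drive the partition count (the dataset below is a stand-in for the question's ds; nCpus is the value computed above):

    val ds = spark.range(0, 1000000)          // stand-in for the question's Dataset
    val partitioned = ds.repartition(nCpus)   // or ds.coalesce(nCpus) to only reduce partitions
    partitioned.rdd.getNumPartitions          // now equals nCpus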

    This is probably only useful if you want more information than sc.defaultParallelism will give you (as in @SteveC's answer).
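
    For reference, a minimal sketch of reading sc.defaultParallelism, together with the executor list from the status tracker; the exact numbers depend on your cluster manager and configuration, so treat this as an approximation rather than a guarantee:

    val sc = spark.sparkContext
    // On core-based cluster managers this usually reflects the total number of
    // cores granted to the application (or a configured override).
    val coresForApp = sc.defaultParallelism
    // Executors known to the status tracker; the list typically includes the
    // driver as well, hence the - 1.
    val numExecutors = sc.statusTracker.getExecutorInfos.length - 1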
