How to find Spark RDD/DataFrame size?


Question


I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File.length returns the size in bytes, but only for local
  // paths; it cannot resolve an hdfs:// URI.
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no length method

But when I process the file this way, I don't get its size. How do I find the size of an RDD?


Answer 1:


If you are simply looking to count the number of rows in the RDD, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
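
One caveat worth hedging: SizeEstimator.estimate measures the object graph it is handed on the driver, so calling it on the RDD handle itself does not account for the distributed data. A minimal sketch of one workaround, assuming distFile is an RDD[String] as above, is to estimate from a driver-side sample (the sample size of 1000 is an arbitrary choice):

import org.apache.spark.util.SizeEstimator

// Estimate the average in-memory size of a row from a small sample,
// then scale by the total row count. This is a rough approximation only.
val sample = distFile.take(1000)
val avgRowBytes =
  sample.map(row => SizeEstimator.estimate(row)).sum.toDouble / sample.length
val approxTotalBytes = (avgRowBytes * distFile.count()).toLong
println(s"Approximate in-memory size: $approxTotalBytes bytes")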




Answer 2:


Yes, finally I got the solution. Include these imports:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD Size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .fold(0L)(_ + _) // add the sizes together; fold handles an empty RDD
}
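
For example (a hedged usage sketch; the path is a placeholder):

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size: ${calcRDDSize(distFile)} bytes")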

Function to find the DataFrame size (this just converts the DataFrame to an RDD internally, so it measures the UTF-8 byte length of each row's string representation, not the actual storage size):

import spark.implicits._ // needed for .toDF(); assumes a SparkSession named spark

val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
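
A hedged alternative, if you are on Spark 2.3 or later, is to ask Catalyst for its own size estimate instead of stringifying rows (sizeInBytes is the optimizer's estimate, not an exact measurement):

// Catalyst's statistics-based size estimate for the plan behind the DataFrame
val estimatedBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate: $estimatedBytes bytes")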



Answer 3:


Below is one way, apart from SizeEstimator, that I use frequently.

To know from code whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many on disk, as well as the actual cache consumption, you can use the SparkContext developer API method getRDDStorageInfo(). Its documentation reads:

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
       res3: Array[org.apache.spark.storage.RDDInfo] = 
       Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); 
       CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
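
A minimal sketch of using it programmatically (the path and the count() trigger are assumptions for illustration):

val rdd = sc.textFile("hdfs://localhost:9000/samplefile.txt").cache()
rdd.count() // materialize the cache

sc.getRDDStorageInfo
  .filter(_.id == rdd.id)
  .foreach { info =>
    println(s"memSize=${info.memSize} B, diskSize=${info.diskSize} B, " +
      s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions")
  }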

It seems the Spark UI also uses the same information:

  • See issue SPARK-17019, whose description follows:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
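
Regarding point 3, a hedged sketch of observing cache usage programmatically via a SparkListener (the listener below reacts to storage block updates; it is one possible approach, not the only one):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Logs every storage block update with its memory and disk footprint.
class CacheLogger extends SparkListener {
  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    println(s"block=${info.blockId} mem=${info.memSize} B disk=${info.diskSize} B")
  }
}

sc.addSparkListener(new CacheLogger)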


Source: https://stackoverflow.com/questions/35008123/how-to-find-spark-rdd-dataframe-size
