How to find Spark RDD/DataFrame size?


Question


I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?

Scala:

object Main extends App {
  // java.io.File.length returns the size in bytes, but only for local
  // paths; it cannot resolve an hdfs:// URI.
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length)
}

Spark:

val distFile = sc.textFile(file)
println(distFile.length) // does not compile: RDD has no length method

But when I process the file this way, I don't get its size. How do I find the size of an RDD?


Answer 1:


If you are simply looking to count the number of rows in the RDD, do:

val distFile = sc.textFile(file)
println(distFile.count)

If you are interested in the bytes, you can use the SizeEstimator:

import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))

https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
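
One caveat worth hedging: SizeEstimator.estimate measures the object graph it is handed on the driver, so calling it on the RDD handle itself does not account for the distributed data. A minimal sketch of one workaround, assuming distFile is an RDD[String] as above, is to estimate from a driver-side sample (the sample size of 1000 is an arbitrary choice):

import org.apache.spark.util.SizeEstimator

// Estimate the average in-memory size of a row from a small sample,
// then scale by the total row count. This is a rough approximation only.
val sample = distFile.take(1000)
val avgRowBytes =
  sample.map(row => SizeEstimator.estimate(row)).sum.toDouble / sample.length
val approxTotalBytes = (avgRowBytes * distFile.count()).toLong
println(s"Approximate in-memory size: $approxTotalBytes bytes")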




Answer 2:


Yes, finally I got the solution. Include these imports:

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

How to find the RDD Size:

def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .fold(0L)(_ + _) // add the sizes together; fold handles an empty RDD
}
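
For example (a hedged usage sketch; the path is a placeholder):

val distFile = sc.textFile("hdfs://localhost:9000/samplefile.txt")
println(s"RDD size: ${calcRDDSize(distFile)} bytes")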

Function to find the DataFrame size (this just converts the DataFrame to an RDD internally, so it measures the UTF-8 byte length of each row's string representation, not the actual storage size):

import spark.implicits._ // needed for .toDF(); assumes a SparkSession named spark

val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path

val rddOfDataframe = dataFrame.rdd.map(_.toString())

val size = calcRDDSize(rddOfDataframe)
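
A hedged alternative, if you are on Spark 2.3 or later, is to ask Catalyst for its own size estimate instead of stringifying rows (sizeInBytes is the optimizer's estimate, not an exact measurement):

// Catalyst's statistics-based size estimate for the plan behind the DataFrame
val estimatedBytes = dataFrame.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Catalyst size estimate: $estimatedBytes bytes")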



Answer 3:


Below is one way, apart from SizeEstimator, that I use frequently.

To know from code whether an RDD is cached, and more precisely how many of its partitions are cached in memory and how many on disk, as well as the actual cache consumption, you can use the SparkContext developer API method getRDDStorageInfo(). Its documentation reads:

Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.

For example:

scala> sc.getRDDStorageInfo
       res3: Array[org.apache.spark.storage.RDDInfo] = 
       Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); 
       CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
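
A minimal sketch of using it programmatically (the path and the count() trigger are assumptions for illustration):

val rdd = sc.textFile("hdfs://localhost:9000/samplefile.txt").cache()
rdd.count() // materialize the cache

sc.getRDDStorageInfo
  .filter(_.id == rdd.id)
  .foreach { info =>
    println(s"memSize=${info.memSize} B, diskSize=${info.diskSize} B, " +
      s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions")
  }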

It seems the Spark UI also uses the same information:

  • See issue SPARK-17019, whose description follows:

Description
With SPARK-13992, Spark supports persisting data into off-heap memory, but the usage of off-heap is not exposed currently, it is not so convenient for user to monitor and profile, so here propose to expose off-heap memory as well as on-heap memory usage in various places:

  1. Spark UI's executor page will display both on-heap and off-heap memory usage.
  2. REST request returns both on-heap and off-heap memory.
  3. Also these two memory usage can be obtained programmatically from SparkListener.
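
Regarding point 3, a hedged sketch of observing cache usage programmatically via a SparkListener (the listener below reacts to storage block updates; it is one possible approach, not the only one):

import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

// Logs every storage block update with its memory and disk footprint.
class CacheLogger extends SparkListener {
  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = {
    val info = event.blockUpdatedInfo
    println(s"block=${info.blockId} mem=${info.memSize} B disk=${info.diskSize} B")
  }
}

sc.addSparkListener(new CacheLogger)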


Source: https://stackoverflow.com/questions/35008123/how-to-find-spark-rdd-dataframe-size
