Difference between loading a CSV file into an RDD and a DataFrame in Spark

Submitted by 孤街浪徒 on 2021-01-29 08:19:58

Question


I am not sure whether this specific question has been asked before. It could be a duplicate, but I was not able to find a use case pertaining to this.

As we know, we can load a CSV file directly into a DataFrame, or we can load it into an RDD first and then convert that RDD to a DataFrame later.

RDD = sc.textFile("pathlocation")

We can apply map, filter, and other operations on this RDD and then convert it into a DataFrame, as sketched below.
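A minimal sketch of that pattern, assuming a comma-delimited file with two columns (the names id and name are made up for illustration):

# Split each line on commas, drop malformed records,
# then convert the RDD of tuples into a DataFrame.
parsed = RDD.map(lambda line: line.split(",")) \
            .filter(lambda fields: len(fields) >= 2) \
            .map(lambda fields: (fields[0], fields[1]))
df = parsed.toDF(["id", "name"])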

We can also create a DataFrame directly by reading a CSV file:

Dataframe = spark.read.format("csv").schema(schema).option("header","false").load("pathlocation")
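Note that schema here must be defined beforehand as a StructType. A minimal sketch, with hypothetical column names and types:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical two-column schema; adapt the fields to the actual CSV layout.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])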

My question is: what are the use cases where we have to load a file as an RDD first and convert it into a DataFrame?

I just know that textFile reads data line by line. In what scenarios should we choose the RDD method over the DataFrame?


Answer 1:


DataFrames / Datasets offer a huge performance improvement over RDDs because of two powerful features:

  1. Custom memory management (aka Project Tungsten)
    Data is stored in off-heap memory in a binary format, which saves a lot of memory space and avoids garbage-collection overhead. Because the schema of the data is known in advance and stored efficiently in binary form, expensive Java serialization is also avoided.

  2. Optimized execution plans (aka the Catalyst optimizer)
    Query plans are created for execution using the Spark Catalyst optimizer. After an optimized execution plan is prepared through several steps, the final execution happens internally on RDDs only, but that is completely hidden from the users. You can inspect the optimized plan yourself with explain(), as sketched after this list.
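A quick way to see Catalyst at work, reusing the question's file path (the column _c0 is just Spark's default name for the first CSV column when no header is used):

# Print the parsed, analyzed, and optimized logical plans plus the physical plan.
df = spark.read.format("csv").option("header", "false").load("pathlocation")
df.filter(df["_c0"] == "17850").explain(True)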

In general, you should never use RDDs unless you want to handle the low-level optimizations / serialization yourself.

Custom Partitioner implementation in PySpark, with RDDs:

import random

# Route two specific keys to partition 0; spread everything else
# randomly across partitions 1 and 2.
def partitionFunc(key):
    if key == 17850 or key == 12583:
        return 0
    else:
        return random.randint(1, 2)

# You can call the partitioner as below:
# Key each row by its seventh column (a customer ID in this example),
# repartition into 3 partitions with the custom function, then count
# the distinct keys that landed in each partition.
keyedRDD = rdd.keyBy(lambda row: row[6])
keyedRDD \
    .partitionBy(3, partitionFunc) \
    .map(lambda x: x[0]) \
    .glom() \
    .map(lambda x: len(set(x))) \
    .take(5)
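Each value in the result is the number of distinct keys in one partition, so with this function partition 0 holds at most the two hard-coded keys while the remaining keys are spread randomly across partitions 1 and 2.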



Answer 2:


Converting an RDD to a DataFrame is mostly not advised unless there is no API to load your data directly as a DataFrame.

This and this are two blog posts that answer your question in detail. Quoting from the former:

When to use RDDs? Consider these scenarios or common use cases for using RDDs when:

you want low-level transformation and actions and control on your dataset;

your data is unstructured, such as media streams or streams of text;

you want to manipulate your data with functional programming constructs rather than domain specific expressions;

you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;

and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
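As a small illustration of the first three points, here is a minimal word-count sketch that processes raw text with purely functional RDD constructs, never imposing a schema (the file path is reused from the question):

# Tokenize free-form text, lowercase each word, and count occurrences
# using only low-level transformations; no schema is involved.
lines = sc.textFile("pathlocation")
word_counts = (lines
    .flatMap(lambda line: line.split())
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b))
word_counts.take(5)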



Source: https://stackoverflow.com/questions/53535766/difference-between-loading-a-csv-file-into-rdd-and-dataframe-in-spark
