Difference between sc.textFile and spark.read.text in Spark

Submitted by 与世无争的帅哥 on 2019-12-18 16:51:51

Question


I am trying to read a simple text file into a Spark RDD, and I see that there are two ways of doing so:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

textRDD1 = sc.textFile("hobbit.txt")           # via the SparkContext
textRDD2 = spark.read.text("hobbit.txt").rdd   # via the DataFrame reader, then down to an RDD

Then I look at the data and see that the two RDDs are structured differently:

textRDD1.take(5)

['The king beneath the mountain',
 'The king of carven stone',
 'The lord of silver fountain',
 'Shall come unto his own',
 'His throne shall be upholden']

textRDD2.take(5)

[Row(value='The king beneath the mountain'),
 Row(value='The king of carven stone'),
 Row(value='The lord of silver fountain'),
 Row(value='Shall come unto his own'),
 Row(value='His throne shall be upholden')]

Based on this, all subsequent processing has to be changed to reflect the presence of the 'value' field.
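
For example, a simple tokenization step (a hypothetical snippet, not from my actual code) would look slightly different for each RDD:

words1 = textRDD1.flatMap(lambda line: line.split())      # operate on plain strings directly
words2 = textRDD2.flatMap(lambda row: row.value.split())  # unwrap the Row first (row[0] also works)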

My questions are:

  • (a) What is the implication of using these two ways of reading a text file?
  • (b) Under what circumstances should we use which method?

Answer 1:


To answer (a),

sc.textFile(...) returns an RDD[String]:

textFile(String path, int minPartitions)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
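
In PySpark, the second argument is the optional minPartitions hint. A quick illustration (the value 4 is arbitrary):

rdd = sc.textFile("hobbit.txt", minPartitions=4)
print(rdd.getNumPartitions())  # typically at least 4; minPartitions is only a lower-bound hint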

spark.read.text(...) returns a Dataset[Row], i.e. a DataFrame:

text(String path)

Loads text files and returns a DataFrame whose schema starts with a string column named "value", followed by partitioned columns if there are any.
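
You can verify that schema yourself; a quick check, assuming the same hobbit.txt as in the question:

df = spark.read.text("hobbit.txt")
df.printSchema()
# root
#  |-- value: string (nullable = true)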

For (b), it really depends on your use case. Since you are trying to create an RDD here, you should go with sc.textFile. You can always convert a DataFrame to an RDD and vice versa.
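
A minimal sketch of that round trip, reusing textRDD1 and textRDD2 from the question:

# RDD of Rows -> RDD of plain strings
stringsRDD = textRDD2.map(lambda row: row.value)

# RDD of strings -> DataFrame: wrap each line in a 1-tuple so Spark can infer a schema
textDF = spark.createDataFrame(textRDD1.map(lambda line: (line,)), ["value"])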



Source: https://stackoverflow.com/questions/52665353/difference-between-sc-textfile-and-spark-read-text-in-spark
