pyspark: Performance difference between spark.read.format("csv") and spark.read.csv

Question

Does anyone know the difference between spark.read.format("csv") and spark.read.csv?

Some say that spark.read.csv is an alias of spark.read.format("csv"), but I saw a difference between the two. I ran an experiment, executing each command below in a new pyspark session so that nothing was cached.

DF1 took 42 secs while DF2 took just 10 secs. The csv file is 60+ GB.

DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")

DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")

The reason I dug into this issue is that I need to union two dataframes after filtering and then write the result back to HDFS, and the write took an extremely long time (still writing after 16 hours).

Answer 1:

The two calls are functionally the same. The difference is in how you used them.

For DF1 you set the inferSchema option, which slows the read down. That explains why DF1 took longer than DF2, which does not set it.

inferSchema: automatically infers column types from the data. It requires one extra pass over the data and is false by default. See the Spark CSV data source documentation for details.
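To compare the two entry points fairly, both reads need the same options. The following is a minimal sketch, assuming an existing SparkSession passed in as spark. The helper name read_csv_equivalently and the DDL schema string are illustrative and do not come from the original post. It also shows the usual way to avoid the extra inference pass: supplying a schema up front.

```python
def read_csv_equivalently(spark, path, schema=None):
    """Build the same CSV read through both entry points.

    spark  : an existing SparkSession (assumed to be created elsewhere)
    path   : location of the CSV file
    schema : optional DDL string such as "id INT, name STRING";
             when given, Spark uses it instead of inferring column types
    """
    def configure(reader):
        reader = reader.option("header", "true")
        if schema is not None:
            # Explicit schema: no inference pass over the data is needed.
            return reader.schema(schema)
        # No schema supplied: ask Spark to infer types (one extra pass).
        return reader.option("inferSchema", "true")

    # Entry point 1: generic reader with the format named explicitly.
    df_format = configure(spark.read.format("csv")).load(path)
    # Entry point 2: the csv() shortcut with identical options.
    df_csv = configure(spark.read).csv(path)
    return df_format, df_csv
```

With identical options, the two DataFrames would be expected to take about the same time to create. Passing a schema that matches the file (the column names and types here are placeholders) should remove the inference cost from both.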

Source: https://stackoverflow.com/questions/56895707/pyspark-difference-performance-for-spark-read-formatcsv-vs-spark-read-csv

Tags

csv

pyspark
