How to skip lines while reading a CSV file as a DataFrame using PySpark?

Asked 2020-12-11 16:35 by 执念已碎 · 5 answers · 917 views

I have a CSV file that is structured this way:

Header
Blank Row
\"Col1\",\"Col2\"
\"1,200\",\"1,456\"
\"2,000\",\"3,450\"

I have two problems: the first two lines (the header text and the blank row) need to be skipped, and the values themselves contain commas inside the quotes, so they must not be split into separate columns.

5 Answers
  •  悲哀的现实
     2020-12-11 16:58

    Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.

    df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','
    

    With this API you can also play with a few other parameters, such as the header line and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
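
    If the two extra lines at the top of the file still get in the way, one workaround (just a sketch, assuming the file is named myFile.csv and that exactly the first two lines should be dropped) is to read the raw lines as an RDD, filter out the first two, and hand the result to the CSV reader, which also accepts an RDD of strings:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("skip-csv-lines").getOrCreate()

    # Read the raw text lines, pair each line with its index, and drop the
    # first two lines (the title line and the blank row).
    lines = spark.sparkContext.textFile("myFile.csv")
    cleaned = (lines.zipWithIndex()
                    .filter(lambda pair: pair[1] > 1)
                    .map(lambda pair: pair[0]))

    # The CSV reader also accepts an RDD of strings; quoted values such as
    # "1,200" stay intact because the default quote char is ".
    df = spark.read.csv(cleaned, header=True)
    df.show()

    Because zipWithIndex preserves the original line order, only the leading lines are removed, and the remaining rows are parsed with the normal quote and separator handling.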
