How to skip lines while reading a CSV file as a dataFrame using PySpark?

执念已碎 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems: the file starts with a header line and a blank row that need to be skipped, and every value is double-quoted and contains an embedded comma, so a naive split on "," breaks the fields apart.

5 Answers
  •  伪装坚强ぢ
    2020-12-11 17:10

    For your first problem, pair each line with its index using zipWithIndex and filter out the lines you don't want. For the second problem, strip the outer double quotes from each line and then split on the quoted separator '","', which avoids splitting on the commas inside the values.

    rdd = sc.textFile("myfile.csv")
    df = (rdd.zipWithIndex()
          # keep only lines after the header, the blank row,
          # and the column-name row (indices 0, 1, 2)
          .filter(lambda x: x[1] > 2)
          .map(lambda x: x[0])
          # strip the outer quotes, then split on the quoted separator
          .map(lambda x: x.strip('"').split('","'))
          .toDF(["Col1", "Col2"]))
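
    The strip/split parsing step can be checked in plain Python without a Spark cluster; a minimal sketch, using the sample lines from the question (the list literal here stands in for the RDD's contents):

```python
# The five lines of the sample file, as they would appear in the RDD.
lines = [
    'Header',
    'Blank Row',
    '"Col1","Col2"',
    '"1,200","1,456"',
    '"2,000","3,450"',
]

# Mirror the RDD pipeline: enumerate plays the role of zipWithIndex,
# indices > 2 drop the three junk lines, then strip the outer quotes
# and split on the quoted separator '","'.
rows = [line.strip('"').split('","')
        for i, line in enumerate(lines) if i > 2]

print(rows)  # [['1,200', '1,456'], ['2,000', '3,450']]
```

    Note that the embedded commas survive intact because the split happens only on the `","` sequence between fields, never inside a field.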
    

    That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks, which handles quoting for you.
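
    For a sense of what a quote-aware CSV parser does with this input, Python's standard csv module applies the same RFC-4180-style quoting rules that spark-csv implements at cluster scale. A sketch, assuming the file layout from the question (the string literal is illustrative):

```python
import csv
import io

# The sample file from the question, inlined for illustration.
raw = 'Header\nBlank Row\n"Col1","Col2"\n"1,200","1,456"\n"2,000","3,450"\n'

# A quote-aware reader keeps "1,200" as one field despite the comma.
reader = csv.reader(io.StringIO(raw))
rows = list(reader)

# The first two lines are junk, so slice them off before using the data.
header, data = rows[2], rows[3:]

print(header)  # ['Col1', 'Col2']
print(data)    # [['1,200', '1,456'], ['2,000', '3,450']]
```

    The leading junk lines still have to be dropped by hand here too: no quoting rule can tell a stray header from data, which is exactly why the zipWithIndex filter is needed on the Spark side.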
