I have a CSV file that is structured this way:
Header
Blank Row
\"Col1\",\"Col2\"
\"1,200\",\"1,456\"
\"2,000\",\"3,450\"
I have two problems: how do I skip the rows before the data, and how do I parse the quoted values that contain embedded commas?
For your first problem, zip the lines in the RDD with zipWithIndex
and filter out the lines you don't want.
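zipWithIndex pairs each element with its position, so the rows you want to drop always come out with the lowest indices. A quick illustration in the pyspark shell, where sc is already defined:

sc.parallelize(["Header", "", '"Col1","Col2"']).zipWithIndex().collect()
# [('Header', 0), ('', 1), ('"Col1","Col2"', 2)]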
For the second problem, you can strip the leading and trailing double-quote characters from each line and then split the line on '","', the quote-comma-quote sequence that separates the fields.
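Applied to one of your sample lines, that parsing step gives:

'"1,200","1,456"'.strip('"').split('","')
# ['1,200', '1,456']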
# In the pyspark shell, where sc and sqlContext are already defined:
rdd = sc.textFile("myfile.csv")
df = (rdd.zipWithIndex()                            # pair each line with its index
         .filter(lambda x: x[1] > 2)                # drop lines 0-2: title, blank row, column names
         .map(lambda x: x[0])                       # keep the line text, drop the index
         .map(lambda x: x.strip('"').split('","'))  # strip outer quotes, split on '","'
         .toDF(["Col1", "Col2"]))
That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks.
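A rough sketch of that approach, assuming the shell was launched with --packages com.databricks:spark-csv_2.10:1.5.0; note that spark-csv has no option to skip leading lines, so the extra title and blank rows would still need to be removed first:

# Reads a clean CSV whose first line holds the column names
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("myfile.csv"))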