How to skip lines while reading a CSV file as a dataFrame using PySpark?

执念已碎 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems: the file starts with a header line and a blank row that need to be skipped, and every value is double-quoted and contains an embedded comma, so a naive split on "," breaks the fields apart.

5 Answers
  •  伪装坚强ぢ
    2020-12-11 17:10

    For your first problem, pair each line with its index using zipWithIndex and filter out the lines you don't want. For the second problem, strip the outer double quotes from each line and then split on the quoted separator '","', which avoids splitting on the commas inside the values.

    rdd = sc.textFile("myfile.csv")
    df = (rdd.zipWithIndex()
          # keep only lines after the header, the blank row,
          # and the column-name row (indices 0, 1, 2)
          .filter(lambda x: x[1] > 2)
          .map(lambda x: x[0])
          # strip the outer quotes, then split on the quoted separator
          .map(lambda x: x.strip('"').split('","'))
          .toDF(["Col1", "Col2"]))
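
    The strip/split parsing step can be checked in plain Python without a Spark cluster; a minimal sketch, using the sample lines from the question (the list literal here stands in for the RDD's contents):

```python
# The five lines of the sample file, as they would appear in the RDD.
lines = [
    'Header',
    'Blank Row',
    '"Col1","Col2"',
    '"1,200","1,456"',
    '"2,000","3,450"',
]

# Mirror the RDD pipeline: enumerate plays the role of zipWithIndex,
# indices > 2 drop the three junk lines, then strip the outer quotes
# and split on the quoted separator '","'.
rows = [line.strip('"').split('","')
        for i, line in enumerate(lines) if i > 2]

print(rows)  # [['1,200', '1,456'], ['2,000', '3,450']]
```

    Note that the embedded commas survive intact because the split happens only on the `","` sequence between fields, never inside a field.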
    

    That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks, which handles quoting for you.
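
    For a sense of what a quote-aware CSV parser does with this input, Python's standard csv module applies the same RFC-4180-style quoting rules that spark-csv implements at cluster scale. A sketch, assuming the file layout from the question (the string literal is illustrative):

```python
import csv
import io

# The sample file from the question, inlined for illustration.
raw = 'Header\nBlank Row\n"Col1","Col2"\n"1,200","1,456"\n"2,000","3,450"\n'

# A quote-aware reader keeps "1,200" as one field despite the comma.
reader = csv.reader(io.StringIO(raw))
rows = list(reader)

# The first two lines are junk, so slice them off before using the data.
header, data = rows[2], rows[3:]

print(header)  # ['Col1', 'Col2']
print(data)    # [['1,200', '1,456'], ['2,000', '3,450']]
```

    The leading junk lines still have to be dropped by hand here too: no quoting rule can tell a stray header from data, which is exactly why the zipWithIndex filter is needed on the Spark side.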
