How to skip lines while reading a CSV file as a dataFrame using PySpark?

Asked by 执念已碎 on 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems.

5 Answers
  •  时光取名叫无心
    2020-12-11 17:01

    Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add filters as you like.

    import csv

    # Parse each partition's lines with csv.reader so that quoted fields
    # such as "1,200" are not split at their embedded commas, then filter
    # out the header, the blank row, and the column-name row.
    df = sc.textFile("test2.csv") \
           .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
           .filter(lambda row: len(row) >= 2 and row[0] != 'Col1') \
           .toDF(['Col1', 'Col2'])
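
    To see why quotechar matters here, the parsing and filtering logic can be checked with plain Python and no Spark at all; this is a minimal sketch where the sample lines mirror the file from the question:

    ```python
    import csv

    # Sample lines mimicking the file layout in the question: a header
    # line, a blank line, quoted column names, then quoted numbers that
    # contain embedded commas.
    lines = [
        'Header',
        '',
        '"Col1","Col2"',
        '"1,200","1,456"',
        '"2,000","3,450"',
    ]

    # quotechar='"' keeps "1,200" together as one field instead of
    # splitting it at the embedded comma.
    rows = csv.reader(lines, delimiter=',', quotechar='"')

    # Same filter as the Spark snippet: drop rows with fewer than two
    # fields (the header and the blank line) and the column-name row.
    data = [row for row in rows if len(row) >= 2 and row[0] != 'Col1']
    print(data)  # [['1,200', '1,456'], ['2,000', '3,450']]
    ```

    Only the two data rows survive the filter, with each quoted number preserved as a single field.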
    
