How to skip lines while reading a CSV file as a dataFrame using PySpark?

Asked by 执念已碎 on 2020-12-11 16:35 · 5 answers

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems: the header line and the blank row need to be skipped, and the quoted values themselves contain commas.

5 Answers
  •  情书的邮戳
    2020-12-11 16:59

    The answer by Zlidime had the right idea. The working solution is this:

    import csv
    from pyspark.sql.types import StructType, StructField, StringType

    # Schema for the two string columns (values keep their embedded commas)
    customSchema = StructType([
        StructField("Col1", StringType(), True),
        StructField("Col2", StringType(), True)])

    df = sc.textFile("file.csv") \
        .mapPartitions(lambda partition: csv.reader(
            (line.replace('\0', '') for line in partition),
            delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(customSchema)
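    The core of this answer, parsing quoted fields that contain commas and filtering out the header, blank row, and column-name row, can be checked without Spark using just the standard-library csv module. This is a minimal sketch with a hypothetical in-memory sample matching the question's file:

    ```python
    import csv

    # Hypothetical sample mirroring the question's CSV layout.
    lines = [
        'Header',
        '',
        '"Col1","Col2"',
        '"1,200","1,456"',
        '"2,000","3,450"',
    ]

    rows = csv.reader(lines, delimiter=',', quotechar='"')
    # Keep rows with at least 2 fields whose first field is not the
    # column name, mirroring the filter in the Spark answer.
    data = [row for row in rows if len(row) >= 2 and row[0] != 'Col1']
    print(data)  # [['1,200', '1,456'], ['2,000', '3,450']]
    ```

    Note that the filter must be `len(row) >= 2`, not `len(row) > 2`: the data rows parse to exactly two fields, so a strict comparison would discard every row.
    
    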
    
