How to skip lines while reading a CSV file as a dataFrame using PySpark?

Asked by 执念已碎 on 2020-12-11 16:35 · 5 answers

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems: the header line and the blank row need to be skipped, and the quoted values themselves contain commas.

5 Answers
  •  情书的邮戳
    2020-12-11 16:59

    The answer by Zlidime had the right idea. The working solution is this:

    import csv
    from pyspark.sql.types import StructType, StructField, StringType

    # Schema for the two string columns (values keep their embedded commas)
    customSchema = StructType([
        StructField("Col1", StringType(), True),
        StructField("Col2", StringType(), True)])

    df = sc.textFile("file.csv") \
        .mapPartitions(lambda partition: csv.reader(
            (line.replace('\0', '') for line in partition),
            delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(customSchema)
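    The core of this answer, parsing quoted fields that contain commas and filtering out the header, blank row, and column-name row, can be checked without Spark using just the standard-library csv module. This is a minimal sketch with a hypothetical in-memory sample matching the question's file:

    ```python
    import csv

    # Hypothetical sample mirroring the question's CSV layout.
    lines = [
        'Header',
        '',
        '"Col1","Col2"',
        '"1,200","1,456"',
        '"2,000","3,450"',
    ]

    rows = csv.reader(lines, delimiter=',', quotechar='"')
    # Keep rows with at least 2 fields whose first field is not the
    # column name, mirroring the filter in the Spark answer.
    data = [row for row in rows if len(row) >= 2 and row[0] != 'Col1']
    print(data)  # [['1,200', '1,456'], ['2,000', '3,450']]
    ```

    Note that the filter must be `len(row) >= 2`, not `len(row) > 2`: the data rows parse to exactly two fields, so a strict comparison would discard every row.
    
    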
    
