How to skip lines while reading a CSV file as a dataFrame using PySpark?

Asked by 执念已碎 on 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems.

5 Answers
  •  时光取名叫无心
    2020-12-11 17:01

    Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add filters as you like.

    import csv

    # Parse each partition's lines with csv.reader so that quoted fields
    # such as "1,200" are not split at their embedded commas, then filter
    # out the header, the blank row, and the column-name row.
    df = sc.textFile("test2.csv") \
           .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
           .filter(lambda row: len(row) >= 2 and row[0] != 'Col1') \
           .toDF(['Col1', 'Col2'])
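
    To see why quotechar matters here, the parsing and filtering logic can be checked with plain Python and no Spark at all; this is a minimal sketch where the sample lines mirror the file from the question:

    ```python
    import csv

    # Sample lines mimicking the file layout in the question: a header
    # line, a blank line, quoted column names, then quoted numbers that
    # contain embedded commas.
    lines = [
        'Header',
        '',
        '"Col1","Col2"',
        '"1,200","1,456"',
        '"2,000","3,450"',
    ]

    # quotechar='"' keeps "1,200" together as one field instead of
    # splitting it at the embedded comma.
    rows = csv.reader(lines, delimiter=',', quotechar='"')

    # Same filter as the Spark snippet: drop rows with fewer than two
    # fields (the header and the blank line) and the column-name row.
    data = [row for row in rows if len(row) >= 2 and row[0] != 'Col1']
    print(data)  # [['1,200', '1,456'], ['2,000', '3,450']]
    ```

    Only the two data rows survive the filter, with each quoted number preserved as a single field.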
    
