How to skip lines while reading a CSV file as a DataFrame using PySpark?


I have a CSV file that is structured this way:

Header
Blank Row
\"Col1\",\"Col2\"
\"1,200\",\"1,456\"
\"2,000\",\"3,450\"

I have two problems: I need to skip the first two lines (the extra header and the blank row), and the quoted values contain commas ("1,200"), so a naive split on "," breaks them into extra columns.

5 Answers
  • 2020-12-11 16:52

    If the CSV file always has two columns, it can be done like this in Scala:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val struct = StructType(
      StructField("firstCol", StringType, nullable = true) ::
      StructField("secondCol", StringType, nullable = true) :: Nil)
    
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", ",")
      .option("quote", "\"")
      .schema(struct)
      .load("myFile.csv")
    
    df.show(false)
    
    val indexed = df.withColumn("index", monotonicallyIncreasingId())
    val filtered = indexed.filter(col("index") > 2).drop("index")
    
    filtered.show(false)
    

    The result is:

    +---------+---------+
    |firstCol |secondCol|
    +---------+---------+
    |Header   |null     |
    |Blank Row|null     |
    |Col1     |Col2     |
    |1,200    |1,456    |
    |2,000    |3,450    |
    +---------+---------+
    
    +--------+---------+
    |firstCol|secondCol|
    +--------+---------+
    |1,200   |1,456    |
    |2,000   |3,450    |
    +--------+---------+
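
    For reference, here is a rough PySpark sketch of the same idea (untested; it assumes Spark 2.x, where the built-in CSV reader and monotonically_increasing_id are available):

    from pyspark.sql.functions import col, monotonically_increasing_id
    from pyspark.sql.types import StructType, StructField, StringType

    struct = StructType([
        StructField("firstCol", StringType(), True),
        StructField("secondCol", StringType(), True)])

    df = (spark.read
        .option("header", "false")
        .option("quote", '"')
        .schema(struct)
        .csv("myFile.csv"))

    # Tag rows with an increasing id and drop the first three (Header, Blank Row, Col1/Col2).
    # Note: the ids are only guaranteed to increase, so this is reliable only while the
    # whole file sits in a single partition, as in the Scala version above.
    indexed = df.withColumn("index", monotonically_increasing_id())
    filtered = indexed.filter(col("index") > 2).drop("index")
    filtered.show(truncate=False)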
    
  • 2020-12-11 16:58

    Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.

    df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','
    

    With this API you can also play around with a few other parameters, such as header lines and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
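
    For example, a minimal sketch of those options (these are standard DataFrameReader CSV options; it only shows the knobs and does not by itself skip the junk lines at the top of the file):

    df = (spark.read
        .option("header", "true")                    # use the first parsed line as column names
        .option("sep", ",")                          # field separator (default ',')
        .option("quote", '"')                        # quote character (default '"')
        .option("ignoreLeadingWhiteSpace", "true")   # trim whitespace before each value
        .option("ignoreTrailingWhiteSpace", "true")  # trim whitespace after each value
        .csv("myFile.csv"))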

  • 2020-12-11 16:59

    The answer by Zlidime had the right idea. The working solution is this:

    import csv
    from pyspark.sql.types import StructType, StructField, StringType

    customSchema = StructType([
        StructField("Col1", StringType(), True),
        StructField("Col2", StringType(), True)])

    df = sc.textFile("file.csv") \
        .mapPartitions(lambda partition: csv.reader(
            # strip NUL characters, then let csv.reader handle the quoted commas
            [line.replace('\0', '') for line in partition],
            delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(customSchema)
    
  • 2020-12-11 17:01

    Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add whatever filters you like.

    import csv

    # csv.reader handles the commas inside the quoted values
    df = sc.textFile("test2.csv") \
        .mapPartitions(lambda partition: csv.reader(
            partition, delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(['Col1', 'Col2'])
    
  • 2020-12-11 17:10

    For your first problem, just zip the lines in the RDD with zipWithIndex and filter out the lines you don't want. For the second problem, you could strip the leading and trailing double-quote characters from each line and then split it on '","'.

    rdd = sc.textFile("myfile.csv")
    df = (rdd.zipWithIndex()
        .filter(lambda x: x[1] > 2)                    # drop the first three lines
        .map(lambda x: x[0])                           # keep the text, drop the index
        .map(lambda x: x.strip('"').split('","'))      # strip outer quotes, split on ","
        .toDF(["Col1", "Col2"]))
    

    That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks.
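
    For example, a minimal sketch of reading the same file with spark-csv (assuming the package is available, e.g. via --packages com.databricks:spark-csv_2.10:1.5.0; you would still need to filter out the junk lines afterwards):

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "false")   # the real header is buried below the junk lines
        .option("quote", '"')        # quoted fields keep their embedded commas
        .load("myfile.csv"))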
