How to skip lines while reading a CSV file as a dataFrame using PySpark?

执念已碎 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems.

5 Answers
  •  温柔的废话
    2020-12-11 16:52

    If the CSV file always has exactly two columns, this can be implemented in Scala:

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    val struct = StructType(
      StructField("firstCol", StringType, nullable = true) ::
      StructField("secondCol", StringType, nullable = true) :: Nil)
    
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", ",")
      .option("quote", "\"")
      .schema(struct)
      .load("myFile.csv")
    
    df.show(false)
    
    // Attach an increasing id and drop the first three rows (header, blank
    // row, column names). Note: monotonically_increasing_id() only guarantees
    // increasing ids, not consecutive ones, so this filter is reliable only
    // when the data sits in a single partition (as with one small file).
    val indexed = df.withColumn("index", monotonically_increasing_id())
    val filtered = indexed.filter(col("index") > 2).drop("index")
    
    filtered.show(false)
    

    Result is:

    +---------+---------+
    |firstCol |secondCol|
    +---------+---------+
    |Header   |null     |
    |Blank Row|null     |
    |Col1     |Col2     |
    |1,200    |1,456    |
    |2,000    |3,450    |
    +---------+---------+
    
    +--------+---------+
    |firstCol|secondCol|
    +--------+---------+
    |1,200   |1,456    |
    |2,000   |3,450    |
    +--------+---------+
    
