How to skip lines while reading a CSV file as a dataFrame using PySpark?

执念已碎 2020-12-11 16:35

I have a CSV file that is structured this way:

Header
Blank Row
"Col1","Col2"
"1,200","1,456"
"2,000","3,450"

I have two problems.

5 Answers
  •  温柔的废话
    2020-12-11 16:52

    If the CSV file always has exactly two columns, this can be implemented in Scala:

    import org.apache.spark.sql.functions.{col, monotonically_increasing_id}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    val struct = StructType(
      StructField("firstCol", StringType, nullable = true) ::
      StructField("secondCol", StringType, nullable = true) :: Nil)
    
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", ",")
      .option("quote", "\"")
      .schema(struct)
      .load("myFile.csv")
    
    df.show(false)
    
    // Attach an increasing id and drop the first three rows (header, blank
    // row, column names). Note: monotonically_increasing_id() only guarantees
    // increasing ids, not consecutive ones, so this filter is reliable only
    // when the data sits in a single partition (as with one small file).
    val indexed = df.withColumn("index", monotonically_increasing_id())
    val filtered = indexed.filter(col("index") > 2).drop("index")
    
    filtered.show(false)
    

    Result is:

    +---------+---------+
    |firstCol |secondCol|
    +---------+---------+
    |Header   |null     |
    |Blank Row|null     |
    |Col1     |Col2     |
    |1,200    |1,456    |
    |2,000    |3,450    |
    +---------+---------+
    
    +--------+---------+
    |firstCol|secondCol|
    +--------+---------+
    |1,200   |1,456    |
    |2,000   |3,450    |
    +--------+---------+
    
