How to skip lines while reading a CSV file as a DataFrame using PySpark?


I have a CSV file that is structured this way:

Header
Blank Row
\"Col1\",\"Col2\"
\"1,200\",\"1,456\"
\"2,000\",\"3,450\"

I have two problems: I need to skip the first two lines (the extra header and the blank row), and the quoted values contain commas ("1,200"), so a naive split on "," breaks them into extra columns.

5 Answers
  • 2020-12-11 16:52

    If the CSV file always has two columns, it can be done like this in Scala:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val struct = StructType(
      StructField("firstCol", StringType, nullable = true) ::
      StructField("secondCol", StringType, nullable = true) :: Nil)
    
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", ",")
      .option("quote", "\"")
      .schema(struct)
      .load("myFile.csv")
    
    df.show(false)
    
    val indexed = df.withColumn("index", monotonicallyIncreasingId())
    val filtered = indexed.filter(col("index") > 2).drop("index")
    
    filtered.show(false)
    

    The result is:

    +---------+---------+
    |firstCol |secondCol|
    +---------+---------+
    |Header   |null     |
    |Blank Row|null     |
    |Col1     |Col2     |
    |1,200    |1,456    |
    |2,000    |3,450    |
    +---------+---------+
    
    +--------+---------+
    |firstCol|secondCol|
    +--------+---------+
    |1,200   |1,456    |
    |2,000   |3,450    |
    +--------+---------+
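
    For reference, here is a rough PySpark sketch of the same idea (untested; it assumes Spark 2.x, where the built-in CSV reader and monotonically_increasing_id are available):

    from pyspark.sql.functions import col, monotonically_increasing_id
    from pyspark.sql.types import StructType, StructField, StringType

    struct = StructType([
        StructField("firstCol", StringType(), True),
        StructField("secondCol", StringType(), True)])

    df = (spark.read
        .option("header", "false")
        .option("quote", '"')
        .schema(struct)
        .csv("myFile.csv"))

    # Tag rows with an increasing id and drop the first three (Header, Blank Row, Col1/Col2).
    # Note: the ids are only guaranteed to increase, so this is reliable only while the
    # whole file sits in a single partition, as in the Scala version above.
    indexed = df.withColumn("index", monotonically_increasing_id())
    filtered = indexed.filter(col("index") > 2).drop("index")
    filtered.show(truncate=False)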
    
  • 2020-12-11 16:58

    Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.

    df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','
    

    With this API you can also play around with a few other parameters, such as header lines and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
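
    For example, a minimal sketch of those options (these are standard DataFrameReader CSV options; it only shows the knobs and does not by itself skip the junk lines at the top of the file):

    df = (spark.read
        .option("header", "true")                    # use the first parsed line as column names
        .option("sep", ",")                          # field separator (default ',')
        .option("quote", '"')                        # quote character (default '"')
        .option("ignoreLeadingWhiteSpace", "true")   # trim whitespace before each value
        .option("ignoreTrailingWhiteSpace", "true")  # trim whitespace after each value
        .csv("myFile.csv"))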

  • 2020-12-11 16:59

    The answer by Zlidime had the right idea. The working solution is this:

    import csv
    from pyspark.sql.types import StructType, StructField, StringType

    customSchema = StructType([
        StructField("Col1", StringType(), True),
        StructField("Col2", StringType(), True)])

    df = sc.textFile("file.csv") \
        .mapPartitions(lambda partition: csv.reader(
            # strip NUL characters, then let csv.reader handle the quoted commas
            [line.replace('\0', '') for line in partition],
            delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(customSchema)
    
  • 2020-12-11 17:01

    Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add whatever filters you like.

    import csv

    # csv.reader handles the commas inside the quoted values
    df = sc.textFile("test2.csv") \
        .mapPartitions(lambda partition: csv.reader(
            partition, delimiter=',', quotechar='"')) \
        .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
        .toDF(['Col1', 'Col2'])
    
  • 2020-12-11 17:10

    For your first problem, just zip the lines in the RDD with zipWithIndex and filter out the lines you don't want. For the second problem, you could strip the leading and trailing double-quote characters from each line and then split it on '","'.

    rdd = sc.textFile("myfile.csv")
    df = (rdd.zipWithIndex()
        .filter(lambda x: x[1] > 2)                    # drop the first three lines
        .map(lambda x: x[0])                           # keep the text, drop the index
        .map(lambda x: x.strip('"').split('","'))      # strip outer quotes, split on ","
        .toDF(["Col1", "Col2"]))
    

    That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks.
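
    For example, a minimal sketch of reading the same file with spark-csv (assuming the package is available, e.g. via --packages com.databricks:spark-csv_2.10:1.5.0; you would still need to filter out the junk lines afterwards):

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "false")   # the real header is buried below the junk lines
        .option("quote", '"')        # quoted fields keep their embedded commas
        .load("myfile.csv"))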
