I have a CSV file that is structured this way:
Header
Blank Row
\"Col1\",\"Col2\"
\"1,200\",\"1,456\"
\"2,000\",\"3,450\"
I have two problems: the file starts with lines I need to skip before the actual header, and the values are quoted and contain commas, so a plain split on "," breaks them.
If the CSV file always has two columns, this can be implemented in Scala like this:
import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Read everything as plain strings; the real header row is handled below
val struct = StructType(
  StructField("firstCol", StringType, nullable = true) ::
  StructField("secondCol", StringType, nullable = true) :: Nil)

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .option("delimiter", ",")
  .option("quote", "\"")
  .schema(struct)
  .load("myFile.csv")

df.show(false)

// Tag each row with an increasing id, then drop the first three rows
// (the "Header" line, the blank line and the "Col1","Col2" line)
val indexed = df.withColumn("index", monotonicallyIncreasingId())
val filtered = indexed.filter(col("index") > 2).drop("index")
filtered.show(false)
The result is:
+---------+---------+
|firstCol |secondCol|
+---------+---------+
|Header |null |
|Blank Row|null |
|Col1 |Col2 |
|1,200 |1,456 |
|2,000 |3,450 |
+---------+---------+
And after filtering out the first three rows:
+--------+---------+
|firstCol|secondCol|
+--------+---------+
|1,200 |1,456 |
|2,000 |3,450 |
+--------+---------+
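Since the rest of the thread is in PySpark, here is a rough sketch of the same index-and-filter idea in Python. It assumes Spark 1.6 or later (so that monotonically_increasing_id is available in pyspark.sql.functions) and the same spark-csv package:

from pyspark.sql.functions import col, monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

struct = StructType([
    StructField("firstCol", StringType(), True),
    StructField("secondCol", StringType(), True)])

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("quote", '"')
      .schema(struct)
      .load("myFile.csv"))

# The ids only follow file order if the rows stay in one partition,
# which is fine for a small file like this one
indexed = df.withColumn("index", monotonically_increasing_id())
filtered = indexed.filter(col("index") > 2).drop("index")
filtered.show(truncate=False)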
Why don't you just try the DataFrameReader API from pyspark.sql? It is pretty easy. For this problem, I guess this single line would be good enough.
df = spark.read.csv("myFile.csv") # By default, quote char is " and separator is ','
With this API, you can also play around with a few other parameters, like header lines and ignoring leading and trailing whitespace. Here is the link: DataFrameReader API
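As a sketch of what that can look like with a few of those options turned on (the option names are from the Spark 2.x CSV reader; the file name is just the one from the question):

df = (spark.read
      .option("header", "true")                    # use the first row of the file as column names
      .option("ignoreLeadingWhiteSpace", "true")   # trim spaces before each value
      .option("ignoreTrailingWhiteSpace", "true")  # trim spaces after each value
      .option("quote", '"')                        # keep "1,200" together as one value
      .option("sep", ",")                          # field separator
      .csv("myFile.csv"))
df.show(truncate=False)

Note that header=true would take the very first line ("Header") as the column names here, so the junk lines at the top still have to be dropped some other way.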
The answer by Zlidime had the right idea. The working solution is this:
import csv
from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
    StructField("Col1", StringType(), True),
    StructField("Col2", StringType(), True)])

df = sc.textFile("file.csv") \
    .mapPartitions(lambda partition: csv.reader(
        # strip NUL characters, which make csv.reader choke, then parse with quotechar
        [line.replace('\0', '') for line in partition], delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
    .toDF(customSchema)
Try using csv.reader with the quotechar parameter. It will split the lines correctly. After that you can add whatever filters you like.
import csv

# csv.reader handles the embedded commas because of quotechar;
# the filter drops the "Col1","Col2" header row and any short or blank lines
df = sc.textFile("test2.csv") \
    .mapPartitions(lambda partition: csv.reader(partition, delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) >= 2 and line[0] != 'Col1') \
    .toDF(['Col1', 'Col2'])
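To see why quotechar matters here, a quick check with plain Python (no Spark) on one line from the example file:

import csv

# one data line from the question: two quoted values, each containing a comma
line = '"1,200","1,456"'

# csv.reader takes an iterable of lines; quotechar keeps "1,200" together
parsed = next(csv.reader([line], delimiter=',', quotechar='"'))
print(parsed)   # ['1,200', '1,456']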
For your first problem, just zip the lines in the RDD with zipWithIndex and filter out the lines you don't want. For the second problem, you could try to strip the first and the last double quote characters from the lines and then split the line on ",".
rdd = sc.textFile("myfile.csv")

df = (rdd.zipWithIndex()
      .filter(lambda x: x[1] > 2)                  # drop the first three lines
      .map(lambda x: x[0])                         # keep just the line text
      .map(lambda x: x.strip('"').split('","'))    # strip outer quotes, split on ","
      .toDF(["Col1", "Col2"]))
That said, if you're looking for a standard way to deal with CSV files in Spark, it's better to use the spark-csv package from Databricks.
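For example, something along these lines (a sketch; it assumes the spark-csv jar is on the classpath and a Spark 1.x sqlContext, and the extra lines at the top of this particular file would still need to be handled separately):

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        # would use the first line as the header
      .option("inferSchema", "true")   # let the package guess the column types
      .option("quote", '"')            # keep "1,200" together as one value
      .load("myFile.csv"))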