How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?


Question


I'm terribly new to Spark, Hive, big data, Scala, and all of that. I'm trying to write a simple function that takes an SQLContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run a "s/\001/,/g" over it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that it reads \001 as a single character and not as something like an escaped 0, 0, and 1. Should I perhaps use a HiveContext or something?


Answer 1:


If you check the GitHub page, spark-csv has a delimiter parameter (as you also noted). Use it like this:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")



Answer 2:


With Spark 2.x and the CSV API, use the sep option:

val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")


Source: https://stackoverflow.com/questions/36007686/how-to-parse-a-csv-that-uses-a-i-e-001-as-the-delimiter-with-spark-csv
