How to read a CSV file with multiple delimiters in Spark

Submitted by 本小妞迷上赌 on 2021-01-29 06:00:07

Question


I am trying to read a CSV file using Spark 1.6:

s.no|Name$id|designation|salry
1   |abc$12 |xxx        |yyy
val df = spark.read.format("csv")
  .option("header","true")
  .option("delimiter","|")
  .load("path")

If I also add $ as a delimiter, it throws an error saying only one delimiter is permitted.


Answer 1:


You can apply the operation after the DataFrame has been created by reading the file from the source with the primary delimiter (I am referring to "|" as the primary delimiter for clarity).

You can do something like below:

Here, sc is the SparkSession:

import org.apache.spark.sql.functions.split

// Read with "|" as the primary delimiter
val inputDF = sc.read.option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/path/to/your/file")

// Split the combined "Name$id" column on "$" into two columns, then drop the original
val modifiedDF = inputDF
  .withColumn("Name", split(inputDF.col("Name$id"), "\\$")(0))
  .withColumn("id", split(inputDF.col("Name$id"), "\\$")(1))
  .drop("Name$id")

modifiedDF.show(false) will give you the required output
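For illustration, with the single sample row from the question (and ignoring the padding whitespace in the raw values, which you may want to trim), the output would look something like:

+----+-----------+-----+----+---+
|s.no|designation|salry|Name|id |
+----+-----------+-----+----+---+
|1   |xxx        |yyy  |abc |12 |
+----+-----------+-----+----+---+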

This might, however, split the data incorrectly if a literal "$" appears inside a value and gets mistaken for the delimiter, so take precautions in those scenarios.
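One possible safeguard (a sketch of my own, not part of the original answer) is to split only at the first "$" using regexp_extract, so any later "$" characters stay inside the id value; the column names follow the question's sample:

import org.apache.spark.sql.functions.regexp_extract

// Group 1 = everything before the first "$", group 2 = everything after it
val pattern = "^([^$]*)\\$(.*)$"
val safeDF = inputDF
  .withColumn("Name", regexp_extract(inputDF.col("Name$id"), pattern, 1))
  .withColumn("id", regexp_extract(inputDF.col("Name$id"), pattern, 2))
  .drop("Name$id")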

There is one library, I don't remember its name but it could be univocity, which gives you the option of treating a multi-character sequence such as #@ as a single delimiter. You can google a little if your use case needs multiple delimiters for each and every column.
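If you would rather not pull in another parser, a simple workaround (my own sketch, assuming Spark 2.2+ for csv(Dataset[String]); not part of the original answer) is to read the file as plain text, replace the secondary delimiter with the primary one, and then parse the result as an ordinary single-delimiter CSV. The same caveat about literal "$" characters in the data applies here too:

import sc.implicits._   // sc is the SparkSession, as above

// Normalize "$" to "|" in the raw lines, then parse with a single delimiter.
// The header "s.no|Name$id|designation|salry" becomes "s.no|Name|id|designation|salry".
val normalized = sc.read.textFile("/path/to/your/file")
  .map(_.replace("$", "|"))

val df = sc.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv(normalized)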




Answer 2:


Might I ask why you are using Spark 1.6? Anyway, only one delimiter is allowed when reading the CSV format.

If it's a specific column which you know has values in the format name$id, maybe try running some logic on that column to get a df with 2 new columns.

Setting up the df (assuming spark.implicits._, or sqlContext.implicits._ on 1.6, is in scope for toDF):

val a = Seq("name$id")  // sample data matching the output below
val df = sc.parallelize(a).toDF("nameid")
df: org.apache.spark.sql.DataFrame = [nameid: string]

try something like this:

import org.apache.spark.sql.functions.{col, substring_index}

df.withColumn("name", substring_index(col("nameid"), "$", 1)).withColumn("id", substring_index(col("nameid"), "$", -1)).show

and the output

+-------+----+---+
| nameid|name| id|
+-------+----+---+
|name$id|name| id|
+-------+----+---+

You can also drop the original column after that.
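For example, chaining it onto the same expression (same column names as above):

val result = df
  .withColumn("name", substring_index(col("nameid"), "$", 1))
  .withColumn("id", substring_index(col("nameid"), "$", -1))
  .drop("nameid")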

hope this helped



Source: https://stackoverflow.com/questions/61057630/how-to-read-a-csv-file-with-multiple-delimiter-in-spark
