How to read a CSV file with multiple delimiters in Spark

Submitted by 本小妞迷上赌 on 2021-01-29 06:00:07

Question


I am trying to read a CSV file using Spark 1.6:

s.no|Name$id|designation|salry
1   |abc$12 |xxx        |yyy
val df = spark.read.format("csv")
  .option("header","true")
  .option("delimiter","|")
  .load("path")

If I also add $ as a delimiter, it throws an error saying only one delimiter is permitted.


Answer 1:


You can apply the operation after the DataFrame has been created by reading the file from the source with the primary delimiter (I am referring to "|" as the primary delimiter for clarity).

You can do something like below:

Here, sc is the SparkSession:

import org.apache.spark.sql.functions.split

// Read with "|" as the primary delimiter
val inputDF = sc.read.option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", "|")
  .csv("/path/to/your/file")

// Split the combined "Name$id" column on "$" into two columns, then drop the original
val modifiedDF = inputDF
  .withColumn("Name", split(inputDF.col("Name$id"), "\\$")(0))
  .withColumn("id", split(inputDF.col("Name$id"), "\\$")(1))
  .drop("Name$id")

modifiedDF.show(false) will give you the required output
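For illustration, with the single sample row from the question (and ignoring the padding whitespace in the raw values, which you may want to trim), the output would look something like:

+----+-----------+-----+----+---+
|s.no|designation|salry|Name|id |
+----+-----------+-----+----+---+
|1   |xxx        |yyy  |abc |12 |
+----+-----------+-----+----+---+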

This might, however, split the data incorrectly if a literal "$" appears inside a value and gets mistaken for the delimiter, so take precautions in those scenarios.
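One possible safeguard (a sketch of my own, not part of the original answer) is to split only at the first "$" using regexp_extract, so any later "$" characters stay inside the id value; the column names follow the question's sample:

import org.apache.spark.sql.functions.regexp_extract

// Group 1 = everything before the first "$", group 2 = everything after it
val pattern = "^([^$]*)\\$(.*)$"
val safeDF = inputDF
  .withColumn("Name", regexp_extract(inputDF.col("Name$id"), pattern, 1))
  .withColumn("id", regexp_extract(inputDF.col("Name$id"), pattern, 2))
  .drop("Name$id")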

There is one library, I don't remember its name but it could be univocity, which gives you the option of treating a multi-character sequence such as #@ as a single delimiter. You can google a little if your use case needs multiple delimiters for each and every column.
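If you would rather not pull in another parser, a simple workaround (my own sketch, assuming Spark 2.2+ for csv(Dataset[String]); not part of the original answer) is to read the file as plain text, replace the secondary delimiter with the primary one, and then parse the result as an ordinary single-delimiter CSV. The same caveat about literal "$" characters in the data applies here too:

import sc.implicits._   // sc is the SparkSession, as above

// Normalize "$" to "|" in the raw lines, then parse with a single delimiter.
// The header "s.no|Name$id|designation|salry" becomes "s.no|Name|id|designation|salry".
val normalized = sc.read.textFile("/path/to/your/file")
  .map(_.replace("$", "|"))

val df = sc.read
  .option("header", "true")
  .option("delimiter", "|")
  .csv(normalized)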




Answer 2:


Might I ask why you are using Spark 1.6? Anyway, only one delimiter is allowed when reading the CSV format.

If it's a specific column which you know has values in the format name$id, maybe try running some logic on that column to get a df with 2 new columns.

Setting up the df (assuming spark.implicits._, or sqlContext.implicits._ on 1.6, is in scope for toDF):

val a = Seq("name$id")  // sample data matching the output below
val df = sc.parallelize(a).toDF("nameid")
df: org.apache.spark.sql.DataFrame = [nameid: string]

try something like this:

import org.apache.spark.sql.functions.{col, substring_index}

df.withColumn("name", substring_index(col("nameid"), "$", 1)).withColumn("id", substring_index(col("nameid"), "$", -1)).show

and the output

+-------+----+---+
| nameid|name| id|
+-------+----+---+
|name$id|name| id|
+-------+----+---+

You can also drop the original column after that.
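For example, chaining it onto the same expression (same column names as above):

val result = df
  .withColumn("name", substring_index(col("nameid"), "$", 1))
  .withColumn("id", substring_index(col("nameid"), "$", -1))
  .drop("nameid")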

hope this helped



Source: https://stackoverflow.com/questions/61057630/how-to-read-a-csv-file-with-multiple-delimiter-in-spark
