Question
I am trying to read a CSV file using Spark 1.6. The file looks like this:
s.no|Name$id|designation|salry
1 |abc$12 |xxx |yyy
val df = spark.read.format("csv")
.option("header","true")
.option("delimiter","|")
.load("path")
If I add "$" as a second delimiter, it throws an error saying only one delimiter is permitted.
Answer 1:
You can apply the operation after the DataFrame has been created by reading the source with the primary delimiter (I am referring to "|" as the primary delimiter for better understanding).
You can do something like the following, where sc is the SparkSession:
val inputDF = sc.read.option("inferSchema", "true")
.option("header", "true")
.option("delimiter", "|")
.csv("/path/to/your/file")
val modifiedDF = inputDF
.withColumn("Name", split(inputDF.col("Name$id"), "\\$")(0))
.withColumn("id", split(inputDF.col("Name$id"), "\\$")(1)).drop("Name$id")
modifiedDF.show(false) will give you the required output.
However, this might result in data getting wrongly split if a literal "$" appears inside the data and is mistaken for the delimiter, so take precautions in those scenarios.
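One hypothetical precaution (not from the original answer) is to check, before splitting, that each value yields exactly two parts, i.e. contains exactly one "$":

```scala
import org.apache.spark.sql.functions.{col, size, split}

// Rows where "$" is missing or appears more than once in "Name$id";
// these would be silently mis-split by the approach above.
val bad = inputDF.filter(size(split(col("Name$id"), "\\$")).notEqual(2))
bad.show(false)
```

If `bad` is non-empty, the single-"$" assumption does not hold for that file.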
There is one library (I don't remember its name, but it may be univocity) that gives you the option of treating multiple symbols, such as #@, as a single delimiter. You can search a little if your use case requires a multi-character delimiter for each and every column.
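As an alternative along the same lines, the whole line can be read as plain text and split on a regex that matches both delimiters. This is a sketch, assuming "$" never appears inside the data; the column names are taken from the sample header, and "path" is the path from the question:

```scala
import org.apache.spark.sql.functions.{col, split}

// Read each line into a single "value" column, then split on "|" or "$".
val raw  = spark.read.text("path")
val cols = split(col("value"), "[|$]")   // character class: "|" and "$" both split

val df = raw
  .select(
    cols.getItem(0).as("sno"),
    cols.getItem(1).as("name"),
    cols.getItem(2).as("id"),
    cols.getItem(3).as("designation"),
    cols.getItem(4).as("salary"))
  .where(col("sno").notEqual("s.no"))    // skip the header line
```

This handles every column in one pass, at the cost of losing header/schema inference.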
Answer 2:
May I ask why you are using Spark 1.6? Anyway, only one delimiter is allowed when reading the csv format.
If it is a specific column that you know holds values in the format name$id, maybe try running some logic on that column to get a df with two new columns.
Setting up the df:
val a = Seq("name$id")   // hypothetical sample input, chosen to match the output below
val df = sc.parallelize(a).toDF("nameid")
df: org.apache.spark.sql.DataFrame = [nameid: string]
Try something like this:
df.withColumn("name", substring_index(col("nameid"), "$", 1))
  .withColumn("id", substring_index(col("nameid"), "$", -1))
  .show
and the output:
+-------+----+---+
| nameid|name| id|
+-------+----+---+
|name$id|name| id|
+-------+----+---+
You can also drop the original column after that.
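Putting the substring_index steps and the drop together could look like this (a minimal sketch of the same approach):

```scala
import org.apache.spark.sql.functions.{col, substring_index}

val result = df
  .withColumn("name", substring_index(col("nameid"), "$", 1))   // part before "$"
  .withColumn("id",   substring_index(col("nameid"), "$", -1))  // part after "$"
  .drop("nameid")                                               // remove the combined column
```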
Hope this helped.
Source: https://stackoverflow.com/questions/61057630/how-to-read-a-csv-file-with-multiple-delimiter-in-spark