How to use double pipe as delimiter in CSV?

Submitted by 落爺英雄遲暮 on 2020-08-22 05:45:40

Question


Spark 1.5 and Scala 2.10.6

I have a data file that uses "¦¦" as the delimiter, and I am having a hard time parsing it into a DataFrame. Can a multi-character delimiter be used to create a DataFrame? The code works with a single broken pipe (¦) but not with two.

My Code:

import org.apache.spark.sql.types._

val customSchema_1 = StructType(Array(
    StructField("ID", StringType, true), 
    StructField("FILLER", StringType, true), 
    StructField("CODE", StringType, true)))

val df_1 = sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(customSchema_1)
    .option("delimiter", "¦¦")
    .load("example.txt")

Sample file:

12345¦¦  ¦¦10

Answer 1:


So the actual error being emitted here is:

java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦

The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader as well; it has the same requirement.

Given all of this, if your data is simple enough that you won't have entries containing ¦¦, I would load your data like so:

scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
    StructField("ID", StringType, true), 
    StructField("FILLER", StringType, true), 
    StructField("CODE", StringType, true)));

// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))

scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34

scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]

scala> df.show
+-----+------+----+
|   ID|FILLER|CODE|
+-----+------+----+
|12345|      |  10|
+-----+------+----+
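
One caveat: Scala's String.split interprets its argument as a regular expression and drops trailing empty fields by default. The broken pipe ¦ happens to carry no special meaning in a regex, so the call above works, but a more defensive sketch (assuming the same rawData and customSchema_1 as above) would quote the delimiter and keep trailing empties:

import java.util.regex.Pattern

// Pattern.quote escapes any regex metacharacters in the delimiter;
// the -1 limit keeps trailing empty fields (e.g. a line ending in ¦¦).
val safeRowRDD = rawData.map(line => Row.fromSeq(line.split(Pattern.quote("¦¦"), -1)))
val safeDF = sqlContext.createDataFrame(safeRowRDD, customSchema_1)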



Answer 2:


I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used tab because I have commas in my data.

val df = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  // Re-delimit each line with tabs, then feed the resulting
  // Dataset[String] to the built-in CSV reader.
  .csv(spark.sqlContext.read.textFile("filename")
      .map(line => line.split("\\|\\|").mkString("\t")))
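
If the data can itself contain tab characters, a minimal variation of the same idea (a sketch assuming the same input file) is to re-delimit with a character that should never occur in the data, such as the control character \u0001:

val df2 = spark.sqlContext.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\u0001")  // a control character is unlikely to collide with real data
  .csv(spark.sqlContext.read.textFile("filename")
      .map(line => line.split("\\|\\|", -1).mkString("\u0001")))  // -1 keeps trailing empty columns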

Hope this helps someone else.




Answer 3:


We read data with a custom delimiter and assigned custom column names to the DataFrame in the following way:

# Hold the new column names separately
headers = "JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run"

# '^!~_*>' is the field delimiter, so split the header string on it
head = headers.split("^!~_*>")

# The command below splits each line of the S3 file on the custom
# delimiter and converts the result into a DataFrame
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)

Passing head as a parameter to toDF() assigns the new column names to the DataFrame created from the text file with custom delimiters.
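
One thing worth noting: Python's str.split treats the delimiter as a literal string (unlike Scala's String.split, which interprets its argument as a regex), so even a delimiter full of special characters needs no escaping. A quick sketch with a hypothetical sample line:

# Hypothetical sample line using the same delimiter
line = "J123^!~_*>2016^!~_*>2016-01-05^!~_*>Paid^!~_*>10000"
print(line.split("^!~_*>"))   # ['J123', '2016', '2016-01-05', 'Paid', '10000']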

Hope this helps.




Answer 4:


Starting from Spark 3.0, support for multi-character delimiters has been added (SPARK-24540): https://issues.apache.org/jira/browse/SPARK-24540

The above solution proposed by @lockwobr works in Scala. If you are working with a version below Spark 3.0 and looking for a solution in PySpark, you can refer to the code below.

from pyspark.sql.types import StructType, StructField, StringType

ratings_schema = StructType([
      StructField("user_id", StringType(), False)
    , StructField("movie_id", StringType(), False)
    , StructField("rating", StringType(), False)
    , StructField("rating_timestamp", StringType(), True)
])

# movies_df = spark.read.csv("ratings.dat", header=False, sep="::", schema=ratings_schema)

movies_df = spark.createDataFrame(
    spark.read.text("ratings.dat").rdd.map(lambda line: line[0].split("::")),
    ratings_schema)

I have provided an example, but you can modify it for your logic.
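
For completeness, on Spark 3.0+ the commented-out line in the snippet above should work as-is, because the CSV reader there accepts a multi-character sep. A minimal sketch reusing the same ratings_schema (I have not run this on every 3.x version):

# On Spark 3.0+ (SPARK-24540), sep may be longer than one character
movies_df = spark.read.csv("ratings.dat", sep="::", schema=ratings_schema, header=False)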



Source: https://stackoverflow.com/questions/41268182/how-to-use-double-pipe-as-delimiter-in-csv
