Can I read a CSV represented as a string into Apache Spark using spark-csv

Backend · unresolved · 3 answers · 1707 views
你的背包 2020-12-05 08:57

I know how to read a CSV file into Spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the CSV represented as a string and would like to convert that string directly into a DataFrame. Is this possible?

3 Answers
  •  悲哀的现实
    2020-12-05 09:30

    You can parse your string into CSV records using, e.g., scala-csv:

    val myCSVdata : Array[List[String]] = myCSVString.split('\n').flatMap(CSVParser.parseLine(_))

    Here you can do a bit more processing and data cleaning, verifying that every line parses well and has the same number of fields, etc.
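For instance, a minimal validation pass might look like the sketch below. It uses a naive `split(',')` instead of scala-csv so the snippet is self-contained (quoted fields containing commas are not handled), and the sample data is hypothetical:

```scala
// Sketch: split a CSV string into lines and verify that every row
// has the same number of fields as the first (header) row.
// Sample data is hypothetical; a naive split(',') replaces the
// scala-csv parser here, so quoted commas are NOT handled.
val myCSVString = "name,age\nAlice,29\nBob,31"

val rows: Array[List[String]] =
  myCSVString.split('\n').map(_.split(",", -1).toList)

val expectedFields = rows.head.length
val badRows = rows.filterNot(_.length == expectedFields)
require(badRows.isEmpty, s"Found ${badRows.length} malformed row(s)")
```

Failing fast here is cheaper than discovering malformed rows later, after the data has been distributed across the cluster.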

    You can then make this an RDD of records:

    import org.apache.spark.rdd.RDD

    val myCSVRDD: RDD[List[String]] = sparkContext.parallelize(myCSVdata)

    Here you can massage your lists of Strings into a case class to better reflect the fields of your CSV data. You can take some inspiration from the creation of the Person case class in this example:

    https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection

    I omit this step.
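For completeness, a minimal sketch of that mapping step, assuming two hypothetical columns `name` and `age` (mirroring the `Person` example from the linked guide), could look like:

```scala
// Sketch: convert each parsed List[String] row into a case class.
// The Person fields (name, age) are hypothetical; adapt them to
// your actual CSV columns.
case class Person(name: String, age: Int)

// Stand-in for the parsed CSV rows from the previous step.
val parsed: Array[List[String]] = Array(List("Alice", "29"), List("Bob", "31"))

val people: Array[Person] = parsed.map {
  case name :: age :: Nil => Person(name, age.trim.toInt)
  case other => throw new IllegalArgumentException(s"Unexpected row: $other")
}
```

Using a case class gives the resulting DataFrame named, typed columns instead of a single array-of-strings column.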

    You can then convert to a DataFrame:

    import spark.implicits._

    val myCSVDataframe = myCSVRDD.toDF()
