Question
What is the best way to read a CSV-formatted result from a REST API directly into Spark?
Basically I have this, which I know I can process in Scala and save to a file, but I would like to process the data in Spark:
val resultCsv = scala.io.Source.fromURL(url).getLines()
Answer 1:
This is how it can be done.
For Spark 2.2.x
import scala.io.Source._
import org.apache.spark.sql.Dataset

// Fetch the CSV body and split it into lines
val res = fromURL(url).mkString.stripMargin.lines.toList

// Needed for the .toDS() conversion below
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()

// Spark 2.2 added csv(Dataset[String]), so the in-memory lines can be parsed directly
val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
frame.printSchema()
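To try the `csv(Dataset[String])` reader without a live REST endpoint, the same pipeline can be exercised on inline data. A minimal sketch, assuming a SparkSession named `spark` is already in scope (as it is in spark-shell); the hard-coded lines stand in for the REST response:

```scala
// Minimal sketch: inline CSV lines replace the body fetched from the REST API.
import spark.implicits._

val lines = Seq("id,name,score", "1,alice,9.5", "2,bob,7.0").toDS()

val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv(lines)

// With inferSchema enabled, id should come back as an integer type
// and score as a double rather than strings.
df.printSchema()
```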
Using the Databricks spark-csv library for older versions of Spark (it must be on the classpath, e.g. via --packages com.databricks:spark-csv_2.11:1.5.0):
import scala.io.Source._
import com.databricks.spark.csv.CsvParser
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Fetch the CSV body and split it into lines
val res = fromURL(url).mkString.stripMargin.lines.toList

// parallelize returns an RDD[String], not a Dataset
val rdd: RDD[String] = sc.parallelize(res)

val csvParser = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)

val frame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
frame.printSchema()
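For comparison, spark-csv is more commonly invoked through the DataFrameReader format hook when the data already sits on disk. A minimal sketch, assuming a SQLContext named `sqlContext` is in scope and the REST response has been saved to a file first (the path is illustrative):

```scala
// Minimal sketch: read a CSV file through the spark-csv data source
// instead of parsing an in-memory RDD with CsvParser.
val frame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/response.csv")

frame.printSchema()
```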
Note: I am new to Scala; any improvements would be appreciated.
ref: here
Source: https://stackoverflow.com/questions/44961433/process-csv-from-rest-api-into-spark