Create Spark Dataset from a CSV file

Submitted by ぐ巨炮叔叔 on 2020-05-26 10:59:13

Question


I would like to create a Spark Dataset from a simple CSV file. Here are the contents of the CSV file:

name,state,number_of_people,coolness_index
trenton,nj,"10","4.5"
bedford,ny,"20","3.3"
patterson,nj,"30","2.2"
camden,nj,"40","8.8"

Here is the code to make the Dataset:

val location = "s3a://path_to_csv"

case class City(name: String, state: String, number_of_people: Long)

val cities = spark.read
  .option("header", "true")
  .option("charset", "UTF8")
  .option("delimiter",",")
  .csv(location)
  .as[City]

Here is the error message: "Cannot up cast number_of_people from string to bigint as it may truncate"

Databricks talks about creating Datasets, and this particular error message, in this blog post:

Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a byte, which holds a maximum value of 255) the Analyzer will emit an AnalysisException.

I am using the Long type, so I didn't expect to see this error message.
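
For context: without inferSchema or an explicit schema, Spark's CSV reader types every column as StringType, so the encoder is being asked to up-cast a string column to bigint. A quick way to confirm this (a sketch reusing the spark and location values from above):

spark.read
  .option("header", "true")
  .csv(location)
  .printSchema()

// root
//  |-- name: string (nullable = true)
//  |-- state: string (nullable = true)
//  |-- number_of_people: string (nullable = true)
//  |-- coolness_index: string (nullable = true)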


Answer 1:


Use schema inference:

val cities = spark.read
  .option("inferSchema", "true")
  ...
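
Spelled out with the options from the question, that read might look like this (a sketch; with inferSchema enabled the numeric columns are inferred as numeric types, and the resulting int-to-bigint up-cast is safe):

val cities = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .csv(location)
  .as[City]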

or provide schema:

val cities = spark.read
  .schema(StructType(Array(StructField("name", StringType), ...)))
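
In full, an explicit schema for this CSV might look like the following (a sketch; the column types are assumed from the sample data, and the imports shown are required):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType, DoubleType}

val citySchema = StructType(Array(
  StructField("name", StringType),
  StructField("state", StringType),
  StructField("number_of_people", LongType),
  StructField("coolness_index", DoubleType)
))

val cities = spark.read
  .option("header", "true")
  .schema(citySchema)
  .csv(location)
  .as[City]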

or cast:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

val cities = spark.read
  .option("header", "true")
  .csv(location)
  .withColumn("number_of_people", col("number_of_people").cast(LongType))
  .as[City]



Answer 2:


With your case class defined as case class City(name: String, state: String, number_of_people: Long), you just need one line:

private val cityEncoder = Seq(City("", "", 0)).toDS

then your code

val cities = spark.read
  .option("header", "true")
  .option("charset", "UTF8")
  .option("delimiter", ",")
  .csv(location)
  .as[City]

will just work.

The official reference is the Spark SQL programming guide: http://spark.apache.org/docs/latest/sql-programming-guide.html#overview
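
Note that toDS (and .as[City]) relies on Spark's implicit encoders being in scope. They are pre-imported in the spark-shell; in a standalone application you would bring them in yourself (a sketch, assuming a SparkSession named spark):

import spark.implicits._  // provides the implicit Encoder[City] used by .toDS and .as[City]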




Answer 3:


Input CSV file User.csv:

id,name,address
1,Arun,Indore
2,Shubham,Indore
3,Mukesh,Hariyana

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
    SparkSession sparkSession = new SparkSession(new SparkContext(sparkConf));

    // Read the CSV with the first row as column names; every column is typed as string
    Dataset<Row> dataset = sparkSession.read().option("header", "true")
            .csv("C:\\Users\\arun7.gupta\\Desktop\\Spark\\User.csv");

    dataset.show();
    sparkSession.close();
}

Output:
+---+-------+--------+
| id|   name| address|
+---+-------+--------+
|  1|   Arun|  Indore|
|  2|Shubham|  Indore|
|  3| Mukesh|Hariyana|
+---+-------+--------+


Source: https://stackoverflow.com/questions/39522411/create-spark-dataset-from-a-csv-file
