Datasets in Apache Spark


As @user9718686 wrote, you have different types for the id field: String in your JSON file and Long in your class definition. When you read the file into a Dataset&lt;Row&gt;, Spark infers the schema from the file and detects that id is a String, which is why printing it worked (as you tried in one of your comments). If you want the dataframe as a Dataset&lt;Tweet&gt;, you either have to change your JSON files to use numeric (long) ids instead of strings, or let Spark cast the id column before mapping the rows to Tweet objects:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Cast the String id column to long before mapping rows to the Tweet bean
Dataset<Row> rowDataset = sc.read().json("path");
Dataset<Tweet> tweetDataset = rowDataset
                .withColumn("id", rowDataset.col("id").cast(DataTypes.LongType))
                .as(Encoders.bean(Tweet.class));
tweetDataset.printSchema();
System.out.println(tweetDataset.head().getId());

It gives you an error because of the type mismatch:

  • The Tweet class defines the id field as Long.
  • Your data has id as a String.

You have to either cast the input (as shown above) or adjust the class definition.
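
If you prefer adjusting the class instead, here is a minimal sketch of a Tweet bean with id declared as String so it matches the inferred schema; the class body and the extra text field are assumptions, since the original Tweet class was not shown:

import java.io.Serializable;

// Hypothetical Tweet bean: id is a String, matching the JSON schema Spark infers
public class Tweet implements Serializable {
    private String id;   // String instead of Long, so no cast is needed
    private String text; // illustrative extra field, not from the question

    public Tweet() {}    // no-arg constructor required by Encoders.bean

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }
}

With a String id, rowDataset.as(Encoders.bean(Tweet.class)) works directly, without the withColumn cast.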
