RDD of BSONObject to a DataFrame

Submitted by 我怕爱的太早我们不能终老 on 2021-01-28 18:47:49

Question


I'm loading a bson dump from Mongo into Spark as described here. It works, but what I get is:

org.apache.spark.rdd.RDD[(Object, org.bson.BSONObject)]

It should basically be just JSON with all String fields. The rest of my code requires a DataFrame object to manipulate the data. But, of course, toDF fails on that RDD. How can I convert it to a Spark DataFrame with all fields as String? Something similar to spark.read.json would be great to have.


Answer 1:


val datapath = "path_to_bson_file.bson" 

import org.apache.hadoop.conf.Configuration

// Set up the configuration for reading from bson dump.
val bsonConfig = new Configuration()
bsonConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat")

// Assumes initSpark() is your own helper that returns a SparkSession
implicit lazy val sparkSession = initSpark()

// Read the dump as an RDD[(Object, org.bson.BSONObject)] and map each document to a JSON string
val bson_data_as_json_string = sparkSession.sparkContext.newAPIHadoopFile(datapath,
  classOf[com.mongodb.hadoop.BSONFileInputFormat].
    asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, org.bson.BSONObject]]),
  classOf[Object],
  classOf[org.bson.BSONObject],
  bsonConfig).
  map { case (_, bsonDoc) =>
    // Serialize the BSON document to a JSON string
    com.mongodb.util.JSON.serialize(bsonDoc)
  }

// Parse the JSON strings into a DataFrame; Spark infers the schema:
val bson_data_as_json_dataset = sparkSession.sqlContext.read.json(bson_data_as_json_string)
// Inspect the inferred schema:
bson_data_as_json_dataset.printSchema()
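
The question asks for all fields as String, while schema inference as used above will type numbers and booleans. A minimal sketch of one way to handle this, assuming Spark 2.2 or later (where read.json accepts a Dataset[String]): the JSON reader's primitivesAsString option reads every primitive value as a string. The object name, the helper toStringDataFrame, and the sample records are hypothetical and only stand in for the JSON strings produced from the BSON dump.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.StringType

object JsonStringsToDataFrame {
  // Parse JSON strings into a DataFrame whose primitive fields are all read as strings.
  def toStringDataFrame(spark: SparkSession, json: Dataset[String]): DataFrame =
    spark.read
      .option("primitivesAsString", "true") // numbers and booleans become StringType
      .json(json)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bson-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample standing in for the serialized BSON documents.
    val sample = Seq(
      """{"name": "a", "count": 1, "active": true}""",
      """{"name": "b", "count": 2.5, "active": false}"""
    ).toDS()

    val df = toStringDataFrame(spark, sample)
    df.printSchema()

    // Every top-level field in this flat sample is a string column.
    assert(df.schema.fields.forall(_.dataType == StringType))
    spark.stop()
  }
}
```

With the RDD from the answer above, the input could be built with sparkSession.createDataset(bson_data_as_json_string)(org.apache.spark.sql.Encoders.STRING). Note that the option only affects primitive values: nested documents still come back as struct columns (with string leaves), so fully flat string columns would need an extra step.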



Answer 2:


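Try the code below:

import com.mongodb.hadoop.BSONFileInputFormat
import org.bson.BSONObject

// Round-trip the document's string form through org.bson.Document to get a JSON string
def parseData(s: String): String = {
  val doc = org.bson.Document.parse(s)
  com.mongodb.util.JSON.serialize(doc)
}

// Read the BSON file, keep only the documents, and let Spark parse the JSON strings
val df = spark.read.json(
  spark.sparkContext.newAPIHadoopFile(
    "src//main//resources//MyDummyData",
    classOf[BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]),
    classOf[Object],
    classOf[BSONObject]
  ).map(x => x._2).map(x => parseData(x.toString)))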


Source: https://stackoverflow.com/questions/39851476/rdd-of-bsonobject-to-a-dataframe
