Question
I've been using the MongoDB Connector for Spark to load DataFrames from MongoDB collections.
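For reference, the collection loads look roughly like this; a minimal sketch, assuming a local mongod and a somedb.people collection (the URI, database and collection names are placeholders):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Placeholder connection string; spark.mongodb.input.uri tells the connector which collection to read.
val spark = SparkSession.builder()
  .appName("gridfs-etl")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/somedb.people")
  .getOrCreate()

// MongoSpark.load(SparkSession) returns a DataFrame with an inferred schema.
val people = MongoSpark.load(spark)
people.printSchema()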
I'd like to move more of my ETL process into Spark and want to get 1-2 GB files into Spark from a Java service that does the basic file ingestion and parsing. Since I've already got a MongoDB cluster, it'd be easy to drop JSON-line format data into GridFS, and I'd rather not set up a cluster filesystem or HDFS just for this.
The Mongo Spark connector knows nothing of GridFS. The MongoDB Connector for Hadoop does have a GridFSInputFormat, documented in a JIRA comment.
I see the old SparkContext class has a newAPIHadoopFile() method that takes an InputFormat to build an RDD, but I thought SparkSession was the new hotness.
Is it possible to have Spark load a DataFrame from a Hadoop InputFormat like the GridFSInputFormat? I want to read a JSON-lines file from GridFS, infer the schema, and end up with a Dataset[Row]. And is there anything glaringly insane with this approach?
Answer 1:
No big deal in the end. I added the Mongo Hadoop connector:
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "2.0.2"
And used it to get an RDD[(NullWritable, Text)], which converts easily to RDD[String] with a call to map, and then to a DataFrame with sparkSession.read.json:
import com.mongodb.hadoop.GridFSInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BinaryComparable, NullWritable}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

/** Loads a DataFrame from a MongoDB GridFS file in JSON-lines format */
def loadJsonLinesFromGridFSFile(gridFsId: String): DataFrame = {
  jsonLinesToDataFrame(loadRDDFromGridFSFile(gridFsId))
}

/** Uses the Mongo Hadoop plugin to load an RDD of lines from a GridFS file */
private def loadRDDFromGridFSFile(gridFsId: String): RDD[String] = {
  val conf = new Configuration()
  // Point the input format at the GridFS bucket and select the single file by its ObjectId.
  conf.set("mongo.input.uri", "mongodb://127.0.0.1/somedb.fs")
  conf.set("mongo.input.format", classOf[GridFSInputFormat].getName)
  conf.set("mongo.input.query", s"{ _id: { $$oid: '$gridFsId' } }")
  sparkSession.sparkContext
    .newAPIHadoopRDD(conf, classOf[GridFSInputFormat], classOf[NullWritable], classOf[BinaryComparable])
    .map(_._2.toString)  // keep only the line text, dropping the NullWritable key
}

private def jsonLinesToDataFrame(rdd: RDD[String]): DataFrame = {
  sparkSession.read.json(rdd)
}
Source: https://stackoverflow.com/questions/44091351/loading-a-spark-2-x-dataframe-from-mongodb-gridfs