Question
I've been using the MongoDB Connector for Spark to load DataFrames from MongoDB collections.
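For reference, the collection loads look roughly like this; a minimal sketch, assuming a local mongod and a somedb.people collection (the URI, database and collection names are placeholders):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

// Placeholder connection string; spark.mongodb.input.uri tells the connector which collection to read.
val spark = SparkSession.builder()
  .appName("gridfs-etl")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/somedb.people")
  .getOrCreate()

// MongoSpark.load(SparkSession) returns a DataFrame with an inferred schema.
val people = MongoSpark.load(spark)
people.printSchema()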
I'd like to move more of my ETL process into Spark and want to get 1-2 GB files into Spark from a Java service that does the basic file ingestion and parsing. Since I've already got a MongoDB cluster, it'd be easy to drop JSON-line format data into GridFS, and I'd rather not set up a cluster filesystem or HDFS just for this.
The Mongo Spark connector knows nothing of GridFS. The MongoDB Connector for Hadoop does have a GridFSInputFormat, documented in a JIRA comment.
I see the old SparkContext class has a newAPIHadoopFile() method that takes an InputFormat to build an RDD, but I thought SparkSession was the new hotness.
Is it possible to have Spark load a DataFrame from a Hadoop InputFormat like the GridFSInputFormat? I want to read a JSON-lines file from GridFS, infer the schema, and end up with a Dataset[Row]. And is there anything glaringly insane with this approach?
Answer 1:
No big deal in the end. I added the Mongo Hadoop connector:
libraryDependencies += "org.mongodb.mongo-hadoop" % "mongo-hadoop-core" % "2.0.2"
And used it to get an RDD[(NullWritable, Text)], which converts easily to RDD[String] with a call to map, and then to a DataFrame with sparkSession.read.json:
import com.mongodb.hadoop.GridFSInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{BinaryComparable, NullWritable}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

/** Loads a DataFrame from a MongoDB GridFS file in JSON-lines format */
def loadJsonLinesFromGridFSFile(gridFsId: String): DataFrame = {
  jsonLinesToDataFrame(loadRDDFromGridFSFile(gridFsId))
}

/** Uses the Mongo Hadoop plugin to load an RDD of lines from a GridFS file */
private def loadRDDFromGridFSFile(gridFsId: String): RDD[String] = {
  val conf = new Configuration()
  // Point the input format at the GridFS bucket and select the single file by its ObjectId.
  conf.set("mongo.input.uri", "mongodb://127.0.0.1/somedb.fs")
  conf.set("mongo.input.format", classOf[GridFSInputFormat].getName)
  conf.set("mongo.input.query", s"{ _id: { $$oid: '$gridFsId' } }")
  sparkSession.sparkContext
    .newAPIHadoopRDD(conf, classOf[GridFSInputFormat], classOf[NullWritable], classOf[BinaryComparable])
    .map(_._2.toString)  // keep only the line text, dropping the NullWritable key
}

private def jsonLinesToDataFrame(rdd: RDD[String]): DataFrame = {
  sparkSession.read.json(rdd)
}
Source: https://stackoverflow.com/questions/44091351/loading-a-spark-2-x-dataframe-from-mongodb-gridfs