How to build Spark data frame with filtered records from MongoDB?


Question


My application is built on MongoDB. One collection holds a massive volume of data, so I have opted for Apache Spark to retrieve it and compute analytical data from it. I have configured the MongoDB Spark Connector to communicate with MongoDB. I need to query the MongoDB collection using pyspark and build a DataFrame from the result set of that query. Please suggest an appropriate solution.


Answer 1:


You can load the data directly into a dataframe like so:

# Create the DataFrame from the MongoDB collection
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/mydb.mycoll") \
    .load()

# Filter the data via the DataFrame API
over_thirty = df.filter(df["age"] > 30)

# Filter via SQL
df.registerTempTable("people")
over_thirty = sqlContext.sql("SELECT name, age FROM people WHERE age > 30")

For more information see the Mongo Spark connector Python API section or the introduction.py. The SQL queries are translated and passed back to the connector so that the data can be queried in MongoDB before being sent to the Spark cluster.
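
To confirm that a predicate is actually being pushed down to MongoDB rather than evaluated in Spark, you can inspect the physical plan. This is a minimal sketch using the standard DataFrame explain API; the exact "PushedFilters" wording depends on your Spark and connector versions:

# Check which predicates Spark reports as pushed down to the data source
over_thirty = df.filter(df["age"] > 30)
over_thirty.explain(True)  # look for a "PushedFilters" entry in the plan output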

You can also provide your own aggregation pipeline to apply to the collection before returning results into Spark:

dfr = sqlContext.read.option("pipeline", "[{ $match: { name: { $exists: true } } }]")
df = dfr.option("uri", ...).format("com.mongodb.spark.sql.DefaultSource").load()



Answer 2:


In my case filtering did not give the expected performance, because all of the filtering happened in Spark rather than in MongoDB. To improve performance, I had to pass a manual aggregation pipeline when loading the data. This can be a bit difficult to find, since the official documentation only explains how to do it with RDDs.

After a lot of trial and error I managed to do this with Scala DataFrames:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.DataFrame

val pipeLine = "[{ $match: { 'data.account.status': 'ACTIVE', " +
  "'data.account.activationDate': { $gte: '2020-10-11', $lte: '2020-10-13' } } }]"

val readConfig: ReadConfig = ReadConfig(
  Map(
    "uri" -> getMongoURI(),
    "database" -> dataBaseName,
    "collection" -> collection,
    "pipeline" -> pipeLine
  )
)

// This one took 260 seconds
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.count()

The alternative, using filters and no pipeline, fetches all the data into Spark. This should not be the case, but I presume it has to do with the query used.

val readConfig: ReadConfig = ReadConfig(
  Map(
    "uri" -> getMongoURI(),
    "database" -> dataBaseName,
    "collection" -> collection
  )
)

// This one took 560 seconds
val df: DataFrame = MongoSpark.load(sparkSession, readConfig)
df.filter("data.account.status == 'ACTIVE' AND " +
  "data.account.activationDate >= '2020-05-13' AND data.account.activationDate <= '2021-06-05'"
).count()

I ran some tests fetching 400K documents from a local MongoDB instance holding 1.4M documents in total:

  1. (RDD + pipeline), as per official documentation: 144 seconds
  2. (DF + pipeline), as per example above: 260 seconds
  3. (DF with filters), as per example above: 560 seconds
  4. (RDD + pipeline).toDF: 736 seconds

We finally went with the second option because of some other higher-level benefits of working with DataFrames over RDDs.

Finally, don't forget to create the correct indexes in MongoDB!
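
As a minimal sketch (the field names come from the $match pipeline above; the database and collection names are placeholders you would replace with your own), a compound index supporting that query could be created with PyMongo:

# Hypothetical index on the fields used in the $match above; adjust db/collection names
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://127.0.0.1")
client["mydb"]["mycoll"].create_index(
    [("data.account.status", ASCENDING), ("data.account.activationDate", ASCENDING)]
)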

Edit: I am using spark-sql 2.3.1, mongo-spark-connector 2.3.2 and mongo-java-driver 3.12.3.



Source: https://stackoverflow.com/questions/38847202/how-to-build-spark-data-frame-with-filtered-records-from-mongodb
