Question
I have a script to analyse BSON dumps; however, it only works with uncompressed files. I get an empty RDD when reading gzipped BSON files.
from json import JSONEncoder
from datetime import datetime

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'big_bson.gz'


class BsonEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)


def setup_spark_with_pymongo(app_name='App'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)
    return sc


def main():
    spark_context = setup_spark_with_pymongo('PysparkApp')
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())  # Raises ValueError("RDD is empty")


if __name__ == '__main__':
    main()
I am using mongo-java-driver-3.2.2.jar, mongo-hadoop-spark-1.5.2.jar, pymongo-3.2.2-py2.7-linux-x86_64 and pymongo_spark, along with spark-submit. The deployed Spark version is 1.6.1, with Hadoop 2.6.4.
I am aware that the library does not support splitting compressed BSON files; however, it should work with a single split. I have hundreds of compressed BSON files to analyse, and deflating each of them does not seem to be a viable option.
Any idea how should I proceed further? Thanks in advance!
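(Editorial note, not part of the original question: assuming single-split reads of compressed dumps do work, as the answer below reports, one way to handle many such files is to build one BSONFileRDD per dump and union them. This is only a hedged sketch; the paths and app name are hypothetical placeholders.)
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()

conf = SparkConf().setAppName('bson-batch')  # hypothetical app name
sc = SparkContext(conf=conf)

# Hypothetical list of HDFS paths to the compressed dumps.
paths = ['hdfs://1.1.1.1/dumps/part-%04d.bson.gz' % i for i in range(100)]

# One single-split RDD per file, merged into a single RDD for analysis.
rdds = [sc.BSONFileRDD(p) for p in paths]
combined = sc.union(rdds)
print(combined.count())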
Answer 1:
I've just tested in the same environment: mongo-hadoop-spark-1.5.2.jar, Spark 1.6.1 for Hadoop 2.6.4, and PyMongo 3.2.2. The source file is a compressed mongodump output, small enough for a single split (uncompressed collection size of 105 MB). Running through PySpark:
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()

conf = SparkConf().setAppName("pyspark-bson")
sc = SparkContext(conf=conf)

file_path = "/file/example_bson.gz"
rdd = sc.BSONFileRDD(file_path)
rdd.first()
It is able to read the compressed BSON file and returns the first document. Please make sure the input file is reachable and that it is a valid BSON dump.
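(As a sanity check outside Spark, a minimal sketch: PyMongo's bson module can iterate documents straight from a gzipped dump, which helps confirm the file is valid mongodump output before submitting the job. The local path below is a placeholder, not from the original post.)
import gzip
from bson import decode_file_iter

# Hypothetical local copy of one compressed dump.
with gzip.open('example_bson.gz', 'rb') as fh:
    for doc in decode_file_iter(fh):
        print(doc)  # first document of the dump
        break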
Source: https://stackoverflow.com/questions/36886704/pyspark-empty-rdd-on-reading-gzipped-bson-files