PySpark: Empty RDD on reading gzipped BSON files

走远了吗. Submitted on 2019-12-12 03:08:48

Question


I have a script to analyse BSON dumps, but it only works with uncompressed files. When I read gzipped BSON files I get an empty RDD.

from datetime import datetime
from json import JSONEncoder

from bson import ObjectId
from pyspark import SparkConf, SparkContext

pyspark_location = 'lib/pymongo_spark.py'
HDFS_HOME = 'hdfs://1.1.1.1/'
INPUT_FILE = 'big_bson.gz'


class BsonEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, ObjectId):
            return str(obj)
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return JSONEncoder.default(self, obj)


def setup_spark_with_pymongo(app_name='App'):
    conf = SparkConf().setAppName(app_name)
    sc = SparkContext(conf=conf)
    sc.addPyFile(pyspark_location)
    return sc


def main():
    spark_context = setup_spark_with_pymongo('PysparkApp')
    filename = HDFS_HOME + INPUT_FILE
    import pymongo_spark
    pymongo_spark.activate()
    rdd = spark_context.BSONFileRDD(filename)
    print(rdd.first())  # Raises ValueError("RDD is empty")


if __name__ == '__main__':
    main()

I am using mongo-java-driver-3.2.2.jar, mongo-hadoop-spark-1.5.2.jar, pymongo-3.2.2-py2.7-linux-x86_64 and pymongo_spark, submitting the job with spark-submit. The deployed Spark version is 1.6.1, with Hadoop 2.6.4.

I am aware that the library does not support splitting compressed BSON files; however, it should work with a single split. I have hundreds of compressed BSON files to analyse, and decompressing each of them does not seem to be a viable option.
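
For context, this is roughly how I would fan the job out over many dumps once a single compressed file reads correctly: build one BSONFileRDD per file and union them, so each .gz dump becomes a single unsplittable partition. This is only a sketch; the file names are placeholders, not my actual data.

import pymongo_spark


def union_bson_dumps(sc, paths):
    pymongo_spark.activate()                    # adds BSONFileRDD to SparkContext
    rdds = [sc.BSONFileRDD(p) for p in paths]   # one RDD (single split) per .gz dump
    return sc.union(rdds)                       # one RDD over all documents

# e.g. paths = [HDFS_HOME + name for name in ('dump_a.bson.gz', 'dump_b.bson.gz')]
# docs = union_bson_dumps(spark_context, paths)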

Any idea how I should proceed? Thanks in advance!


Answer 1:


I've just tested this in the following environment: mongo-hadoop-spark-1.5.2.jar, Spark 1.6.1 for Hadoop 2.6.4, PyMongo 3.2.2. The source file is a compressed mongodump output, small enough to fit in a single split (uncompressed collection size of 105 MB). Running it through the PySpark shell:

from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
conf = SparkConf().setAppName("pyspark-bson")
# `sc` below is the SparkContext provided by the pyspark shell
file_path = "/file/example_bson.gz"
rdd = sc.BSONFileRDD(file_path)
rdd.first()

It was able to read the compressed BSON file and listed the first document. Please make sure the input file is reachable and is in valid BSON format.
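
If the RDD still comes back empty, it may be worth ruling out a corrupt dump before involving Spark at all. As a quick sanity check (just a sketch, with a placeholder path), PyMongo's bson module can stream documents straight out of the gzipped file:

import gzip
from bson import decode_file_iter

with gzip.open('example_bson.gz', 'rb') as f:   # placeholder path to one dump
    first_doc = next(decode_file_iter(f))       # decode the first BSON document
    print(first_doc)

If this prints a document, the dump itself is fine and the problem is on the Spark/Hadoop side (path, configuration, or jars); if it fails, the file was not a valid BSON dump to begin with.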



Source: https://stackoverflow.com/questions/36886704/pyspark-empty-rdd-on-reading-gzipped-bson-files
