How to “reduce” multiple json tables stored in a column of an RDD to a single RDD table as efficiently as possible

Submitted by 六眼飞鱼酱① on 2019-12-02 18:46:55

Question


Will concurrently appending rows with union on a DataFrame, using the following code, work correctly? It currently raises a type error.

from pyspark.sql.types import *

schema = StructType([
    StructField("owreg", StringType(), True),
    StructField("we", StringType(), True),
    StructField("aa", StringType(), True),
    StructField("cc", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("ss", StringType(), True),  # note: "ss" is declared twice
    StructField("sss", StringType(), True),
])

f = sqlContext.createDataFrame(sc.emptyRDD(), schema)

def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sqlContext.read.json(l))

savedlabels.limit(10).foreach(lambda a: dump(a.labels, a.job_seq_id))

Assume that sqlContext.read.json(l) reads one JSON document and returns a DataFrame with the same schema.

The pattern is that I want to "reduce" the multiple JSON tables stored in a column of an RDD into a single table as efficiently as possible.
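For reference, a minimal sketch of one way to express this pattern, assuming (as above) that savedlabels carries the JSON strings in its labels column and that each value is a complete JSON document. In PySpark, DataFrameReader.json also accepts an RDD of JSON strings, so the placeholder rows can be filtered out and the entire column parsed in one distributed pass:

# Sketch: pull the labels column out as an RDD of JSON strings,
# drop the "<!E!>" placeholder rows, and parse everything at once
# instead of calling unionAll per row.
json_rdd = (savedlabels
            .limit(10)
            .rdd
            .map(lambda row: row.labels)
            .filter(lambda l: not l.startswith("<!E!>")))

# read.json accepts an RDD of strings; the explicit schema avoids a
# second pass over the data for schema inference.
f = sqlContext.read.json(json_rdd, schema=schema)

A second attempt parses each JSON string by hand instead: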

import json

def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sc.parallelize(json.loads(l)).toDF())

This variant will not work either, because sc.parallelize is being invoked from the worker threads, where no SparkContext exists. So how can this problem be solved?
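One workaround, sketched under the same assumptions (in particular, that each labels value is a JSON array of objects whose keys match the schema): do the parsing itself on the executors with the plain json module, which needs no SparkContext, flatten the resulting rows, and build the DataFrame once on the driver.

import json

# Sketch: json.loads runs on the executors without touching any
# Spark context; flatMap flattens each parsed JSON array into rows.
parsed = (savedlabels
          .rdd
          .filter(lambda row: not row.labels.startswith("<!E!>"))
          .flatMap(lambda row: json.loads(row.labels)))

# createDataFrame accepts an RDD of dicts when an explicit schema is
# supplied, so the union happens once rather than per row.
f = sqlContext.createDataFrame(parsed, schema)

This keeps the per-row work distributed and avoids both the closure problem (rebinding f inside a function shipped to the workers) and the overhead of building one new plan per row with repeated unionAll calls.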

Source: https://stackoverflow.com/questions/37584185/how-to-reduce-multiple-json-tables-stored-in-a-column-of-an-rdd-to-a-single-rd
