Question
Will concurrently appending rows to a DataFrame via unionAll, as in the following code, work correctly? It currently fails with a type error.
from pyspark.sql.types import *

schema = StructType([
    StructField("owreg", StringType(), True),
    StructField("we", StringType(), True),
    StructField("aa", StringType(), True),
    StructField("cc", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("ss", StringType(), True),
    StructField("sss", StringType(), True)
])
f = sqlContext.createDataFrame(sc.emptyRDD(), schema)
def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sqlContext.read.json(l))

savedlabels.limit(10).foreach(lambda a: dump(a.labels, a.job_seq_id))
Assume that sqlContext.read.json(l) will read the JSON and return a DataFrame with the same schema.
The pattern is that I want to "reduce" multiple JSON tables stored in a column of a DataFrame into a single table as efficiently as possible.
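For concreteness, here is a minimal sketch of that reduce done entirely on the driver. It is only an illustration under assumptions: that each labels string parses with json.loads to a list of records matching schema, and that savedlabels, f and the "<!E!>" sentinel are as defined above.

import json
from functools import reduce

# Only ten rows are involved, so they can be collected to the driver.
rows = savedlabels.limit(10).collect()

# json.loads(r.labels) is assumed to yield a list of records matching
# `schema`; the data is already local, so no sc.parallelize is needed.
dfs = [sqlContext.createDataFrame(json.loads(r.labels), schema)
       for r in rows
       if not r.labels.startswith("<!E!>")]

# Fold the per-row DataFrames into one, starting from the empty `f`
# so the schema is preserved even if every row was filtered out.
result = reduce(lambda a, b: a.unionAll(b), dfs, f)

A second attempt at dump, pushing the parsing into the workers with sc.parallelize, is shown next.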
def dump(l, jsid):
    if not l.startswith("<!E!>"):
        f = f.unionAll(sc.parallelize(json.loads(l)).toDF())
This second attempt also does not work, because sc.parallelize ends up being invoked from the worker tasks. How can I solve this problem?
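One direction that avoids calling sc or sqlContext from inside the workers altogether: read.json also accepts an RDD of JSON strings, not just a path. The sketch below is only illustrative and assumes savedlabels is a DataFrame whose labels column holds one JSON record per string; it reuses schema and the "<!E!>" sentinel from above.

# Extract the JSON strings as a plain RDD of strings on the driver side.
labels_rdd = (savedlabels
              .limit(10)
              .rdd
              .map(lambda row: row.labels)
              .filter(lambda l: not l.startswith("<!E!>")))

# read.json parses the whole RDD of JSON strings into one DataFrame in a
# single pass; the explicit schema keeps the result compatible with `f`.
result = sqlContext.read.json(labels_rdd, schema)

Because nothing here touches the SparkContext or SQLContext inside a task, the failure mode described above does not arise.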
Source: https://stackoverflow.com/questions/37584185/how-to-reduce-multiple-json-tables-stored-in-a-column-of-an-rdd-to-a-single-rd