Question
I am trying to create an empty DataFrame in Spark (PySpark).
I am using an approach similar to the one discussed here, but it is not working.
This is my code:
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
first = rdd.first()
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
Answer 1:
Extending Joe Widen's answer, you can actually create the schema with no fields like so:
schema = StructType([])
So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
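As a quick sanity check (my addition, not part of the original answer; it reuses the empty DataFrame from the Python example above), the empty DataFrame behaves like any other one:
>>> empty.count()
0
>>> empty.rdd.isEmpty()
True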
Answer 2:
At the time this answer was written, it looks like you need some sort of schema:
from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)
sqlContext.createDataFrame(sc.emptyRDD(), schema)
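If you assign the result and print its schema (a quick check added here, not from the original answer), you should see the single nullable string field:
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df.printSchema()
# root
#  |-- field1: string (nullable = true)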
Answer 3:
This will work with Spark version 2.0.0 or later:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False),
                     StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
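In Spark 2.x you can also skip the RDD and pass an empty Python list together with the explicit schema; since the schema is supplied, no inference is attempted (a minimal sketch reusing spark and schema from above):
df = spark.createDataFrame([], schema)
df.printSchema()  # shows col1 and col2, zero rows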
Answer 4:
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])
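Note that this gives you a DataFrame with one dummy row. If you need it truly empty, one option (my addition, not part of the original answer) is to drop the rows afterwards while keeping the inferred columns:
empty_pivot_table = pivot_table.limit(0)
empty_pivot_table.count()  # 0, but col1 and col2 are preserved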
Answer 5:
You can create an empty DataFrame with the following syntax in PySpark:
df = spark.createDataFrame([], ["col1", "col2", ...])
where [] represents the empty data for col1 and col2. Then you can register it as a temp view for your SQL queries:
df.createOrReplaceTempView("artist")
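If Spark complains that it cannot infer a schema from an empty dataset, passing an explicit StructType instead of bare column names should work (a sketch assuming two string columns; adjust types as needed):
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("col1", StringType(), True),
                     StructField("col2", StringType(), True)])
df = spark.createDataFrame([], schema)
df.createOrReplaceTempView("artist")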
Answer 6:
You can do it by loading an empty file (parquet, json, etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Java, you can use that no-path variant to create an empty DataFrame.
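If you want the loaded DataFrame to carry real columns instead of an empty schema, you can also pass an explicit schema to the reader (my addition; the field name here is just an example):
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("field1", StringType(), True)])
df = sqlContext.read.schema(schema).json("my_empty_file.json")
df.printSchema()  # shows field1 even though the file is empty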
Answer 7:
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows, then drops the "id" column, leaving you with a truly empty DataFrame.
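A quick verification (added here as a sketch, not from the original answer):
empty = spark.range(0).drop("id")
empty.count()    # 0 rows
empty.columns    # [] -> no columns either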
Source: https://stackoverflow.com/questions/34624681/how-to-create-an-empty-dataframe-why-valueerror-rdd-is-empty