Question
I am trying to create an empty DataFrame in Spark (PySpark).
I am using an approach similar to the one discussed here, but it is not working.
This is my code:
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is the error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 404, in createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 285, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/context.py", line 229, in _inferSchema
first = rdd.first()
File "/Users/Me/Desktop/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
Answer 1:
Extending Joe Widen's answer, you can actually create the schema with no fields like so:
schema = StructType([])
So when you create the DataFrame using that as your schema, you'll end up with a DataFrame[].
>>> empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
>>> empty
DataFrame[]
>>> empty.schema
StructType(List())
In Scala, if you choose to use sqlContext.emptyDataFrame and check out the schema, it will return StructType().
scala> val empty = sqlContext.emptyDataFrame
empty: org.apache.spark.sql.DataFrame = []
scala> empty.schema
res2: org.apache.spark.sql.types.StructType = StructType()
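As a quick sanity check (my addition, not part of the original answer; it reuses the empty DataFrame from the Python example above), the empty DataFrame behaves like any other one:
>>> empty.count()
0
>>> empty.rdd.isEmpty()
True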
Answer 2:
At the time this answer was written, it looks like you need some sort of schema:
from pyspark.sql.types import *
field = [StructField("field1", StringType(), True)]
schema = StructType(field)
sqlContext.createDataFrame(sc.emptyRDD(), schema)
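If you assign the result and print its schema (a quick check added here, not from the original answer), you should see the single nullable string field:
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df.printSchema()
# root
#  |-- field1: string (nullable = true)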
Answer 3:
This will work with Spark version 2.0.0 or later:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sc = spark.sparkContext
sqlContext = SQLContext(sc)
schema = StructType([StructField('col1', StringType(), False),
                     StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
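In Spark 2.x you can also skip the RDD and pass an empty Python list together with the explicit schema; since the schema is supplied, no inference is attempted (a minimal sketch reusing spark and schema from above):
df = spark.createDataFrame([], schema)
df.printSchema()  # shows col1 and col2, zero rows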
Answer 4:
You can just use something like this:
pivot_table = sparkSession.createDataFrame([("99","99")], ["col1","col2"])
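Note that this gives you a DataFrame with one dummy row. If you need it truly empty, one option (my addition, not part of the original answer) is to drop the rows afterwards while keeping the inferred columns:
empty_pivot_table = pivot_table.limit(0)
empty_pivot_table.count()  # 0, but col1 and col2 are preserved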
Answer 5:
You can create an empty DataFrame with the following syntax in PySpark:
df = spark.createDataFrame([], ["col1", "col2", ...])
where [] represents the empty data for col1 and col2. Then you can register it as a temp view for your SQL queries:
df.createOrReplaceTempView("artist")
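If Spark complains that it cannot infer a schema from an empty dataset, passing an explicit StructType instead of bare column names should work (a sketch assuming two string columns; adjust types as needed):
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("col1", StringType(), True),
                     StructField("col2", StringType(), True)])
df = spark.createDataFrame([], schema)
df.createOrReplaceTempView("artist")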
Answer 6:
You can do it by loading an empty file (parquet, json, etc.) like this:
df = sqlContext.read.json("my_empty_file.json")
Then when you try to check the schema you'll see:
>>> df.printSchema()
root
In Scala/Java, not passing a path should work too; in Python it throws an exception. Also, if you ever switch to Scala/Java, you can use that no-path variant to create an empty DataFrame.
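If you want the loaded DataFrame to carry real columns instead of an empty schema, you can also pass an explicit schema to the reader (my addition; the field name here is just an example):
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("field1", StringType(), True)])
df = sqlContext.read.schema(schema).json("my_empty_file.json")
df.printSchema()  # shows field1 even though the file is empty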
Answer 7:
spark.range(0).drop("id")
This creates a DataFrame with an "id" column and no rows, then drops the "id" column, leaving you with a truly empty DataFrame.
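A quick verification (added here as a sketch, not from the original answer):
empty = spark.range(0).drop("id")
empty.count()    # 0 rows
empty.columns    # [] -> no columns either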
Source: https://stackoverflow.com/questions/34624681/how-to-create-an-empty-dataframe-why-valueerror-rdd-is-empty