问题
I'm trying to perform a simple task in spark dataframe (python) which is create new dataframe by selecting specific column and nested columns from another dataframe for example :
df.printSchema()
root
|-- time_stamp: long (nullable = true)
|-- country: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- id: long (nullable = true)
| |-- time_zone: string (nullable = true)
|-- event_name: string (nullable = true)
|-- order: struct (nullable = true)
| |-- created_at: string (nullable = true)
| |-- creation_type: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- name: string (nullable = true)
| |-- destination: struct (nullable = true)
| | |-- state: string (nullable = true)
| |-- ordering_user: struct (nullable = true)
| | |-- cancellation_score: long (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_test: boolean (nullable = true)
df2=df.sqlContext.sql("""select a.country_code as country_code,
a.order_destination_state as order_destination_state,
a.order_ordering_user_id as order_ordering_user_id,
a.order_ordering_user_is_test as order_ordering_user_is_test,
a.time_stamp as time_stamp
from
(select
flat_order_creation.order.destination.state as order_destination_state,
flat_order_creation.order.ordering_user.id as order_ordering_user_id,
flat_order_creation.order.ordering_user.is_test as order_ordering_user_is_test,
flat_order_creation.time_stamp as time_stamp
from flat_order_creation) a""")
and I get the following error:
Traceback (most recent call last):
File "/home/hadoop/scripts/orders_all.py", line 180, in <module>
df2=sqlContext.sql(q)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 552, in sql
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 36, in deco
File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.sql.
: java.lang.RuntimeException: [6.21] failure: ``*'' expected but `order' found
flat_order_creation.order.destination.state as order_destination_state,
I'm using spark-submit with master in local mode to run the this code. it important to mention the when I'm connecting to pyspark shell and run the code (line by line) it works , but when submitting it (even in local mode) it fails. another thing is important to mention is that when selecting a non nested field it works as well. I'm using spark 1.5.2 on EMR (version 4.2.0)
回答1:
Without a Minimal, Complete, and Verifiable example I can only guess but it looks like you're using different SparkContext
implementations in the interactive shell and your standalone program.
As long as Spark binaries have been build with Hive support sqlContext
provided in the shell is a HiveContext
. Among other differences it provides more sophisticated SQL parser than a plain SQLContext
. You can easily reproduce your problem as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
val conf: SparkConf = ???
val sc: SparkContext = ???
val query = "SELECT df.foobar.order FROM df"
val hiveContext: SQLContext = new HiveContext(sc)
val sqlContext: SQLContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"foobar": {"order": 1}}"""))
sqlContext.read.json(json).registerTempTable("df")
sqlContext.sql(query).show
// java.lang.RuntimeException: [1.18] failure: ``*'' expected but `order' found
// ...
hiveContext.read.json(json).registerTempTable("df")
hiveContext.sql(query)
// org.apache.spark.sql.DataFrame = [order: bigint]
Initialization sqlContext
with HiveContext
in the standalone program should do the trick:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame(...)
df.registerTempTable("flat_order_creation")
sqlContext.sql(...)
It is important to note that problem is not nesting itself but using ORDER
keyword as a column name. So if using HiveContext
is not an option just change a name of the field to something else.
来源:https://stackoverflow.com/questions/35015275/py4j-protocol-py4jjavaerror-when-selecting-nested-column-in-dataframe-using-sele