When registering a table using the %pyspark interpreter in Zeppelin, I can't access the table in %sql

拈花ヽ惹草 提交于 2020-01-01 10:16:13

问题


I am using Zeppelin 0.5.5. I found this code/sample here for python as I couldn't get my own to work with %pyspark http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I have a feeling his %pyspark example worked because if you using the original %spark zeppelin tutorial the "bank" table is already created.

This code is in a notebook.

%pyspark
from os import getcwd
# sqlContext = SQLContext(sc) # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(),     False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))

bankdf = sqlContext.createDataFrame(bank,bankSchema)
bankdf.registerAsTable("bank")

This code is in the same notebook but different work pad.

%sql 
SELECT count(1) FROM bank

org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...

回答1:


I found the problem to this issue. Prior to 0.6.0 the sqlContext variable is sqlc in %pyspark.

Defect can be found here: https://issues.apache.org/jira/browse/ZEPPELIN-134

In Pyspark, the SQLContext is currently available in the variable name sqlc. This is incosistent with the documentation and with the variable name in scala which is sqlContext.

sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)

Related code: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66

The suggested workaround is simply to do the following in your %pyspark script

sqlContext = sqlc

Found here:

https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E




回答2:


Instead of sqlContext, use sqlc and replace registerAsTable by sqlc.registerDataFrameAsTable

%pyspark
from os import getcwd
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")

bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])

bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))

bankdf = sqlc.createDataFrame(bank,bankSchema)
sqlc.registerDataFrameAsTable(bankdf, "bank")


%sql 
SELECT count(1) FROM bank


来源:https://stackoverflow.com/questions/33882360/when-registering-a-table-using-the-pyspark-interpreter-in-zeppelin-i-cant-acc

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!