Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster


This is because the spark-submit job is unable to find the hive-site.xml, so it cannot connect to the Hive metastore. Please add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
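For example, a sketch of the full command (the application script name is a placeholder):

$ spark-submit --master yarn --deploy-mode cluster \
--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
your_application.py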

It looks like you are affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345.



I had a similar issue with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0:
My goal was to create a DataFrame from a Hive SQL query under these conditions:

  • Python API,
  • cluster deploy-mode (driver program running on one of the executor nodes),
  • use YARN to manage the executor JVMs (instead of a standalone Spark master instance).

The initial tests gave these results:

  1. spark-submit --deploy-mode client --master local ... => WORKING
  2. spark-submit --deploy-mode client --master yarn ... => WORKING
  3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING

In case #3, the driver running on one of the executor nodes could not find the database. In client mode the driver runs on the submitting host, where hive-site.xml is on the local classpath; in cluster mode it runs inside a YARN container that does not have the file. The error was:

pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'
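A quick way to confirm the cause: when hive-site.xml is missing, Spark falls back to a local embedded catalog instead of the Hive metastore, so the driver typically sees only the default database. A minimal diagnostic sketch (Spark 2.x API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# If hive-site.xml was not shipped to the driver, this usually lists only 'default'
spark.sql('show databases').show()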



Fokko Driesprong's answer above worked for me.
With the command below, the driver running on the executor node was able to access a Hive table in a non-default database:

$ /usr/hdp/current/spark2-client/bin/spark-submit \
--deploy-mode cluster --master yarn \
--files /usr/hdp/current/spark2-client/conf/hive-site.xml \
/path/to/python/code.py
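The --files option ships hive-site.xml into the working directory of the YARN containers, so the driver launched in cluster mode can read the metastore connection settings at startup.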



The Python code I used to test with Spark 1.6.2 and Spark 2.0.0 is below. (Change SPARK_VERSION to 1 to test with Spark 1.6.2, and be sure to update the paths in the spark-submit command accordingly.)

SPARK_VERSION = 2  # set to 1 to test with Spark 1.6.2
APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)



def spark1():
    # Spark 1.x: a HiveContext on top of the SparkContext provides access to the Hive metastore
    from pyspark.sql import HiveContext
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName(APP_NAME)
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    query = 'select * from database_name.table_name limit 5'
    df = hc.sql(query)
    printout(df)




def spark2():
    # Spark 2.x: enableHiveSupport() is required for the session to use the Hive metastore
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
    query = 'select * from database_name.table_name limit 5'
    df = spark.sql(query)
    printout(df)




def printout(df):
    # Show a sample, the row count, and the collected rows to verify the table is readable
    print('\n########################################################################')
    df.show()
    print(df.count())

    df_list = df.collect()
    print(df_list)
    print(df_list[0])
    print(df_list[1])
    print('########################################################################\n')




def main():
    if SPARK_VERSION == 1:
        spark1()
    elif SPARK_VERSION == 2:
        spark2()




if __name__ == '__main__':
    main()