Spark DataFrames: registerTempTable vs not

前端 未结 2 912
花落未央
花落未央 2020-12-28 12:56

I just started with DataFrame yesterday and am really liking it so far.

I dont understand one thing though... (Referring to the example under \"Programmatically Spec

相关标签:
2条回答
  • 2020-12-28 13:47

    It's convenient to load the dataframe into a temp view in a notebook for example, where you can run exploratory queries on the data:

    df.createOrReplaceTempView("myTempView")
    

    Then in another notebook you can run a sql query and get all the nice integration features that come out of the box e.g. table and graph visualisation etc.

    %sql
    SELECT * FROM myTempView
    
    0 讨论(0)
  • 2020-12-28 13:52

    The reason to use the registerTempTable( tableName ) method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.

    val sc: SparkContext = ...
    val hc = new HiveContext( sc )
    val customerDataFrame = myCodeToCreateOrLoadDataFrame()
    customerDataFrame.registerTempTable( "cust" )
    val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
    val salesPerCustomer: DataFrame = hc.sql( query )
    salesPerCustomer.show()
    

    Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.

    In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, that I suspect would have been very difficult to do in Spark.

    If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than available via a plain SQLContext.

    0 讨论(0)
提交回复
热议问题