How to Access Hive via Python?

小蘑菇 2020-11-30 17:11

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python appears to be outdated.

When I add this to /etc/profile:

export PYTHONP

16 Answers
  •  北海茫月
    2020-11-30 17:39

    Similar to eycheu's solution, but a little more detailed.

    Here is an alternative solution specifically for hive2 that does not require PyHive or installing system-wide packages. I am working in a Linux environment where I do not have root access, so installing the SASL dependencies as mentioned in Tristin's post was not an option for me:

    "If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager for your distribution."

    This solution relies on the Python package JayDeBeApi. In my experience, installing this one extra package on top of an Anaconda Python 2.7 install was all I needed. The package depends on Java (a JDK), which I assume is already set up.

    Step 1: Install JayDeBeApi

    pip install jaydebeapi


    Step 2: Download appropriate drivers for your environment:

    • Here is a link to the jars required for an enterprise CDH environment
    • Another post that talks about where to find jdbc drivers for Apache Hive

    Store all .jar files in a directory. I will refer to this directory as /path/to/jar/files/.

    Step 3: Identify your system's authentication mechanism:

    In the pyhive solutions listed, I've seen both PLAIN and Kerberos given as the authentication mechanism. Note that your JDBC connection URL will depend on the authentication mechanism you are using. I will explain the Kerberos solution, which does not pass a username/password. Here is more information on Kerberos authentication and options.
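
    To make the dependence on the authentication mechanism concrete, here is a minimal sketch of a helper that assembles the JDBC URL. The function name build_hive_url and its parameters are hypothetical. The AuthMech=1 (Kerberos) form mirrors the URL used later in this answer; the AuthMech=3 form with UID and PWD is my understanding of how the Cloudera driver accepts a username and password, so check your driver's documentation before relying on it.

```python
def build_hive_url(host, port=10000, database="default",
                   kerberos=True, user=None, password=None):
    """Assemble a hive2 JDBC URL for the Cloudera driver (hypothetical helper)."""
    base = "jdbc:hive2://{}:{}/{}".format(host, port, database)
    if kerberos:
        # AuthMech=1: Kerberos. Relies on an existing ticket, so no
        # credentials appear in the URL.
        return base + ";AuthMech=1;KrbHostFQDN={};KrbServiceName=hive".format(host)
    if user is None or password is None:
        raise ValueError("user and password are required when kerberos=False")
    # Assumption: AuthMech=3 with UID/PWD selects username-and-password auth.
    return base + ";AuthMech=3;UID={};PWD={}".format(user, password)


print(build_hive_url("localhost"))
```

    Keeping the URL construction in one place makes it easier to switch mechanisms without touching the connection code.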

    Create a Kerberos ticket if you do not already have one:

    $ kinit
    

    Tickets can be viewed via klist.
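
    If you want the script itself to verify that a ticket exists before connecting, one option is to call klist from Python. This is a sketch that assumes MIT Kerberos, where klist -s prints nothing and exits with status 0 only when a valid, unexpired credentials cache is found. The name has_kerberos_ticket is hypothetical.

```python
import subprocess


def has_kerberos_ticket(command=("klist", "-s")):
    """Return True if the given command exits with status 0.

    Assumption: with MIT Kerberos, `klist -s` exits 0 only when a valid
    ticket cache exists. Other Kerberos implementations may differ.
    """
    try:
        return subprocess.call(list(command)) == 0
    except OSError:
        # klist is not installed or not on PATH.
        return False


if not has_kerberos_ticket():
    print("No valid Kerberos ticket found; run kinit first.")
```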

    You are now ready to make the connection from Python:

    import jaydebeapi
    import glob
    # Creates a list of jar files in the /path/to/jar/files/ directory
    jar_files = glob.glob('/path/to/jar/files/*.jar')
    
    host = 'localhost'
    port = '10000'
    database = 'default'

    # Note: the driver class depends on your environment and on the drivers
    # you downloaded in step 2. This is the driver for my environment
    # (jdbc3, hive2, Cloudera Enterprise).
    driver = 'com.cloudera.hive.jdbc3.HS2Driver'

    # AuthMech=1 selects Kerberos authentication
    url = ('jdbc:hive2://' + host + ':' + port + '/' + database
           + ';AuthMech=1;KrbHostFQDN=' + host + ';KrbServiceName=hive')
    conn_hive = jaydebeapi.connect(driver, url, jars=jar_files)
    

    If you only need to read data, you can load query results directly into a pandas DataFrame, as in eycheu's solution:

    import pandas as pd
    df = pd.read_sql("select * from table", conn_hive)
    

    Otherwise, a cursor gives you a more versatile way to communicate with Hive:

    cursor = conn_hive.cursor()
    sql_expression = "select * from table"
    cursor.execute(sql_expression)
    results = cursor.fetchall()
    

    If you wanted to create a table, for example, you would submit a CREATE TABLE statement through the same cursor and skip the fetch, since there are no results to retrieve.
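
    As a sketch of that non-fetching path, the helper below runs a statement and closes the cursor without fetching anything. It only uses the standard DB-API calls cursor(), execute() and close(), which the conn_hive object above also provides. The name run_statement and the table my_table are made up for illustration.

```python
def run_statement(conn, sql):
    """Execute a statement that returns no rows (e.g. CREATE TABLE)."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)  # nothing to fetch for DDL
    finally:
        cursor.close()       # release the cursor even if execute() fails


# Usage with the Hive connection from above (hypothetical table name):
# run_statement(conn_hive,
#               "CREATE TABLE IF NOT EXISTS my_table (id INT, name STRING)")
```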
