I have some data in HDFS,i need to access that data using python,can anyone tell me how data is accessed from hive using python?
To install you'll need these libraries:
pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive
If you're on Linux, you may need to install SASL separately before running the above. Install the package libsasl2-dev using apt-get or yum or whatever package manager. For Windows there are some options on GNU.org. On a Mac SASL should be available if you've installed xcode developer tools (xcode-select --install)
After installation, you can execute a hive query like this:
from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")
Now that you have the hive connection, you have options how to use it. You can just straight-up query:
cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
use_result(result)
...or to use the connection to make a Pandas dataframe:
import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)