pyspark Hive Context — read table with UTF-8 encoding


Question


I have a table in Hive, and I am reading that table into a PySpark DataFrame, df_sprk_df:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# read the Hive table into a Spark DataFrame, then bring it into pandas
df_sprk_df = hive_context.sql('select * from databasename.tablename')
df_pandas_df = df_sprk_df.toPandas()
df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df with astype('str'), I get an error like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position ...

I even tried converting the columns to str one by one:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].str.encode('utf-8')

but with no luck. So basically: how can I import a Hive table into a pandas DataFrame with UTF-8 encoding?
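For context, the failure is Python 2's default 'ascii' codec at work: astype('str') calls str() on every cell, and str() cannot encode non-ASCII characters such as the euro sign in the traceback. A minimal repro, assuming Python 2:

# Python 2: str() encodes unicode values with the default 'ascii' codec
s = u'\u20ac'   # the euro sign from the error message
str(s)          # raises UnicodeEncodeError: 'ascii' codec can't encode character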


Answer 1:


This workaround solved it, by changing the default encoding for the session:

import sys

# Python 2 only: reload(sys) restores setdefaultencoding, which switches
# the interpreter's default codec from 'ascii' to UTF-8
reload(sys)
sys.setdefaultencoding('UTF-8')

and then

df_pandas_df = df_pandas_df.astype(str)

which converts the whole DataFrame to strings without the encoding error.
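Note that sys.setdefaultencoding only exists on Python 2 (hence the reload(sys) trick). On Python 3, str is already Unicode, so no workaround is needed; a minimal sketch of the same flow, assuming Spark 2.x where SparkSession is available:

from pyspark.sql import SparkSession

# a Hive-enabled session replaces the SparkContext/HiveContext pair
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df_pandas_df = spark.sql('select * from databasename.tablename').toPandas()
df_pandas_df = df_pandas_df.astype(str)   # no UnicodeEncodeError on Python 3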




Answer 2:


Instead of casting directly to string, try to infer the types of the pandas DataFrame using the following statement:

df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
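Note that pd.lib was removed in later pandas releases; the same check is available through the public API (a sketch, assuming pandas >= 0.20):

import pandas as pd

# prints the inferred type of each column, e.g. 'unicode', 'string' or 'mixed'
print(df_pandas_df.apply(lambda col: pd.api.types.infer_dtype(col)))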

UPD: try to perform the mapping without the .str invocation.

Maybe something like this:

# decode every value to unicode, silently dropping undecodable bytes (Python 2)
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
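On Python 3 there is no unicode built-in; a rough equivalent, assuming any stray values arrive as bytes, is to decode only the bytes objects and leave real str values untouched:

# Python 3 sketch: decode bytes as UTF-8, ignoring undecodable sequences
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(
        lambda x: x.decode('utf-8', errors='ignore') if isinstance(x, bytes) else x)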


Source: https://stackoverflow.com/questions/52076651/pyspark-hive-context-read-table-with-utf-8-encoding
