pyspark Hive Context — read table with UTF-8 encoding


Question


I have a table in Hive, and I am reading that table into a PySpark DataFrame, df_sprk_df:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext()
hive_context = HiveContext(sc)

# read the Hive table into a Spark DataFrame, then bring it into pandas
df_sprk_df = hive_context.sql('select * from databasename.tablename')
df_pandas_df = df_sprk_df.toPandas()
df_pandas_df = df_pandas_df.astype('str')

But when I try to convert df_pandas_df with astype('str'), I get an error like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position ...

I even tried converting the columns to str one by one:

for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].str.encode('utf-8')

but with no luck. So basically: how can I import a Hive table into a pandas DataFrame with UTF-8 encoding?
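For context, the failure is Python 2's default 'ascii' codec at work: astype('str') calls str() on every cell, and str() cannot encode non-ASCII characters such as the euro sign in the traceback. A minimal repro, assuming Python 2:

# Python 2: str() encodes unicode values with the default 'ascii' codec
s = u'\u20ac'   # the euro sign from the error message
str(s)          # raises UnicodeEncodeError: 'ascii' codec can't encode character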


Answer 1:


This workaround solved it, by changing the default encoding for the session:

import sys

# Python 2 only: reload(sys) restores setdefaultencoding, which switches
# the interpreter's default codec from 'ascii' to UTF-8
reload(sys)
sys.setdefaultencoding('UTF-8')

and then

df_pandas_df = df_pandas_df.astype(str)

which converts the whole DataFrame to strings without the encoding error.
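Note that sys.setdefaultencoding only exists on Python 2 (hence the reload(sys) trick). On Python 3, str is already Unicode, so no workaround is needed; a minimal sketch of the same flow, assuming Spark 2.x where SparkSession is available:

from pyspark.sql import SparkSession

# a Hive-enabled session replaces the SparkContext/HiveContext pair
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df_pandas_df = spark.sql('select * from databasename.tablename').toPandas()
df_pandas_df = df_pandas_df.astype(str)   # no UnicodeEncodeError on Python 3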




Answer 2:


Instead of casting directly to string, try to infer the types of the pandas DataFrame using the following statement:

df_pandas_df.apply(lambda x: pd.lib.infer_dtype(x.values))
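Note that pd.lib was removed in later pandas releases; the same check is available through the public API (a sketch, assuming pandas >= 0.20):

import pandas as pd

# prints the inferred type of each column, e.g. 'unicode', 'string' or 'mixed'
print(df_pandas_df.apply(lambda col: pd.api.types.infer_dtype(col)))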

UPD: try to perform the mapping without the .str invocation.

Maybe something like this:

# decode every value to unicode, silently dropping undecodable bytes (Python 2)
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(lambda x: unicode(x, errors='ignore'))
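On Python 3 there is no unicode built-in; a rough equivalent, assuming any stray values arrive as bytes, is to decode only the bytes objects and leave real str values untouched:

# Python 3 sketch: decode bytes as UTF-8, ignoring undecodable sequences
for cols in df_pandas_df.columns:
    df_pandas_df[cols] = df_pandas_df[cols].apply(
        lambda x: x.decode('utf-8', errors='ignore') if isinstance(x, bytes) else x)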


Source: https://stackoverflow.com/questions/52076651/pyspark-hive-context-read-table-with-utf-8-encoding
