Transforming a column and updating the DataFrame

Posted by 本小妞迷上赌 on 2019-12-02 11:45:31

Question


Below, I drop column A from a DataFrame because I want to apply a transformation to it (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation, I join the two resulting DataFrames on ID.

df = df_data.drop('A').join(
    df_data[['ID', 'A']].rdd\
        .map(lambda x: (x.ID, json.loads(x.A)) 
             if x.A is not None else (x.ID, None))\
        .toDF()\
        .withColumnRenamed('_1', 'ID')\
        .withColumnRenamed('_2', 'A'),
    ['ID']
)

What I dislike about this is, of course, the overhead I face because I have to do the withColumnRenamed operations.

With pandas, I'd do something like this:

pdf = pd.DataFrame([json.dumps([0]*np.random.randint(5,10)) for i in range(10)], columns=['A'])
pdf.A = pdf.A.map(lambda x: json.loads(x))
pdf

but the following does not work in pyspark:

df.A = df[['A']].rdd.map(lambda x: json.loads(x.A))

So is there an easier way than what I'm doing in my first code snippet?


Answer 1:


I do not think you need to drop the column and do the join. The following code should* be equivalent to what you posted:

cols = df_data.columns
df = df_data.rdd\
    .map(
        lambda row: tuple(
            [row[c] if c != 'A' else (json.loads(row[c]) if row[c] is not None else None) 
             for c in cols]
        )
    )\
    .toDF(cols)

*I haven't actually tested this code, but I think this should work.

But to answer your general question, you can transform a column in-place using withColumn().

df = df_data.withColumn("A", my_transformation_function("A").alias("A"))

Here my_transformation_function() can be a UDF or a built-in function from pyspark.sql.functions.
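As a minimal sketch of the withColumn() approach applied to the question's json.loads case: the helper names parse_json_or_none and with_parsed_json are hypothetical, and the return type ArrayType(IntegerType()) is an assumption based on the question's pandas example, where each cell is a JSON array of integers. Other JSON shapes would need a different return type.

```python
import json


def parse_json_or_none(value):
    """Parse a JSON string, passing None (a null cell) through unchanged."""
    return json.loads(value) if value is not None else None


def with_parsed_json(df, column="A"):
    """Return df with `column` replaced by its parsed JSON value.

    Assumption: the column holds JSON arrays of integers, as in the
    question's pandas example. Adjust the return type for other shapes.
    """
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Declare the return type explicitly; F.udf defaults to StringType.
    parse_udf = F.udf(parse_json_or_none, ArrayType(IntegerType()))

    # withColumn with an existing column name replaces that column,
    # so no drop/join or renaming is needed.
    return df.withColumn(column, parse_udf(F.col(column)))
```

With the question's DataFrame this would be called as df = with_parsed_json(df_data, "A"). All other columns, including ID, are carried over untouched.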




Answer 2:


From what I understand, is this what you are trying to achieve?

import pyspark.sql.functions as F
import json

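# Note: F.udf defaults to returnType=StringType(); pass an explicit returnType
# that matches the JSON structure if you need a non-string column.
json_convert = F.udf(lambda x: json.loads(x) if x is not None else None)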

cols = df_data.columns
df = df_data.select([json_convert(F.col('A')).alias('A')] + \
                    [col for col in cols if col != 'A'])


Source: https://stackoverflow.com/questions/49173600/transforming-a-column-and-update-the-dataframe
