How to return complex types using spark UDFs

Submitted by 北慕城南 on 2019-12-23 11:51:23

Question


Hello and thank you in advance.

My program is written in Java and I cannot move to Scala.

I am currently working with a Spark DataFrame extracted from a JSON file using the following line:

DataFrame dff = sqlContext.read().json("filePath.json");

SQLContext and SparkContext are correctly initialized and running perfectly.

The problem is that the JSON I'm reading has nested structs, and I want to clean/verify the inner data without changing the schema.

One of the DataFrame's columns in particular has the type "GenericRowWithSchema".

Let's say I want to clean only that column, named "data".

The solution that came to my mind was to define a User Defined Function (UDF) named "cleanDataField" and then run it over the column "data". Here's the code:

UDF1<GenericRowWithSchema, GenericRowWithSchema> cleanDataField =
        new UDF1<GenericRowWithSchema, GenericRowWithSchema>() {
            @Override
            public GenericRowWithSchema call(GenericRowWithSchema grws) {
                cleanGenericRowWithSchema(grws);
                return grws;
            }
        };

Then I would register the function in the SQLContext:

sqlContext.udf().register("cleanDataField", cleanDataField, DataTypes.StringType);

And after that I would call

df.selectExpr("cleanDataField(data)").show(10, false);

in order to see the first 10 rows with the clean data.

In the end, the question comes down to this: can I return complex data (such as a custom class object) from a UDF? And if it is possible, how should I do it? I guess I have to change the third parameter of the UDF registration, since I'm not returning a string, but what should I replace it with?

Thank you


Answer 1:


Let's say you want the UDF to return a data type such as struct<companyid:string,loyaltynum:int,totalprice:int,itemcount:int>

For this you can do the following:

// The schema JSON has to be passed as a properly escaped Java string.
DataType dt = DataType.fromJson(
    "{\"type\":\"struct\",\"fields\":["
    + "{\"name\":\"companyid\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},"
    + "{\"name\":\"loyaltynum\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},"
    + "{\"name\":\"totalprice\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},"
    + "{\"name\":\"itemcount\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}}]}");

You can then use that data type as the return type when registering your UDF.
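As an alternative to the JSON round-trip, the same struct type can be built programmatically with DataTypes.createStructType and passed as the UDF's return type. A minimal sketch, assuming the sqlContext from the question; the UDF body here is a placeholder for the asker's cleaning logic, not part of the original answer:

```java
import java.util.Arrays;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Equivalent to struct<companyid:string,loyaltynum:int,totalprice:int,itemcount:int>
StructType dt = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("companyid", DataTypes.StringType, false),
    DataTypes.createStructField("loyaltynum", DataTypes.IntegerType, false),
    DataTypes.createStructField("totalprice", DataTypes.IntegerType, false),
    DataTypes.createStructField("itemcount", DataTypes.IntegerType, false)));

// Register a UDF whose declared return type is the struct above.
// The Row it returns must match the struct's fields positionally.
sqlContext.udf().register("cleanDataField",
    (UDF1<Row, Row>) row -> RowFactory.create(
        row.getString(0), row.getInt(1), row.getInt(2), row.getInt(3)),
    dt);
```

Returning a Row (built with RowFactory.create) is what Spark expects for a struct-typed UDF result in Java; the struct's field names only label the output columns, so the Row itself carries values, not names.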




Answer 2:


I don't know if your question is still valid, but in case it is, here is the answer:

You need to replace the third argument with Encoders.bean(GenericRowWithSchema.class).schema()
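In practice this approach fits best with a custom bean class describing the cleaned column, since Encoders.bean derives a schema from a class's getters and setters. A sketch under that assumption; the CleanedData class and its two fields are hypothetical, not from the original question:

```java
import java.io.Serializable;

import org.apache.spark.sql.Encoders;

// Hypothetical bean describing the cleaned "data" column.
public class CleanedData implements Serializable {
    private String companyid;
    private int loyaltynum;

    public String getCompanyid() { return companyid; }
    public void setCompanyid(String v) { companyid = v; }
    public int getLoyaltynum() { return loyaltynum; }
    public void setLoyaltynum(int v) { loyaltynum = v; }
}

// Encoders.bean(...).schema() yields the StructType matching the bean,
// which serves as the UDF's declared return type at registration.
sqlContext.udf().register("cleanDataField", cleanDataField,
    Encoders.bean(CleanedData.class).schema());
```

This avoids writing out the schema JSON by hand: changing the bean automatically changes the registered return type.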



Source: https://stackoverflow.com/questions/39750540/how-to-return-complex-types-using-spark-udfs
