Apache Drill: Write general-purpose array_agg UDF

Submitted by 烈酒焚心 on 2020-08-10 19:18:57

Question


I would like to create an array_agg UDF for Apache Drill that aggregates all values of a group into a list of values. It should work with any data mode (required, optional) and any minor type (varchar, dict, map, int, etc.).

However, I get the impression that Apache Drill's UDF API does not really make use of inheritance and generics. Each type has its own writer and handler, and they cannot be abstracted to handle any type. E.g., the ValueHolder interface seems to be purely cosmetic and cannot be used to have type-agnostic hooking of UDFs to any type.
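A minimal sketch of the problem described above, assuming the holder interface declares no methods and each holder exposes its own public fields. The classes below are simplified, hypothetical stand-ins written for illustration, not Drill's real holder classes (the real VarCharHolder, for instance, does not use a byte array).

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-ins, NOT Drill's classes: a method-less marker interface
// and two holders that share no common accessor.
interface SketchValueHolder {}

final class SketchIntHolder implements SketchValueHolder {
    int value;
}

final class SketchVarCharHolder implements SketchValueHolder {
    byte[] buffer;
    int start;
    int end;
}

final class HolderDispatchSketch {
    // Because the interface offers nothing to call, "generic" code still has
    // to branch on every concrete holder type it wants to support.
    static String describe(SketchValueHolder holder) {
        if (holder instanceof SketchIntHolder) {
            return "int:" + ((SketchIntHolder) holder).value;
        }
        if (holder instanceof SketchVarCharHolder) {
            SketchVarCharHolder v = (SketchVarCharHolder) holder;
            String text = new String(v.buffer, v.start, v.end - v.start, StandardCharsets.UTF_8);
            return "varchar:" + text;
        }
        throw new IllegalArgumentException("unsupported holder: " + holder.getClass().getName());
    }
}
```

Every additional type needs another branch, which is the same per-type duplication the interface was expected to remove.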

My current implementation

I tried to solve this by using Java's reflection so I could use the ListHolder's write function independent of the holder of the original value.
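The reflective dispatch idea can be sketched as follows. This is a self-contained illustration under stated assumptions: DemoIntHolder, DemoTextHolder and DemoListWriter are hypothetical stand-ins, not Drill's API, and the sketch only shows how a write(...) overload can be selected from the holder's runtime class.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT Drill's API.
final class DemoIntHolder { int value; }
final class DemoTextHolder { String value; }

// A writer with one write(...) overload per holder type.
final class DemoListWriter {
    final List<String> written = new ArrayList<>();
    public void write(DemoIntHolder h) { written.add("int:" + h.value); }
    public void write(DemoTextHolder h) { written.add("text:" + h.value); }
}

final class ReflectiveWriteSketch {
    // Finds the public write(...) overload whose parameter type is exactly the
    // holder's runtime class and invokes it on the writer.
    static void write(Object writer, Object holder) {
        try {
            Method m = writer.getClass().getMethod("write", holder.getClass());
            m.invoke(writer, holder);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "no usable write(" + holder.getClass().getSimpleName() + ") on writer", e);
        }
    }
}
```

The caller no longer names the holder type, but the lookup happens at run time on every call, and it does nothing about how the input parameter itself has to be declared.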

However, I then ran into the limitations of the @FunctionTemplate annotation: I cannot declare a general UDF parameter that accepts any value. I tried it with the ValueHolder interface (@Param ValueHolder input).

So it seems to me that the only way to support different types is to have a separate class for each type. But I can't even abstract much and work on any @Param input, because input is only visible in the class where it's defined (i.e. it is type-specific).

I based my implementation on https://issues.apache.org/jira/browse/DRILL-6963 and created the following two classes for required and optional varchars (how can this be unified in the first place?)

@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class VarChar_Agg implements DrillAggFunc {
    @Param org.apache.drill.exec.expr.holders.VarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }

        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj;

        listWriter.varChar().write(input);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj).endList();
    }
}

@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class NullableVarChar_Agg implements DrillAggFunc {
    @Param NullableVarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }

        if (input.isSet != 1) {
            return;
        }

        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj;

        org.apache.drill.exec.expr.holders.VarCharHolder outHolder = new org.apache.drill.exec.expr.holders.VarCharHolder();
        outHolder.start = input.start;
        outHolder.end = input.end;
        outHolder.buffer = input.buffer;

        listWriter.varChar().write(outHolder);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj).endList();
    }
}
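The two classes above are nearly identical, so if one class per type is unavoidable, one option might be to generate the class text instead of typing it. The sketch below uses plain string substitution. UdfTemplateSketch and its placeholders are hypothetical names introduced here, the template covers only the required (non-nullable) variant, and the ListWriter accessor for each type is supplied by the caller and has to be checked against the real interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Drill: renders one array_agg class per type
// from a single text template.
final class UdfTemplateSketch {
    private static final String LIST_WRITER =
        "org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter";

    // Mirrors the required VarChar class from the question, unresolved issues included.
    // ${Type} is the holder prefix, ${accessor} the ListWriter method for that type.
    private static final String TEMPLATE = String.join("\n",
        "@FunctionTemplate(",
        "    name = \"array_agg\",",
        "    scope = FunctionScope.POINT_AGGREGATE,",
        "    nulls = NullHandling.INTERNAL",
        ")",
        "public static class ${Type}_Agg implements DrillAggFunc {",
        "    @Param org.apache.drill.exec.expr.holders.${Type}Holder input;",
        "    @Workspace ObjectHolder agg;",
        "    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;",
        "",
        "    @Override public void setup() { agg = new ObjectHolder(); }",
        "    @Override public void reset() { agg = new ObjectHolder(); }",
        "",
        "    @Override public void add() {",
        "        if (agg.obj == null) {",
        "            agg.obj = out.rootAsList();",
        "        }",
        "        ((" + LIST_WRITER + ") agg.obj).${accessor}().write(input);",
        "    }",
        "",
        "    @Override public void output() {",
        "        ((" + LIST_WRITER + ") agg.obj).endList();",
        "    }",
        "}");

    // Fills in the template for one type.
    static String render(String type, String accessor) {
        return TEMPLATE.replace("${Type}", type).replace("${accessor}", accessor);
    }

    // Renders one class per entry; keys are holder prefixes, values are accessors.
    static Map<String, String> renderAll(Map<String, String> accessorsByType) {
        Map<String, String> sources = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : accessorsByType.entrySet()) {
            sources.put(e.getKey() + "_Agg", render(e.getKey(), e.getValue()));
        }
        return sources;
    }
}
```

Calling UdfTemplateSketch.render("VarChar", "varChar") produces the text of a class equivalent to VarChar_Agg. This only moves the duplication into a build step; it does not make a single UDF class type-agnostic.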

Interestingly, I can't import org.apache.drill.exec.vector.complex.writer.BaseWriter to shorten the code, because Apache Drill then fails to find it. So I have to spell out the full package path for everything in org.apache.drill.exec.vector.complex.writer. Furthermore, I'm using the deprecated ObjectHolder. Is there a better solution?

Anyway, these work so far, e.g. with this query:

SELECT
    MIN(tbl.`timestamp`) AS start_view,
    MAX(tbl.`timestamp`) AS end_view,
    array_agg(tbl.eventLabel) AS label_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
WHERE tbl.data.slug IS NOT NULL
GROUP BY tbl.data.slug

However, when I add ORDER BY, I get this:

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: UnsupportedOperationException: NULL

Fragment 0:0

Additionally, I tried more complex types, namely maps/dicts. Interestingly, when I call SELECT sqlTypeOf(tbl.data) FROM tbl, I get MAP. But when I write UDFs, the query planner complains about having no UDF array_agg for type dict.

Anyway, I wrote a version for dicts:

@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Map_Agg implements DrillAggFunc {
    @Param MapHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }

        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj).endList();
    }
}

@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Dict_agg implements DrillAggFunc {
    @Param DictHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;

    @Override
    public void setup() {
        agg = new ObjectHolder();
    }

    @Override
    public void reset() {
        agg = new ObjectHolder();
    }

    @Override public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }

        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
                (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }

    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter)agg.obj).endList();
    }
}

But here, I get an empty list in the field data_agg for my query:

SELECT
    MIN(tbl.`timestamp`) AS start_view,
    MAX(tbl.`timestamp`) AS end_view,
    array_agg(tbl.data) AS data_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
GROUP BY tbl.data.viewSlag

Summary of questions

  • Most importantly: How do I create an array_agg UDF for Apache Drill?
  • How can I make UDFs type-agnostic/general purpose? Do I really have to implement an entire class for each Nullable, Required and Repeated version of every type? That's a lot of tedious work. Isn't there a way to handle values in a UDF without depending on the underlying types? I wish Apache Drill would just use what Java offers here: generic types, specialised function overloading and inheritance within its own type system. Am I missing something on how to do that?
  • How can I fix the NULL problem when I use ORDER BY on my varchar version of the aggregate?
  • How can I fix the problem where my aggregate of maps/dicts is an empty list?
  • Is there an alternative to using the deprecated ObjectHolder?

Source: https://stackoverflow.com/questions/62919727/apache-drill-write-general-purpose-array-agg-udf
