how to manage many avsc files in flink when consuming multiple topics gracefully

Submitted by 南笙酒味 on 2021-02-10 18:26:20

Question


Here is my case: I use Flink to consume many Kafka topics with SimpleStringSchema. OutputTag is used since we later need to bucket the data, as Parquet + Snappy, into directories by topic. We then iterate over all the topics, processing each one with its own AVSC schema file.

Now I have to modify the AVSC schema files whenever new columns are added. This becomes a real problem when ten or a hundred files need to be modified.

So is there a more graceful way to avoid changing the AVSC files, or a better way to manage them?


Answer 1:


In general, I'd avoid ingesting data with different schemas in the same source. That is especially true for multiple schemas within the same topic.

A common and scalable way to avoid it is to use some kind of envelope format.

{
  "namespace": "example",
  "name": "Envelope",
  "type": "record",
  "fields": [
    {
      "name": "type1",
      "type": ["null", {
        "type": "record",
        "name": "Type1",
        "fields": [ ... ]
      }],
      "default": null
    },
    {
      "name": "type2",
      "type": ["null", {
        "type": "record",
        "name": "Type2",
        "fields": [ ... ]
      }],
      "default": null
    }
  ]
}

This envelope is evolvable (wrapped types can be added or removed arbitrarily, and each of them can evolve on its own), and it adds only a little overhead (1 byte per subtype to encode the union branch). The downside is that you cannot enforce that exactly one of the subtypes is set.
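Since the schema cannot enforce that exactly one subtype is set, the consuming job has to check this itself and dispatch on whichever field is populated. A minimal sketch of that dispatch step (plain Python on already-deserialized records; the function name `route_envelope` and the payload fields are illustrative, not part of any library):

```python
def route_envelope(envelope):
    """Return (subtype_name, payload) for the single non-null field.

    Raises ValueError when zero or more than one subtype is set --
    exactly the invariant the Avro schema itself cannot enforce.
    """
    set_fields = [(name, value) for name, value in envelope.items()
                  if value is not None]
    if len(set_fields) != 1:
        raise ValueError(f"expected exactly one subtype, got {len(set_fields)}")
    return set_fields[0]


# Example: an envelope carrying only a 'type1' payload.
name, payload = route_envelope({"type1": {"id": 42}, "type2": None})
```

In a Flink job the returned subtype name would then select the matching OutputTag, so each record still ends up in its per-type bucket.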

This schema is fully compatible with the schema registry, so no need to parse anything manually.



来源:https://stackoverflow.com/questions/59466651/how-to-manage-many-avsc-files-in-flink-when-consuming-multiple-topics-gracefully
