how to manage many avsc files in flink when consuming multiple topics gracefully

Submitted by 南笙酒味 on 2021-02-10 18:26:20

Question


Here is my case: I use Flink to consume many Kafka topics with SimpleStringSchema. OutputTag is used since we later need to bucket the data, as Parquet + Snappy, into directories by topic. We then iterate over all the topics, processing each one with its own AVSC schema file.

Now I have to modify the AVSC schema files whenever new columns are added. This becomes a real problem when ten or a hundred files need to be modified.

So is there a more graceful way to avoid changing the AVSC files, or a better way to manage them?


Answer 1:


In general, I'd avoid ingesting data with different schemas in the same source. That is especially true for multiple schemas within the same topic.

A common and scalable way to avoid it is to use some kind of envelope format.

{
  "namespace": "example",
  "name": "Envelope",
  "type": "record",
  "fields": [
    {
      "name": "type1",
      "type": ["null", {
        "type": "record",
        "name": "Type1",
        "fields": [ ... ]
      }],
      "default": null
    },
    {
      "name": "type2",
      "type": ["null", {
        "type": "record",
        "name": "Type2",
        "fields": [ ... ]
      }],
      "default": null
    }
  ]
}

This envelope is evolvable (wrapped types can be added or removed arbitrarily, and each of them can evolve on its own), and it adds only a little overhead (1 byte per subtype to encode the union branch). The downside is that you cannot enforce that exactly one of the subtypes is set.
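Since the schema cannot enforce that exactly one subtype is set, the consuming job has to check this itself and dispatch on whichever field is populated. A minimal sketch of that dispatch step (plain Python on already-deserialized records; the function name `route_envelope` and the payload fields are illustrative, not part of any library):

```python
def route_envelope(envelope):
    """Return (subtype_name, payload) for the single non-null field.

    Raises ValueError when zero or more than one subtype is set --
    exactly the invariant the Avro schema itself cannot enforce.
    """
    set_fields = [(name, value) for name, value in envelope.items()
                  if value is not None]
    if len(set_fields) != 1:
        raise ValueError(f"expected exactly one subtype, got {len(set_fields)}")
    return set_fields[0]


# Example: an envelope carrying only a 'type1' payload.
name, payload = route_envelope({"type1": {"id": 42}, "type2": None})
```

In a Flink job the returned subtype name would then select the matching OutputTag, so each record still ends up in its per-type bucket.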

This schema is fully compatible with the schema registry, so no need to parse anything manually.



来源:https://stackoverflow.com/questions/59466651/how-to-manage-many-avsc-files-in-flink-when-consuming-multiple-topics-gracefully
