How to avoid duplicates in a ClickHouse table?

Submitted by 烂漫一生 on 2019-12-11 07:19:57

Question


I have created a table and am inserting the same values multiple times to check how duplicates are handled. I can see that the duplicates are inserted. Is there a way to avoid duplicates in a ClickHouse table?

CREATE TABLE sample.tmp_api_logs ( id UInt32,  EventDate Date) ENGINE = MergeTree(EventDate, id, (EventDate,id), 8192);

insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');
insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');

select * from sample.tmp_api_logs;
┌─id─┬──EventDate─┐
│  1 │ 2018-11-23 │
│  2 │ 2018-11-23 │
└────┴────────────┘
┌─id─┬──EventDate─┐
│  1 │ 2018-11-23 │
│  2 │ 2018-11-23 │
└────┴────────────┘

Answer 1:


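ReplacingMergeTree is most likely what you need, as long as duplicate records share the same primary key (strictly speaking, the same sorting key). Note that it removes duplicates only during background merges, so they can remain visible for some time after an insert. You can also try other MergeTree-family engines if you need different behavior when a duplicate record is encountered. Use the FINAL keyword in queries to ensure uniqueness in the result.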
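As a rough illustration of this answer, here is a minimal sketch of the question's table rebuilt on ReplacingMergeTree. The table name tmp_api_logs_dedup is hypothetical, and the PARTITION BY / ORDER BY clauses are an assumed translation of the question's older MergeTree(EventDate, id, (EventDate, id), 8192) syntax.

```sql
-- Hypothetical table name; the columns match the table in the question.
CREATE TABLE sample.tmp_api_logs_dedup
(
    id UInt32,
    EventDate Date
)
ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (EventDate, id);  -- rows with equal (EventDate, id) count as duplicates

INSERT INTO sample.tmp_api_logs_dedup VALUES (1, '2018-11-23'), (2, '2018-11-23');
INSERT INTO sample.tmp_api_logs_dedup VALUES (1, '2018-11-23'), (2, '2018-11-23');

-- Without FINAL, duplicates may still be visible until a background merge runs.
SELECT * FROM sample.tmp_api_logs_dedup;

-- FINAL collapses rows that share a sorting key at query time.
SELECT * FROM sample.tmp_api_logs_dedup FINAL;

-- Optionally request an unscheduled merge (can be expensive on large tables).
OPTIMIZE TABLE sample.tmp_api_logs_dedup FINAL;
```

Queries with FINAL do extra work at read time, so whether to use FINAL or rely on background merges depends on how strictly each query needs unique rows.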




Answer 2:


If the raw data does not contain duplicates and they can appear only when an INSERT INTO is retried, use the deduplication feature of ReplicatedMergeTree. For it to work, you must retry the insert with exactly the same batch of data (the same set of rows in the same order). You can send the retry to a different replica, and the data block will still be inserted only once, because block hashes are shared between replicas via ZooKeeper.
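The retry behavior described here might look roughly like the sketch below. The table name, the ZooKeeper path, and the {shard} / {replica} macros are assumptions for illustration; they must match how your own cluster is configured.

```sql
-- Hypothetical table name and ZooKeeper path; {shard} and {replica} are
-- macros that are assumed to be defined in the server configuration.
CREATE TABLE sample.tmp_api_logs_replicated
(
    id UInt32,
    EventDate Date
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/tmp_api_logs_replicated', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (EventDate, id);

-- First attempt: the block is written and its hash is recorded in ZooKeeper.
INSERT INTO sample.tmp_api_logs_replicated VALUES (1, '2018-11-23'), (2, '2018-11-23');

-- Retry of exactly the same batch (same rows, same order): the block hash
-- matches a recently inserted block, so the block is skipped.
INSERT INTO sample.tmp_api_logs_replicated VALUES (1, '2018-11-23'), (2, '2018-11-23');
```

Only a limited window of recent block hashes is kept, so this protects against prompt retries of the same batch rather than against arbitrary repeated data.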

Otherwise, deduplicate the data externally before inserting it into ClickHouse, or clean up duplicates asynchronously with ReplacingMergeTree or ReplicatedReplacingMergeTree.



Source: https://stackoverflow.com/questions/53442559/how-to-avoid-duplicates-in-clickhouse-table
