ClickHouse

How to avoid merging high cardinality sub-select aggregations on distributed tables

与世无争的帅哥 submitted on 2020-01-23 02:44:04
Question: In ClickHouse, I have a large table A with the following columns: date, user_id, operator, active. In table A, events are already pre-aggregated over date, user_id and operator, while the column 'active' indicates the presence of a certain kind of user activity on the given date. Table A is distributed over 2 shards/servers: first I created the table A_local on each server (the primary key is date, user_id); then I created the distributed table A that merges the local A_local tables, using hash(userid, operator) as the sharding key. User…
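The question is cut off above, but the setup it describes can be sketched like this (cluster name, database, and the concrete hash function are assumptions; the question only says hash(userid, operator)):

-- On each server: the local table, primary key (date, user_id) as in the question
CREATE TABLE default.A_local
(
    date Date,
    user_id UInt64,
    operator String,
    active UInt8
) ENGINE = MergeTree(date, (date, user_id), 8192);

-- The distributed table over both shards, sharded as the question describes
CREATE TABLE default.A AS default.A_local
ENGINE = Distributed(my_cluster, default, A_local, cityHash64(user_id, operator));

Because this sharding key keeps every (user_id, operator) pair on a single shard, each shard can finish the sub-select aggregation on its own; if memory serves, the distributed_group_by_no_merge setting tells ClickHouse to skip the merge step on the initiating server:

SELECT user_id, operator, sum(active) AS days_active
FROM default.A
GROUP BY user_id, operator
SETTINGS distributed_group_by_no_merge = 1;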

How to make clickhouse take new users.xml file?

可紊 submitted on 2019-12-20 03:54:31
Question: Do I have to restart ClickHouse to make it read any update to users.xml? Is there a way to just "reload" ClickHouse?

Answer 1: These files are reloaded at runtime; there is no need to restart the server. As you may notice, the config folder contains several files: config-preprocessed.xml, config.xml, users-preprocessed.xml, users.xml. The *-preprocessed.xml files hold the parsed config, so you can see when it was loaded and parsed.

Answer 2: I wouldn't recommend modifying the files '/etc/clickhouse-server/config.xml' or 'etc…
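If you don't want to wait for the automatic reload, recent ClickHouse releases also expose an explicit reload statement (a sketch; check that your version supports it before relying on it). It is also common practice to drop override files into /etc/clickhouse-server/users.d/ instead of editing users.xml directly:

SYSTEM RELOAD CONFIG;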

Notes on ClickHouse table engines

放肆的年华 submitted on 2019-12-18 14:30:42
ClickHouse table engines: the MergeTree family

MergeTree. Main features:

- Stores data sorted by primary key. This lets you create a small sparse index that helps locate data faster.
- Partitioning can be used if a partitioning key is specified. ClickHouse supports certain operations on partitions that are more efficient than general operations on the same data with the same result. ClickHouse also automatically prunes partitions when the partitioning key is specified in the query, which likewise improves query performance.
- Data replication support. The ReplicatedMergeTree family of tables provides data replication; see Data Replication for more information.
- Data sampling support. If necessary, a data sampling method can be set on the table.

The engines of the MergeTree family (*MergeTree) are the most powerful ClickHouse table engines. They are designed for inserting very large amounts of data into a table: data is quickly written to the table part by part, and rules are then applied to merge the parts in the background. This method is far more efficient than continually rewriting the data in storage during inserts.

ReplacingMergeTree. This engine differs from MergeTree in that it removes duplicate entries that have the same primary key value (or, more precisely, the same sorting key value). Deduplication happens only during merges. Merges run in the background at an unknown time, so you cannot plan for them; some data may remain unprocessed…
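A small demonstration of the ReplacingMergeTree behaviour described above (a sketch; the table and values are invented, and the modern DDL syntax is assumed):

CREATE TABLE rmt_demo
(
    k UInt32,
    v String,
    updated DateTime
) ENGINE = ReplacingMergeTree(updated)  -- keeps the row with the highest 'updated' per key
ORDER BY k;

INSERT INTO rmt_demo VALUES (1, 'old', '2019-12-01 00:00:00');
INSERT INTO rmt_demo VALUES (1, 'new', '2019-12-02 00:00:00');

-- Both rows stay visible until a background merge happens at some unknown time...
SELECT * FROM rmt_demo;

-- ...so force a merge to observe the deduplication:
OPTIMIZE TABLE rmt_demo FINAL;
SELECT * FROM rmt_demo;  -- one row: k=1, v='new'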

Clickhouse import data from csv DB::NetException: Connection reset by peer, while writing to socket

落爺英雄遲暮 submitted on 2019-12-11 08:33:47
Question: I'm trying to load a *.gz file into ClickHouse through:

clickhouse-client --max_memory_usage=15323460608 --format_csv_delimiter="|" --query="INSERT INTO tmp1.my_test_table FORMAT CSV"

I'm getting the error:

Code: 210. DB::NetException: Connection reset by peer, while writing to socket (127.0.0.1:9000).

There are no errors in clickhouse-server.log, clickhouse-server.err.log or zookeeper.log. When I run the insert command I see memory usage approach the server's limit (32 GB); this is why I tried…
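A lower-memory approach (a sketch, not from the thread; data.gz is a placeholder name, and max_insert_block_size is just one knob that may help): stream the decompressed file into clickhouse-client so nothing has to buffer the whole input at once:

:~$ zcat data.gz | clickhouse-client \
      --format_csv_delimiter="|" \
      --max_insert_block_size=100000 \
      --query="INSERT INTO tmp1.my_test_table FORMAT CSV"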

How to avoid duplicates in clickhouse table?

烂漫一生 submitted on 2019-12-11 07:19:57
Question: I created a table and inserted the same values multiple times to check for duplicates, and I can see that the duplicates are inserted. Is there a way to avoid duplicates in a ClickHouse table?

CREATE TABLE sample.tmp_api_logs (id UInt32, EventDate Date) ENGINE = MergeTree(EventDate, id, (EventDate, id), 8192);
insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');
insert into sample.tmp_api_logs values(1,'2018-11-23'),(2,'2018-11-23');
select * from sample.tmp_api_logs;
┌─id─┬─…
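One common workaround (a hedged sketch, not from the thread): keep the MergeTree table as-is and deduplicate at query time, or switch the table to ReplacingMergeTree as shown in the engines overview above:

-- Query-time deduplication on the existing table:
SELECT id, any(EventDate) AS EventDate
FROM sample.tmp_api_logs
GROUP BY id;

-- Or keep at most one row per key in the result:
SELECT *
FROM sample.tmp_api_logs
ORDER BY id
LIMIT 1 BY id;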

Clickhouse Data Import

拜拜、爱过 submitted on 2019-12-10 16:43:32
Question: I created a table in ClickHouse:

CREATE TABLE stock
(
    plant Int32,
    code Int32,
    service_level Float32,
    qty Int32
) ENGINE = Log

There is a data file:

:~$ head -n 10 /var/rs_mail/IN/qv_stock_20160620035119.csv
2010,646,1.00,13
2010,2486,1.00,19
2010,8178,1.00,10
2010,15707,1.00,4
2010,15708,1.00,10
2010,15718,1.00,4
2010,16951,1.00,8
2010,17615,1.00,13
2010,17616,1.00,4
2010,17617,1.00,8

I am trying to load the data:

:~$ cat /var/rs_mail/IN/qv_stock_20160620035119.csv | clickhouse-client --query=…
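The command is truncated above; presumably it continued roughly like this (a guess built only from the table and file already shown in the question):

:~$ cat /var/rs_mail/IN/qv_stock_20160620035119.csv | clickhouse-client --query="INSERT INTO stock FORMAT CSV"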

How to group by time bucket in ClickHouse and fill missing data with nulls/0s

痴心易碎 submitted on 2019-12-10 14:16:13
Question: Suppose I have a given time range. For explanation, let's consider something simple, like the whole year 2018. I want to query data from ClickHouse as a sum aggregation for each quarter, so the result should be 4 rows. The problem is that I have data for only two quarters, so when using GROUP BY quarter, only two rows are returned.

SELECT toStartOfQuarter(created_at) AS time, sum(metric) metric
FROM mytable
WHERE created_at >= toDate(1514761200) AND created_at >= toDateTime(1514761200) AND created…
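One way to get all four quarters (a sketch, not from the thread): generate the expected buckets explicitly, union them in with a zero metric, and re-aggregate. Newer ClickHouse versions also offer ORDER BY ... WITH FILL for this kind of gap filling:

SELECT time, sum(metric) AS metric
FROM
(
    SELECT toStartOfQuarter(created_at) AS time, sum(metric) AS metric
    FROM mytable
    WHERE created_at >= toDate('2018-01-01') AND created_at < toDate('2019-01-01')
    GROUP BY time

    UNION ALL

    -- one zero row per expected quarter, so empty buckets still appear
    SELECT arrayJoin([toDate('2018-01-01'), toDate('2018-04-01'),
                      toDate('2018-07-01'), toDate('2018-10-01')]) AS time,
           0 AS metric
)
GROUP BY time
ORDER BY time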

Data directory permissions on host for Clickhouse installation via docker

巧了我就是萌 submitted on 2019-12-08 11:21:34
Question: My setup for ClickHouse is via Docker (https://hub.docker.com/r/yandex/clickhouse-server/~/dockerfile/). Currently I am running into some issues when mounting the data directory (/var/lib/clickhouse) from the container to the host machine, as I want to persist the data outside of the container runtime. Since the Docker process is responsible for creating the directories on the host (these directories for /var/lib/clickhouse do not exist until running docker with a -v flag), what are the…
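A common way around the ownership problem (a sketch, not from the thread): create the host directory yourself before starting the container, so it exists with your user's permissions instead of being created by the Docker daemon:

$ mkdir -p $HOME/clickhouse-data
$ docker run -d --name some-clickhouse-server \
    -v $HOME/clickhouse-data:/var/lib/clickhouse \
    yandex/clickhouse-server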

How to create primary keys in ClickHouse

独自空忆成欢 submitted on 2019-12-07 05:42:12
Question: I found a few examples in the documentation where primary keys are created by passing parameters to the ENGINE section, but I did not find any description of the arguments to ENGINE, what they mean, or how to create a primary key. Thanks in advance. It would be great to add this info to the documentation if it's not present.

Answer 1: Primary keys are supported by the MergeTree family of storage engines. https://clickhouse.yandex/reference_en.html#MergeTree Note that for most serious tasks, you should…
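A minimal sketch of what the answer points at (table and column names are invented for illustration). In the old parameterized syntax the arguments are the date column, the primary key tuple, and the index granularity; newer versions spell the key out with PARTITION BY / ORDER BY clauses instead:

-- Old parameterized syntax: (date column, primary key tuple, index granularity)
CREATE TABLE visits_old
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
) ENGINE = MergeTree(EventDate, (CounterID, EventDate), 8192);

-- Modern syntax: the sorting key doubles as the primary key
-- unless a separate PRIMARY KEY clause is given
CREATE TABLE visits_new
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);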

ClickHouse Kafka Performance

我的未来我决定 submitted on 2019-12-06 03:44:24
Question: Following the example from the documentation (https://clickhouse.yandex/docs/en/table_engines/kafka/), I created a table with the Kafka engine and a materialized view that pushes data to a MergeTree table. Here is the structure of my tables:

CREATE TABLE games (
    UserId UInt32,
    ActivityType UInt8,
    Amount Float32,
    CurrencyId UInt8,
    Date String
) ENGINE = Kafka('XXXX.eu-west-1.compute.amazonaws.com:9092,XXXX.eu-west-1.compute.amazonaws.com:9092,XXXX.eu-west-1.compute.amazonaws.com:9092', 'games', 'click…
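The CREATE TABLE is cut off above; the remaining pieces the question describes would look roughly like this (a sketch: the target table layout and names are assumptions, not the poster's actual code):

CREATE TABLE games_data
(
    UserId UInt32,
    ActivityType UInt8,
    Amount Float32,
    CurrencyId UInt8,
    Date Date
) ENGINE = MergeTree
PARTITION BY toYYYYMM(Date)
ORDER BY (Date, UserId);

-- The materialized view moves rows from the Kafka consumer table into MergeTree
CREATE MATERIALIZED VIEW games_mv TO games_data AS
SELECT UserId, ActivityType, Amount, CurrencyId, toDate(Date) AS Date
FROM games;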