snappy

Kudu series: Kudu primary key selection strategies

回眸只為那壹抹淺笑 submitted on 2021-02-17 00:06:06
Every Kudu table must have a unique Primary Key, and Kudu tables do not support secondary indexes. Based on actual performance tests, this article gives several strategies for choosing a Kudu primary key; the results corrected some of my long-standing assumptions. A quick description of the test scenario: the table has a unique column id plus a date-dimension column histdate, and there are three ways to define the Kudu PK:
Design 1: (histdate, id) as a composite primary key, date column first.
Design 2: (id, histdate) as a composite primary key, date column last.
Design 3: (id) as a single-column primary key.
The test data is given first. Conclusions:
1. Highly selective columns (such as id) should go at the front of the PK list; this rule has the biggest impact on query performance.
2. Put only the necessary columns in the PK list; the fewer, the better.
3. A query that places conditions on every PK column gets the best performance, but as soon as even one PK column lacks a condition, the PK index is not used at all and performance is poor.
4. The order of the column conditions in the WHERE clause does not matter.
5. Inserting into a Kudu table through the Java API is quite fast, reaching over 10,000 rows/second on a single thread. Kudu updates are also efficient: in a test updating every column of a narrow table, updates ran at 88% of the insert speed, whereas Vertica's updates are much slower than its inserts.
Misconceptions I held before testing:
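As a concrete illustration of conclusion 1 and design 2, here is a minimal sketch that creates a table with the composite primary key (id, histdate), selective column first. The article used the Java client; this sketch uses the kudu-python client instead, and the master address, table name, and the extra amount column are placeholders.

```python
import kudu
from kudu.client import Partitioning

# Connect to the Kudu master (placeholder address).
client = kudu.connect(host='kudu-master.example.com', port=7051)

# Design 2 from the article: composite PK (id, histdate),
# with the highly selective id column listed first.
builder = kudu.schema_builder()
builder.add_column('id').type(kudu.int64).nullable(False)
builder.add_column('histdate').type(kudu.unixtime_micros).nullable(False)
builder.add_column('amount').type(kudu.double)
builder.set_primary_keys(['id', 'histdate'])
schema = builder.build()

# Hash-partition on the leading, selective PK column.
partitioning = Partitioning().add_hash_partitions(column_names=['id'], num_buckets=8)

client.create_table('sales_hist', schema, partitioning)
```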

snappy-c.h: No such file or directory

允我心安 submitted on 2021-02-11 17:06:11
Question: I am unable to install python-snappy with pip. I have all the required packages; can someone please help me with it? # /opt/company/project/3.0.0.1/bin/pip3 install --index-url=http://pypi.company.local:9700 --trusted-host pypi.dvms.local python-snappy ...SNIP... running build_ext generating cffi module 'build/temp.linux-x86_64-3.6/snappy._snappy_cffi.c' creating build/temp.linux-x86_64-3.6 building 'snappy._snappy_cffi' extension creating build/temp.linux-x86_64-3.6/build creating build/temp
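The usual cause of this error is that the snappy C headers (snappy-c.h, shipped by libsnappy-dev on Debian/Ubuntu or snappy-devel on RHEL/CentOS) are missing when pip builds the extension. Once the headers are installed and `pip install python-snappy` succeeds, a quick round-trip like the following sketch confirms the extension actually works; the payload is made up for illustration.

```python
import snappy

# Round-trip a small payload through the C extension that pip just built.
payload = b"snappy-c.h was found and the extension compiled" * 10
compressed = snappy.compress(payload)
assert snappy.uncompress(compressed) == payload
print(f"ok: {len(payload)} bytes -> {len(compressed)} compressed")
```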

Presto query error on hive ORC, Can not read SQL type real from ORC stream of type DOUBLE

ぃ、小莉子 submitted on 2021-01-29 01:51:50
Question: I was able to run a query in Presto to read the non-float columns from a Hive ORC (snappy) table. However, when I select the float datatype columns through the Presto CLI, it gives me the error message below. Any suggestions on an alternative other than changing the field type to double in the target Hive table? presto:sample> select * from emp_detail; Query 20200107_112537_00009_2zpay failed: Error opening Hive split hdfs://ip_address/warehouse/tablespace/managed/hive/sample.db/emp_detail/part

Issue in using snappy with avro in python

◇◆丶佛笑我妖孽 submitted on 2021-01-29 00:47:02
Question: I am reading a .gz file and converting it to Avro format. When I use codec='deflate', it works fine, i.e., I am able to convert to Avro format. When I use codec='snappy', it throws the error below: raise DataFileException("Unknown codec: %r" % codec) avro.datafile.DataFileException: Unknown codec: 'snappy' with deflate --> working fine writer = DataFileWriter(open(avro_file, "wb"), DatumWriter(), schema, codec='deflate') with snappy --> throwing an error writer =
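The "Unknown codec" error usually means the avro library could not load its snappy support, which requires the python-snappy package to be installed in the same environment. Below is a minimal sketch of the snappy-codec writer once that package is present; the schema, file name, and record are made up for illustration, and the schema-parsing function name varies by avro release.

```python
import json

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Hypothetical one-field schema, only to exercise the codec.
schema = avro.schema.parse(json.dumps({  # .Parse in the older avro-python3 package
    "type": "record",
    "name": "Example",
    "fields": [{"name": "line", "type": "string"}],
}))

# Same constructor call as in the question, but with python-snappy installed
# so the 'snappy' codec is available.
writer = DataFileWriter(open("example.avro", "wb"), DatumWriter(), schema, codec="snappy")
writer.append({"line": "converted from the .gz source"})
writer.close()
```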

Set parquet snappy output file size in hive?

北城以北 submitted on 2021-01-27 08:02:33
Question: I'm trying to split parquet/snappy files created by hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, as Impala issues a warning when a file in a partition is larger than the block size. Impala logs the following WARNINGS: Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar) Code: CREATE TABLE <TABLE_NAME>(<FIELDS>) PARTITIONED BY ( year SMALLINT, month TINYINT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
