impala

RImpala: Query Failed When Larger Data

感情迁移 submitted on 2019-12-25 01:55:52
Question: check1<-rimpala.query("select * from sum2") Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.sql.SQLException: Method not supported dim(sum2) is 49501 rows and 18 columns. check1<-rimpala.query("select * from sum3") dim(sum3) is 102 rows and 6 columns. It worked with the smaller sample size. Sorry that I can't provide a reproducible example for this. Has anyone encountered the same problem with larger data sizes? Any idea how to solve this? Thanks.
Answer 1: As noted elsewhere on StackOverflow,

Partition by in Impala SQL throwing an error

北慕城南 submitted on 2019-12-25 01:03:13
Question: I am trying to calculate the running total of loss by month on Impala using TOAD. The following query throws the error "select list expression not produced by aggregation output (missing from group by clause)": select segment, year(open_dt) as open_year, months, sum(balance), sum(loss) over (PARTITION by segment,year(open_dt) order by months) as NCL from tableperf where year(open_dt) between 2015 and 2018 group by 1,2,3
Answer 1: You are mixing aggregation and window functions. I think you
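One common way to combine a GROUP BY with a running total is to nest the aggregate inside the window function, so that the window operates on the grouped sums. The sketch below illustrates that pattern only; it is not the truncated answer above. It runs against an in-memory SQLite database (3.25 or later) rather than Impala, the sample rows are made up, and `open_year` is stored as a plain column instead of being derived with `year(open_dt)`.

```python
import sqlite3

# Assumption: SQLite stands in for Impala here, and the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableperf ("
    "segment TEXT, open_year INTEGER, months INTEGER, "
    "balance INTEGER, loss INTEGER)"
)
conn.executemany(
    "INSERT INTO tableperf VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 2015, 1, 100, 5),
        ("A", 2015, 1, 50, 3),
        ("A", 2015, 2, 70, 2),
        ("A", 2015, 3, 10, 4),
        ("B", 2016, 1, 20, 1),
        ("B", 2016, 2, 30, 6),
    ],
)

# SUM(loss) is the per-group aggregate; the outer SUM(...) OVER (...) turns
# those per-group sums into a running total within each partition.
query = """
SELECT segment, open_year, months,
       SUM(balance) AS total_balance,
       SUM(SUM(loss)) OVER (
           PARTITION BY segment, open_year ORDER BY months
       ) AS ncl
FROM tableperf
GROUP BY segment, open_year, months
ORDER BY segment, open_year, months
"""
running_totals = conn.execute(query).fetchall()
for row in running_totals:
    print(row)
```

The inner aggregate is evaluated per group first, which is why the window expression no longer refers to an ungrouped column.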

REGEXP_EXTRACT in Impala

跟風遠走 submitted on 2019-12-25 00:35:12
Question: I am trying to figure out how to extract the customer ID from a string that looks like this: {"param":"success","value":"10","level":"0","error_code":"101","customer_id":"5b0e9b23e423b0d33c9f7ddfd", "purchases": "13", "last_activity_ts": "123523465"} I am trying to extract the customer ID from strings that contain error code 101 with the following code: select regexp_extract(field, '\"customer_id":"(.*)', 0) from table_name where field rlike '"error_code":"101"' But this gives me the following result:
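Two things in the pattern above are worth checking: the greedy `(.*)` runs to the end of the string, and index 0 asks for the whole match rather than the first capture group. The sketch below reproduces both effects with Python's `re` module on the sample string from the question. It assumes Impala's `regexp_extract` index argument behaves analogously to Python's group numbers, and the narrower pattern `([^"]*)` is a suggested alternative, not something taken from the original thread.

```python
import re

# Sample string copied from the question above.
field = (
    '{"param":"success","value":"10","level":"0","error_code":"101",'
    '"customer_id":"5b0e9b23e423b0d33c9f7ddfd", "purchases": "13", '
    '"last_activity_ts": "123523465"}'
)

# Pattern from the question: greedy, so the group swallows the rest of the string.
greedy = re.search(r'"customer_id":"(.*)', field)
whole_match = greedy.group(0)   # entire match, including the "customer_id" key
greedy_group = greedy.group(1)  # everything after the opening quote

# Narrower pattern: stop at the next double quote, and read group 1.
narrow = re.search(r'"customer_id":"([^"]*)"', field)
customer_id = narrow.group(1)

print(customer_id)
```

In Impala terms, the analogous change would be to use a pattern such as `"customer_id":"([^"]*)"` with index 1 instead of 0.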

Unable to insert 5k/sec records into impala?

不想你离开。 submitted on 2019-12-25 00:15:45
Question: I am exploring Impala for a POC, but I can't get any significant performance out of it. I can't insert 5000 records/sec; at most I was able to insert a mere 200/sec. This is really slow compared with any other database. I tried two different methods, but both are slow: Using Cloudera: First, I installed Cloudera on my system and added the latest CDH 6.2 cluster. I created a Java client to insert data using the ImpalaJDBC41 driver. I am able to insert records, but the speed is terrible. I tried tuning impala by
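Single-row `INSERT ... VALUES` statements tend to be expensive in Impala, since each statement carries its own overhead and writes its own small data file. One mitigation, short of switching to bulk loading, is to send many rows per statement. The sketch below only builds the statement text. The table name `events`, the two integer columns, and the batch size are hypothetical, and restricting values to integers is a shortcut that avoids the question of string escaping.

```python
def build_batched_inserts(table, rows, batch_size):
    """Yield one multi-row INSERT ... VALUES statement per batch of integer rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # int(...) guards against non-integer values slipping into the SQL text.
        values = ", ".join(
            "(" + ", ".join(str(int(v)) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} VALUES {values}"


# Hypothetical data: five rows for a two-column table named "events".
rows = [(i, i * 10) for i in range(5)]
statements = list(build_batched_inserts("events", rows, batch_size=2))
for statement in statements:
    print(statement)
```

Each generated statement could then be executed through whichever client is in use. For sustained high-volume ingestion, writing files and loading them in bulk is usually the better fit than any form of `INSERT ... VALUES`.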

Impala - Get concatenated columns for all tables in a database

白昼怎懂夜的黑 submitted on 2019-12-24 10:39:40
Question: Let's say I have a database A with tables B1 and B2. B1 has columns C1 and C2, and B2 has columns D1, D2 and D3. I am looking for an Impala query that yields the following desired output: B1 | "C1+C2" B2 | "D1+D2+D3" where "D1+D2+D3" and "C1+C2" are concatenated strings.
Answer 1: Do you want the concatenated columns in a new table? Or do you want to add the concatenated columns to your existing tables? Either way, you can use the code below in Impala to concatenate columns: SELECT CONCAT(C1,C2) AS
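If the goal is the literal output shown in the question (one line per table listing its column names joined by "+"), the joining step can also be done on the client. The sketch below assumes the column names have already been collected, for example by running SHOW TABLES and then DESCRIBE for each table. The `schema` dictionary is a hand-written stand-in for that result, and the function name is hypothetical.

```python
def format_table_columns(schema):
    """Return one 'table | "col1+col2+..."' line per table, in insertion order."""
    lines = []
    for table, columns in schema.items():
        joined = "+".join(columns)
        lines.append(f'{table} | "{joined}"')
    return lines


# Hypothetical stand-in for metadata gathered from the database.
schema = {
    "B1": ["C1", "C2"],
    "B2": ["D1", "D2", "D3"],
}
lines = format_table_columns(schema)
for line in lines:
    print(line)
```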

How to load data to Hive table and make it also accessible in Impala

回眸只為那壹抹淺笑 submitted on 2019-12-24 08:00:09
Question: I have a table in Hive: CREATE EXTERNAL TABLE sr2015( creation_date STRING, status STRING, first_3_chars_of_postal_code STRING, intersection_street_1 STRING, intersection_street_2 STRING, ward STRING, service_request_type STRING, division STRING, section STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( 'colelction.delim'='\u0002', 'field.delim'=',', 'mapkey.delim'='\u0003', 'serialization.format'=',', 'skip.header.line.count'='1', 'quoteChar'= "\""

Impala: ERROR: AnalysisException: Partition spec does not exist:

江枫思渺然 submitted on 2019-12-24 02:18:49
Question: I am trying to run the query: show files in tableA partition (column_key1=value1, column_key2=value2) However, that throws an error: ERROR: AnalysisException: Partition spec does not exist: (column_key1=value1, column_key2=value2) even though this partition does exist and contains the files I'm looking for. The objective is to first check whether the partition exists and, if so, show the files in that partition. Related: https://community.cloudera.com/t5/Interactive-Short-cycle-SQL

The Relationship Between Impala and Hive (Explained in Detail)

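我只是一个虾纸丫 submitted on 2019-12-23 21:25:20
The relationship between Impala and Hive: Impala is a real-time analytical query engine for big data that is built on Hive. It uses Hive's metadata store directly, which means that all of Impala's metadata is kept in the Hive metastore. Impala is also compatible with Hive's SQL parsing and implements a subset of Hive's SQL semantics; its functionality is still being extended.
Relationship with Hive: Impala and Hive are both data query tools built on top of Hadoop, each with a different focus. From the client's point of view, however, they have a lot in common, such as table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, and storage resource pools. The relationship between Impala and Hive within Hadoop is shown in the figure below. Hive is suited to long-running batch queries and analysis, while Impala is suited to real-time interactive SQL queries. Impala gives data analysts a big-data analysis tool for experimenting quickly and validating ideas. One can first use Hive for data transformation and then use Impala for fast analysis of the result sets that Hive produces.
Optimization techniques Impala uses relative to Hive: 1. It does not use MapReduce for parallel computation. Although MapReduce is a very good parallel computing framework, it is oriented more toward batch processing than toward interactive SQL execution. Compared with MapReduce, Impala splits the whole query into an execution plan tree rather than a chain of MapReduce jobs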

Pentaho Mondrian Schema: left join of fact table with dimension table

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 01:07:28
Question: I have integrated Pentaho5 EE with Impala. In my schema, dimension values are not gathered from the fact table, as it is a huge table and it takes too long to calculate them. Since dimension values come from dimension tables, Mondrian compiles a query that joins the dimension table with the fact table in that order (i.e., dimension table on the left). The query is slow this way, and I read on the Cloudera website that if you do a join in Impala, the bigger table (the fact table) has to be on the right.

What are the fundamental architectural, SQL compliance, and data use scenario differences between Presto and Impala?

本小妞迷上赌 submitted on 2019-12-22 06:29:17
Question: Can some experts give succinct answers on the differences between Presto and Impala from these perspectives? Fundamental architecture design; SQL compliance; real-world latency; any SPOF or fault-tolerance functionality; structured and unstructured data use scenario performance.
Source: https://stackoverflow.com/questions/19841027/what-are-the-fundamental-architectural-sql-compliance-and-data-use-scenario-di