apache-drill

Apache Drill vs Spark

I have some experience with Apache Spark and Spark SQL. Recently I found the Apache Drill project. Could you describe the most significant advantages/differences between them? I've already read Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill), but this topic is still unclear to me.

Here's an article I came across that discusses some of the SQL technologies: http://www.zdnet.com/article/sql-and-hadoop-its-complicated/ Drill is fundamentally different in both the user's experience and the architecture. For example: Drill is a schema-free query engine. For …
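
To make "schema-free" concrete: a hedged sketch of querying a raw JSON file with Drill directly, with no table definition or schema registration done beforehand (the file path and field names are hypothetical):

    -- Query a JSON file in place; Drill infers the structure as it reads.
    SELECT t.`name`, t.address.city
    FROM dfs.`/data/customers.json` t
    LIMIT 10;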

Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)

I want to do some "near real-time" data analysis (OLAP-like) on data in HDFS. My research showed that the three frameworks mentioned report significant performance gains compared to Apache Hive. Does anyone have practical experience with any of them? Not only concerning performance, but also with respect to stability?

Comparing Hive with Impala, Spark, or Drill sometimes sounds inappropriate to me; the goals behind developing Hive and these tools were different. Hive was never developed for real-time, in-memory processing and is based on MapReduce. It was built for …

Apache Drill using Google Cloud Storage

Question: The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3, but I suspect I'm missing something very simple in terms of Google Cloud Storage. Does anyone have an example storage plugin configuration for Google Cloud Storage? Thanks, M

Answer 1: I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. In order to set …
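
The answer is cut off above, but judging by the shape of Drill's other file-system plugins (and assuming the GCS connector jar is on Drill's classpath, as it is on Dataproc), the storage plugin configuration would presumably look something like this sketch; the bucket name is hypothetical:

    {
      "type": "file",
      "enabled": true,
      "connection": "gs://my-drill-bucket",
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "parquet": {
          "type": "parquet"
        }
      }
    }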

unable to query on RDBMS using apache drill

With Apache Drill 1.2 we can query RDBMS data; see more here: https://drill.apache.org/blog/2015/10/16/drill-1.2-released/ So I tried to add a plugin for MySQL. I am doing it using the web client. I created a plugin with the name mysql and added the following configuration:

    {
      "type": "jdbc",
      "driver": "com.mysql.jdbc.Driver",
      "uri": "jdbc:mysql://<IP>:3306/classicmodels",
      "username": "root",
      "password": "root",
      "enabled": true
    }

I also added mysql.jar in /apache-drill-1.2.0/jars/3rdparty. It is showing an error: (Invalid JSON mapping). Any pointers on this? Is there any documentation for that?
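
One possible cause, going by the Drill RDBMS plugin documentation rather than anything confirmed in this thread: the JDBC storage plugin expects the key "url", not "uri", and an unrecognized key is a plausible source of an "Invalid JSON mapping" error. A hedged sketch of the configuration, keeping the asker's placeholder IP and credentials:

    {
      "type": "jdbc",
      "driver": "com.mysql.jdbc.Driver",
      "url": "jdbc:mysql://<IP>:3306/classicmodels",
      "username": "root",
      "password": "root",
      "enabled": true
    }

Once the plugin registers, tables should be addressable as <plugin>.<database>.<table>, e.g. (table name hypothetical):

    SELECT * FROM mysql.classicmodels.customers LIMIT 10;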

Integrating Spark SQL and Apache Drill through JDBC

I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to make it connect to Drill via JDBC:

    Map<String, String> connectionOptions = new HashMap<String, String>();
    connectionOptions.put("url", args[0]);
    connectionOptions.put("dbtable", args[1]);
    connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");
    DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();

Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the …
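
A common stumbling block with this setup (an assumption here, since the excerpt is cut off) is that Spark's generated schema-probe query quotes identifiers with double quotes while Drill expects backticks. The usual workaround is to register a custom JdbcDialect before calling load(); a minimal Java sketch:

    import org.apache.spark.sql.jdbc.JdbcDialect;
    import org.apache.spark.sql.jdbc.JdbcDialects;

    public final class DrillDialect {
        // Call once before sqlc.read()...load() so that Spark quotes
        // identifiers the way Drill expects (backticks).
        public static void register() {
            JdbcDialects.registerDialect(new JdbcDialect() {
                @Override
                public boolean canHandle(String url) {
                    return url != null && url.startsWith("jdbc:drill");
                }

                @Override
                public String quoteIdentifier(String colName) {
                    return "`" + colName + "`";
                }
            });
        }
    }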

Apache Drill - connection to Drill in Embedded Mode [java]

I want to connect to Drill from a Java app, and so far I have been trying to use JDBC, based on the example from https://github.com/vicenteg/DrillJDBCExample, but when I change the DB_URL static variable to "jdbc:drill:zk=local" and start the app I get an exception:

    java.sql.SQLNonTransientConnectionException: Running Drill in embedded mode using Drill's jdbc-all JDBC driver Jar file alone is not supported.

So far I haven't found any workaround. Any idea how to connect to Drill in embedded mode? I don't want to set up distributed mode yet. There is truly not much about it on the web. Any help …
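
One hedged way around the limitation (not taken from this thread): start Drill yourself with bin/drill-embedded and have the Java app connect to that local Drillbit directly, rather than asking the jdbc-all jar to spin up an embedded engine. A minimal sketch:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DrillConnect {
        public static void main(String[] args) throws Exception {
            // Assumes a Drillbit is already running on this machine
            // (e.g. started separately with bin/drill-embedded).
            Class.forName("org.apache.drill.jdbc.Driver");
            try (Connection conn =
                     DriverManager.getConnection("jdbc:drill:drillbit=localhost");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM cp.`employee.json` LIMIT 5")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }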

Apache Drill has bad performance against SQL Server

I tried using Apache Drill to run a simple join-aggregate query and the speed wasn't really good. My test query was:

    SELECT p.Product_Category, SUM(f.sales)
    FROM facts f
    JOIN Product p ON f.pkey = p.pkey
    GROUP BY p.Product_Category

where facts has about 422,000 rows and Product has 600 rows; the grouping comes back with 4 rows. First I tested this query on SQL Server and got a result back in about 150 ms. With Drill I first tried to connect directly to SQL Server and run the query, but that was slow (about 5 sec). Then I tried saving the tables into JSON files and reading from them, but that was …
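
A first diagnostic step (a suggestion, not something from this thread): inspect the physical plan Drill generates, since with a JDBC source the join and aggregation may be executed inside Drill instead of being pushed down to SQL Server, which would mean fetching the full facts table over JDBC:

    EXPLAIN PLAN FOR
    SELECT p.Product_Category, SUM(f.sales)
    FROM facts f
    JOIN Product p ON f.pkey = p.pkey
    GROUP BY p.Product_Category;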

Write Drill query output to csv (or some other format)

I'm using Drill in embedded mode, and I can't figure out how to save query output other than copying and pasting it.

If you're using sqlline, you can create a new table as CSV as follows:

    use dfs.tmp;
    alter session set `store.format`='csv';
    create table dfs.tmp.my_output as select * from cp.`employee.json`;

Your CSV file(s) will appear in /tmp/my_output. You can also specify !record <file_path> to save all output to a particular file. See the Drill docs.

Andrew Scott Evans: If you are using sqlline, use !record. If you are using a set of queries, you need to specify the exact schema to use. This can be done using …
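
A hedged usage sketch of the !record approach (the file path is hypothetical). Since !record captures the console output verbatim, switching sqlline's output format to CSV first keeps the recorded file parseable:

    !set outputformat csv
    !record /tmp/my_output.csv
    SELECT * FROM cp.`employee.json`;
    !record

The final !record with no argument stops recording.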