Apache Drill using Google Cloud Storage


The Apache Drill features list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I've got it working fine with S3.

2 Answers
  • 2021-01-21 15:01

    I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:

    1. Install Drill and make the GCS connector accessible. The following script can be used as an init action for Dataproc (note that it wasn't thoroughly tested and relies on a local ZooKeeper instance); see the cluster-creation sketch after this list for how to attach it:

      #!/bin/sh
      set -x -e

      # Download and unpack Apache Drill 1.6.0 under /opt
      BASEDIR="/opt/apache-drill-1.6.0"
      mkdir -p ${BASEDIR}
      cd ${BASEDIR}
      wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
      tar -xzvf apache-drill-1.6.0.tar.gz
      mv apache-drill-1.6.0/* .
      rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz

      # Expose the GCS connector that Dataproc ships with on Drill's classpath
      ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar

      # Reuse the cluster's Hadoop configuration so Drill picks up the GCS settings
      mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
      ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml

      # Start the Drill daemon (registers with the local ZooKeeper instance)
      drillbit.sh start

      set +x +e
      
    2. Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note that I copied most of it from the S3 config and made only minor changes):

      {
        "type": "file",
        "enabled": true,
        "connection": "gs://myBucketName",
        "config": null,
        "workspaces": {
          "root": {
            "location": "/",
            "writable": false,
            "defaultInputFormat": null
          },
          "tmp": {
            "location": "/tmp",
            "writable": true,
            "defaultInputFormat": null
          }
        },
        "formats": {
          "psv": {
            "type": "text",
            "extensions": [
              "tbl"
            ],
            "delimiter": "|"
          },
          "csv": {
            "type": "text",
            "extensions": [
              "csv"
            ],
            "delimiter": ","
          },
          "tsv": {
            "type": "text",
            "extensions": [
              "tsv"
            ],
            "delimiter": "\t"
          },
          "parquet": {
            "type": "parquet"
          },
          "json": {
            "type": "json",
            "extensions": [
              "json"
            ]
          },
          "avro": {
            "type": "avro"
          },
          "sequencefile": {
            "type": "sequencefile",
            "extensions": [
              "seq"
            ]
          },
          "csvh": {
            "type": "text",
            "extensions": [
              "csvh"
            ],
            "extractHeader": true,
            "delimiter": ","
          }
        }
      }
      
    3. Query using the following syntax (note the backticks):

      select * from gcs.`root`.`path/to/data/*` limit 10;
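
    For reference, here's a minimal sketch of how the init script from step 1 could be wired into cluster creation with the gcloud CLI. The bucket name, script name, cluster name, and region are placeholders I've assumed, not part of the original setup:

      # Upload the init script, then create a Dataproc cluster that runs it at startup
      gsutil cp install-drill.sh gs://my-init-actions-bucket/install-drill.sh
      gcloud dataproc clusters create drill-cluster \
          --region us-central1 \
          --initialization-actions gs://my-init-actions-bucket/install-drill.sh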
      
  • 2021-01-21 15:20

    This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone looking to query GCS from Drill without using Dataproc, here's what worked for me:

    1. Add the GCS connector JAR file to the jars/3rdparty directory.
    2. Add the following to the core-site.xml file in the conf directory (replace the upper-case values such as YOUR_PROJECT_ID with your own details):
      <property>
        <name>fs.gs.project.id</name>
        <value>YOUR_PROJECT_ID</value>
        <description>
          Optional. Google Cloud Project ID with access to GCS buckets.
          Required only for list buckets and create bucket operations.
        </description>
      </property>
      <property>
        <name>fs.gs.auth.service.account.private.key.id</name>
        <value>YOUR_PRIVATE_KEY_ID</value>
      </property>
      <property>
        <name>fs.gs.auth.service.account.private.key</name>
        <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
      </property>
      <property>
        <name>fs.gs.auth.service.account.email</name>
        <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
        <description>
          The email address associated with the service account used for GCS
          access when fs.gs.auth.service.account.enable is true. Required
          when an authentication key specified in the configuration file
          (Method 1) or a PKCS12 certificate (Method 3) is being used.
        </description>
      </property>
      <property>
        <name>fs.gs.working.dir</name>
        <value>/</value>
        <description>
          The directory that relative gs: URIs resolve against inside the
          default bucket.
        </description>
      </property>
      <property>
        <name>fs.gs.implicit.dir.repair.enable</name>
        <value>true</value>
        <description>
          Whether or not to create objects for the parent directories of objects
          with / in their path, e.g. creating gs://bucket/foo/ upon deleting or
          renaming gs://bucket/foo/bar.
        </description>
      </property>
      <property>
        <name>fs.gs.glob.flatlist.enable</name>
        <value>true</value>
        <description>
          Whether or not to prepopulate potential glob matches in a single list
          request to minimize calls to GCS in nested glob cases.
        </description>
      </property>
      <property>
        <name>fs.gs.copy.with.rewrite.enable</name>
        <value>true</value>
        <description>
          Whether or not to perform copy operations using rewrite requests,
          which allows copying files between different locations and storage
          classes.
        </description>
      </property>
    

    3. Start Apache Drill.

    4. Add a custom storage plugin to Drill, using the same configuration as in the answer above (a sketch of doing this via Drill's REST API follows below).

    5. You're good to go.
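
    The original doesn't spell out how to add the storage plugin, so here's a hedged sketch that registers the configuration from the first answer through Drill's REST API rather than the web console; the plugin name (gcs) and the local file name are assumptions:

      # gcs-plugin.json wraps the storage configuration shown in the first answer:
      #   {"name": "gcs", "config": { ...the file/workspaces/formats config... }}
      curl -X POST -H "Content-Type: application/json" \
          -d @gcs-plugin.json \
          http://localhost:8047/storage/gcs.json

    Once registered, a quick select * from gcs.`root`.`path/to/file.csv` limit 10; (as in the first answer) confirms that Drill can read from the bucket.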

    The solution is from here, where I go into more detail about what we do around data exploration with Apache Drill.
