AWS Athena: Named boto3 queries not creating corresponding tables

拈花ヽ惹草 posted on 2021-01-29 10:24:06

Question


I have the following draft boto3 script:

#!/usr/bin/env python3
import boto3

client = boto3.client('athena')

BUCKETS='buckets.txt'
DATABASE='some_db'
QUERY_STR="""CREATE EXTERNAL TABLE IF NOT EXISTS some_db.{}(
         BucketOwner STRING,
         Bucket STRING,
         RequestDateTime STRING,
         RemoteIP STRING,
         Requester STRING,
         RequestID STRING,
         Operation STRING,
         Key STRING,
         RequestURI_operation STRING,
         RequestURI_key STRING,
         RequestURI_httpProtoversion STRING,
         HTTPstatus STRING,
         ErrorCode STRING,
         BytesSent BIGINT,
         ObjectSize BIGINT,
         TotalTime STRING,
         TurnAroundTime STRING,
         Referrer STRING,
         UserAgent STRING,
         VersionId STRING,
         HostId STRING,
         SigV STRING,
         CipherSuite STRING,
         AuthType STRING,
         EndPoint STRING,
         TLSVersion STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
         'serialization.format' = '1', 'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$' )
LOCATION 's3://my-bucket/{}'"""

with open(BUCKETS, 'r') as f:
    lines = f.readlines()


for line in lines:
    query_string = QUERY_STR.format(line, line)
    response = client.create_named_query(
        Name=line,
        Database=DATABASE,
        QueryString=QUERY_STR
    )
    print(response)

When executed, all responses come back with status code 200.

Why am I not able to see the corresponding tables that should have been created?

Shouldn't I be able to (at least) see somewhere those queries stored?

update1: I am now trying to actually create the tables via the above queries as follows:

for line in lines:
    query_string = QUERY_STR.format(DATABASE, line[:-1].replace('-', '_'), line[:-1])
    try:
        response1 = client.start_query_execution(
            QueryString=query_string,
            WorkGroup=WORKGROUP,
            QueryExecutionContext={
                'Database': DATABASE
            },
            ResultConfiguration={
                'OutputLocation': OUTPUT_BUCKET,
            },
        )
        query_execution_id = response1['ResponseMetadata']['RequestId']
        print(query_execution_id)
    except Exception as e1:
        print(query_string)
        raise(e1)

Once again, the script does output some query ids (no error seems to take place); nonetheless, no table is created.

I have also followed the advice of @John Rotenstein and initialised my boto3 client as follows:

client = boto3.client('athena', region_name='us-east-1')

Answer 1:


First of all, the 200 response simply tells you that your request was submitted successfully. The create_named_query() method only saves a snippet of your query without running it; the snippet can then be seen and accessed in the AWS Athena console under the Saved Queries tab.
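The saved queries can also be inspected through the API instead of the console. The following is a minimal sketch, assuming an already constructed boto3 Athena client is passed in and that the queries live in the default "primary" workgroup; the helper name list_saved_queries is made up for illustration. It relies on the client methods list_named_queries() and batch_get_named_query().

```python
def list_saved_queries(client, workgroup="primary"):
    """Return (name, database) pairs for every named query in a workgroup.

    `client` is expected to be a boto3 Athena client, for example
    boto3.client('athena', region_name='us-east-1').
    The workgroup default is an assumption; adjust it to your setup.
    """
    saved = []
    kwargs = {"WorkGroup": workgroup}
    while True:
        # Each page holds a batch of named query ids.
        page = client.list_named_queries(**kwargs)
        ids = page.get("NamedQueryIds", [])
        if ids:
            # Resolve the ids of this page into full named query records.
            details = client.batch_get_named_query(NamedQueryIds=ids)
            for query in details.get("NamedQueries", []):
                saved.append((query["Name"], query["Database"]))
        token = page.get("NextToken")
        if not token:
            return saved
        kwargs["NextToken"] = token
```

Calling list_saved_queries(client) after the script from the question has run should return one entry per line of buckets.txt, which confirms that the queries were stored but never executed.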

It seems to me that you want to create tables using boto3. If that is the case, you need to use the start_query_execution() method.

Runs the SQL query statements contained in the Query. Requires you to have access to the workgroup in which the query ran.

Getting a 200 response from start_query_execution() doesn't guarantee that your query will execute successfully. As I understand it, this method does some simple pre-execution checks to validate the syntax of the query. However, other things can make your query fail at run time, for example if you try to create a table in a database that doesn't exist, or in a database to which you don't have access.

Here is an example, where I used your query string, formatted with some random name for the table.

I got a 200 response and some value in response1['ResponseMetadata']['RequestId']. However, since I don't have some_db in the AWS Glue catalog, this query failed at run time, so no table was created.

Here is how you can track query execution within boto3:

import time

response1 = client.start_query_execution(
    QueryString=query_string,
    WorkGroup=WORKGROUP,
    QueryExecutionContext={
        'Database': DATABASE
    },
    ResultConfiguration={
        'OutputLocation': OUTPUT_BUCKET,
    },
)
# The execution id is a top-level key of the response;
# ResponseMetadata.RequestId only identifies the HTTP request.
query_execution_id = response1['QueryExecutionId']

while True:
    time.sleep(1)
    response_2 = client.get_query_execution(
        QueryExecutionId=query_execution_id
    )
    # Status is a dict; the state string is stored under 'State'.
    query_status = response_2['QueryExecution']['Status']
    print(query_status)
    # Stop on any terminal state: SUCCEEDED, FAILED or CANCELLED.
    if query_status['State'] not in ["QUEUED", "RUNNING"]:
        break
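The polling loop can be wrapped in a small helper that gives up after a timeout and returns the reason Athena reports for a failure. This is only a sketch under a few assumptions: the helper name wait_for_query and its poll_seconds and timeout_seconds parameters are made up for illustration, and query_execution_id is expected to be the QueryExecutionId value from the start_query_execution() response.

```python
import time

# States after which Athena will not change the query status any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=1.0, timeout_seconds=60.0):
    """Poll get_query_execution() until the query reaches a terminal state.

    Returns a (state, reason) tuple; reason is an empty string when
    Athena does not supply a StateChangeReason.
    Raises TimeoutError if the query is still pending at the deadline.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        status = response["QueryExecution"]["Status"]
        state = status["State"]
        if state in TERMINAL_STATES:
            return state, status.get("StateChangeReason", "")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "query %s still %s after %s seconds"
                % (query_execution_id, state, timeout_seconds)
            )
        time.sleep(poll_seconds)
```

For a CREATE TABLE statement that targets a missing database, the returned state would be FAILED and the reason string should say why, which is more informative than the 200 status code of the submission.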



Answer 2:


To reproduce your situation, I did the following:

  • In the Athena console, I ran:
CREATE DATABASE foo
  • In the Athena console, I selected foo in the Database drop-down
  • To start things simple, I ran this Python code:
import boto3

athena_client = boto3.client('athena', region_name='ap-southeast-2') # Change as necessary

QUERY_STR="""
CREATE EXTERNAL TABLE IF NOT EXISTS foo.bar(id INT) 
LOCATION 's3://my-bucket/input-files/'
"""

response = athena_client.start_query_execution(
    QueryString=QUERY_STR,
    QueryExecutionContext={'Database': 'foo'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-out/'}
)
  • I then went to the Athena console, did a refresh, and confirmed that the bar table was created

Suggestion: Try the above to confirm that it works for you, too!

I then ran your code, using the start_query_execution version of your code (shown in your second code block). I had to make some changes:

  • I didn't have a buckets.txt file, so I just provided a list of names
  • Your code doesn't show the content of OUTPUT_BUCKET, so I used s3://my-bucket/athena-output/ (Does that match the format that you used?)
  • Your code uses QUERY_STR.format(DATABASE... but there was no {} in the QUERY_STR where the database name would be inserted, so I removed DATABASE from the arguments passed to format()
  • I did not provide a value for WORKGROUP

It all ran fine, creating multiple tables.

So, check the above bullet points to see whether any of them caused a problem for you (such as the database name being passed to the format() call).
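The format() point matters because str.format() silently ignores surplus positional arguments: with only two {} placeholders, QUERY_STR.format(DATABASE, table, bucket) puts the database name where the table name should go and the table name into the S3 path. Below is a minimal sketch of building the query string with exactly two arguments. The shortened TEMPLATE (a single id INT column) and the helper name build_table_query are made up for illustration; the real template would be the QUERY_STR from the question.

```python
# Shortened stand-in for QUERY_STR: two placeholders, table name and S3 prefix.
TEMPLATE = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS some_db.{}(id INT) "
    "LOCATION 's3://my-bucket/{}'"
)


def build_table_query(template, line):
    """Turn one line of buckets.txt into a CREATE TABLE statement."""
    # strip() removes the trailing newline kept by readlines(), and also
    # works for a last line that has no newline (unlike line[:-1]).
    bucket = line.strip()
    if not bucket:
        raise ValueError("empty bucket name")
    # Hyphens are replaced in the table name, as in the question's update.
    table = bucket.replace("-", "_")
    # Exactly one argument per placeholder: table first, then bucket.
    return template.format(table, bucket)
```

Printing the generated statement for one bucket before submitting it is a cheap way to check that the table name and the LOCATION ended up in the right places.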



Source: https://stackoverflow.com/questions/58736295/aws-athena-named-boto3-queries-not-creating-corresponding-tables
