How do I identify problematic documents in S3 when querying data in Athena?

我怕爱的太早我们不能终老 提交于 2019-12-13 01:43:04

问题


I have a basic Athena query like this:

SELECT *
FROM my.dataset LIMIT 10

When I try to run it I get an error message like this:

Your query has the following error(s):

HIVE_BAD_DATA: Error parsing field value for field 2: For input string: "32700.000000000004"

How do I identify the S3 document that has the invalid field?


My documents are JSON.

My table looks like this:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` struct<
    `name`: string,
    `score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

回答1:


Inconsistent schema

Inconsistent schema is when values in some rows are of different data type. Let's assume that we have two json files

// inside s3://path/to/bad.json
{"name":"1Patrick", "age":35}
{"name":"1Carlos",  "age":"eleven"}
{"name":"1Fabiana", "age":22}

// inside s3://path/to/good.json
{"name":"2Patrick", "age":35}
{"name":"2Carlos",  "age":11}
{"name":"2Fabiana", "age":22}

Then a simple query SELECT * FROM some_table will fail with

HIVE_BAD_DATA: Error parsing field value 'eleven' for field 1: For input string: "eleven"

However, we can exclude that file within WHERE clause

SELECT 
    "$PATH" AS "source_s3_file", 
    * 
FROM some_table 
WHERE "$PATH" != 's3://path/to/bad.json'

Result:

        source_s3_file | name     | age
---------------------------------------
s3://path/to/good.json | 1Patrick | 35
s3://path/to/good.json | 1Carlos  | 11
s3://path/to/good.json | 1Fabiana | 22

Of course, this is the best case scenario when we know which files are bad. However, you can employ this approach to somewhat manually infer which files are good. You can also use LIKE or regexp_like to walk through multiple files at a time.

SELECT 
    COUNT(*)
FROM some_table 
WHERE regexp_like("$PATH",  's3://path/to/go[a-z]*.json')
-- If this query doesn't fail, that those files are good.

The obvious drawback of such approach is cost to execute query and time spent, especially if it is done file by file.

Malformed records

In the eyes of AWS Athena, good records are those which are formatted as a single JSON per line:

{ "id" : 50, "name":"John" }
{ "id" : 51, "name":"Jane" }
{ "id" : 53, "name":"Jill" }

AWS Athena supports OpenX JSON SerDe library which can be set to evaluate malformed records as NULL by specifying

-- When you create table
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')

when you create table. Thus, the following query will reveal files with malformed records:

SELECT 
    DISTINCT("$PATH")
FROM "some_database"."some_table" 
WHERE(
    col_1 IS NULL AND 
    col_2 IS NULL AND 
    col_3 IS NULL
    -- etc
)

Note: you can use only a single col_1 IS NULL if you are 100% sure that it doesn't contain empty fields other then in corrupted rows.

In general, malformed records are not that big of a deal provided that 'ignore.malformed.json' = 'true'. For example the following query will still succeed For example if a file contains:

{"name": "2Patrick","age": 35,"address": "North Street"}
{
    "name": "2Carlos",
    "age": 11,
    "address": "Flowers Street"
}
{"name": "2Fabiana","age": 22,"address": "Main Street"}

the following query will still succeed

SELECT 
    "$PATH" AS "source_s3_file",
    *
FROM some_table

Result:

              source_s3_file |     name | age | address
-----------------------------|----------|-----|-------------
1 s3://path/to/malformed.json| 2Patrick | 35  | North Street
2 s3://path/to/malformed.json|          |     |
3 s3://path/to/malformed.json|          |     |
4 s3://path/to/malformed.json|          |     |
5 s3://path/to/malformed.json|          |     |
6 s3://path/to/malformed.json|          |     |
7 s3://path/to/malformed.json| 2Fabiana | 22  | Main Street

While with 'ignore.malformed.json' = 'false' (which is the default behaviour) exactly the same query will throw an error

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]



来源:https://stackoverflow.com/questions/58919242/how-do-i-identify-problematic-documents-in-s3-when-querying-data-in-athena

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!