问题
Suppose I have an external table like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
A few of my documents have an invalid profile.score
(a string rather than an integer).
This causes queries in Athena to fail:
"Status": { "State": "FAILED", "StateChangeReason": "HIVE_BAD_DATA: Error parsing field value for field 0: For input string: \"4099999.9999999995\"",
How can I configure Athena to skip the documents that do not fit the external table schema?
The question here is about finding the problematic documents; this question is about skipping them.
回答1:
Here is a sample on how to exclude a particular file
SELECT
*
FROM
"some_database"."some_table"
WHERE(
"$PATH" != 's3://path/to/a/file'
)
Just tested this approach with
SELECT
COUNT(*)
FROM
"some_database"."some_table"
-- Result: 68491573
SELECT
COUNT(*)
FROM
"some_database"."some_table"
WHERE(
"$PATH" != 's3://path/to/a/file'
)
-- Result: 68041452
SELECT
COUNT(*)
FROM
"some_database"."some_table"
WHERE(
"$PATH" = 's3://path/to/a/file'
)
-- Result: 450121
Total: 450121 + 68041452 = 68491573
回答2:
I have faced same issue. Since I could not found a specific solution, I have used a different approach. It might help you. The error is related to bad data in profile field. Since you are using “struct” for profile field, Athena is expecting the profile field’s data in structured fashion in source files. If there is any bad data in this field, you will experience this error.
Can you try below queries:
CREATE EXTERNAL TABLE my.data (
id string,
timestamp string,
profile string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
and use below query to get expected result
select
id
,timestamp
,socialdata
,json_extract_scalar(profile, '$.name')profile_name
,json_extract_scalar(profile, '$.score')profile_score
from my.data;
You can visit this link for more.
来源:https://stackoverflow.com/questions/58936088/how-to-skip-documents-that-do-not-match-schema-in-athena