How to skip documents that do not match schema in Athena?

Submitted by 家住魔仙堡 on 2020-01-06 02:36:05

Question


Suppose I have an external table like this:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` struct<
    `name`: string,
    `score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

A few of my documents have an invalid profile.score (a string rather than an integer).

This causes queries in Athena to fail:

"Status": { "State": "FAILED", "StateChangeReason": "HIVE_BAD_DATA: Error parsing field value for field 0: For input string: \"4099999.9999999995\"",

How can I configure Athena to skip the documents that do not fit the external table schema?


The question here is about finding the problematic documents; this question is about skipping them.


Answer 1:


Here is an example of how to exclude a particular file by filtering on the "$PATH" pseudo-column:

SELECT
   * 
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" != 's3://path/to/a/file'
)

I just tested this approach with the following queries:

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
-- Result: 68491573

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" != 's3://path/to/a/file'
)
-- Result: 68041452

SELECT 
   COUNT(*)
FROM 
    "some_database"."some_table"
WHERE(
  "$PATH" = 's3://path/to/a/file'
)
-- Result: 450121

Total: 450121 + 68041452 = 68491573
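
If more than one file turns out to be bad, the same filter can presumably be extended with NOT IN. This is only a sketch: the two file paths are hypothetical placeholders, and the database and table names are the ones used in the queries above.

```sql
-- Sketch: skip several known-bad files at once.
-- The S3 paths below are hypothetical placeholders, not real objects.
SELECT
   *
FROM
    "some_database"."some_table"
WHERE
  "$PATH" NOT IN (
    's3://path/to/a/file',
    's3://path/to/another/file'
  )
```

Note that this skips whole files, so any valid documents stored in those files are excluded as well.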




Answer 2:


I faced the same issue. Since I could not find a specific solution, I used a different approach, which might help you. The error is caused by bad data in the profile field. Because profile is declared as a “struct”, Athena expects every document to carry that field in exactly the declared structure, with score as an int. If any document has bad data in this field, the query fails with this error.

Try declaring profile as a plain string instead:

CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');

Then use the query below to extract the nested values:

SELECT
  id,
  "timestamp",
  json_extract_scalar(profile, '$.name') AS profile_name,
  json_extract_scalar(profile, '$.score') AS profile_score
FROM my.data;

You can visit this link for more.
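
To actually skip the bad documents rather than just read them, one option is to filter on TRY_CAST, which returns NULL instead of raising an error when a value cannot be converted. The following is a minimal sketch, assuming the string-typed my.data table from this answer; it has not been run against the asker's data.

```sql
-- Sketch: keep only rows whose profile.score parses as an integer.
-- Assumes the table where profile is declared as a string.
SELECT
  id,
  "timestamp",
  json_extract_scalar(profile, '$.name') AS profile_name,
  TRY_CAST(json_extract_scalar(profile, '$.score') AS integer) AS profile_score
FROM my.data
-- TRY_CAST yields NULL for values such as '4099999.9999999995'
WHERE TRY_CAST(json_extract_scalar(profile, '$.score') AS integer) IS NOT NULL;
```

This filter also drops rows where score is missing altogether. To keep those, add "OR json_extract_scalar(profile, '$.score') IS NULL" to the WHERE clause.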



Source: https://stackoverflow.com/questions/58936088/how-to-skip-documents-that-do-not-match-schema-in-athena
