How to handle embed line breaks in AWS Athena

微笑、不失礼 提交于 2020-06-25 10:03:17

问题


I have created a table in AWS Athena like this:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'separatorChar' = ',',
 'quoteChar' = '\"',
 'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'

In the bucket I put a simple CSV file with the following context:

rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"

When I run data preview request SELECT * FROM "default"."test_line_breaks" limit 10; then Athena returns the following response:

How should I set ROW FORMAT to properly handle line breaks within the field values? So that rec3\ncol2 appears in col2.


回答1:


The problem here is that the OpenCSV Serializer-Deserializer

Does not support embedded line breaks in CSV files.

See this documentation from AWS.

However, it might be possible to use RegexSerDe. Just remember that this Deserializer will take "Java Flavored" Regex. So be sure to use an online Regex tool that supports that syntax in your debugging.

Edit: Still working on the syntax for dealing with the embedded line feed \n. However, here is a sample that handles two columns with optional quotes. The following regex "*([^"]*)"*,"*([^"]*)"* worked on your line with the embedded return carriage. However, I think the Presto Engine is only feeding it rec3 col1,"rec3. I continue working on it.

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';


来源:https://stackoverflow.com/questions/47936191/how-to-handle-embed-line-breaks-in-aws-athena

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!