Aws Athena - Create external table skipping first row

天涯浪子 提交于 2019-12-17 20:34:49

问题


I'm trying to create an external table on csv files with Aws Athena with the code below but the line TBLPROPERTIES ("skip.header.line.count"="1") doesn't work: it doesn't skip the first line (header) of the csv file.

CREATE EXTERNAL TABLE mytable
(
  colA string,
  colB int
  )

 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   )
STORED AS TEXTFILE
LOCATION 's3://mybucket/mylocation/'
TBLPROPERTIES (
  "skip.header.line.count"="1")

Any advise?


回答1:


Just tried the "skip.header.line.count"="1" and seems to be working fine now.




回答2:


This is a feature that has not yet been implemented. See Abhishek@AWS' response here:

"We are working on it and will report back as soon as we have an outcome. Sorry for this again. This ended up taking longer than what we anticipated."

My workaround has been to preprocess the data before creating the table:

  1. download the csv file from S3
  2. strip the header using bash sed -e 1d -e 's/\"//g' file.csv > file-2.csv
  3. upload the results to its own folder on S3
  4. create the table



回答3:


I recently tried:

TBLPROPERTIES ('skip.header.line.count'='1')

And it works fine now. This issue arose when I had the column header as a string (timestamp) and the records where actual timestamps. My queries would bomb as it would scan the table and find a string instead of timestamp.

Something like this:

ts
2015-06-14 14:45:19.537
2015-06-14 14:50:20.546




回答4:


When this question was asked there was no support for skipping headers, and when it was later introduced it was only for the OpenCSVSerDe, not for LazySimpleSerDe, which is what you get when you specify ROW FORMAT DELIMITED FIELDS …. I think this is what has caused some confusion about whether or not it works in the answers to this question.




回答5:


On the AWS Console you can specify it as Serde parameters key-value keypair

While if you apply your infrastructure as code with terraform you can use ser_de_info parameter - "skip.header.line.count" = 1. Example bellow

resource "aws_glue_catalog_table" "banana_datalake_table" {
  name          = "mapping"
  database_name = "banana_datalake"
  table_type    = "EXTERNAL_TABLE"
    owner = "owner"
    storage_descriptor {
        location        = "s3://banana_bucket/"
        input_format    = "org.apache.hadoop.mapred.TextInputFormat"
        output_format   = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
        compressed  = "false"
        number_of_buckets = -1    
        ser_de_info {
            name    = "SerDeCsv"
            serialization_library = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            parameters {
                "field.delim" = ","
                "skip.header.line.count" = 1    # Skip file headers
            }
        }
        columns {
            name    = "column_1"
            type    = "string"
        }
        columns {
            name    = "column_2"
            type    = "string"
        }
        columns {
            name    = "column_3"
            type    = "string"
        }
    }
}


来源:https://stackoverflow.com/questions/46452010/aws-athena-create-external-table-skipping-first-row

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!