Question
I am trying to read CSV data from an S3 bucket and create a table over it in AWS Athena. The table I created does not skip the header line of my CSV file.
Example query:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  `event_type_id` string,
  `customer_id` string,
  `date` string,
  `email` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|", "quoteChar" = "\"" )
LOCATION 's3://location/'
TBLPROPERTIES ("skip.header.line.count"="1");
skip.header.line.count doesn't seem to work: the header line still shows up as a row. I think AWS has some issue with this. Is there any other way to skip the header?
Answer 1:
This is what works in Redshift: use table properties ('skip.header.line.count'='1'), along with other properties if you want, e.g. 'numRows'='100'.
Here's a sample:
create external table exreddb1.test_table
(ID BIGINT
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
Answer 2:
This is a known deficiency.
The best method I've seen was tweeted by Eric Hammond:
...WHERE date NOT LIKE '#%'
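This appears to skip header lines at query time. I'm not sure exactly how it works, but the predicate drops rows whose date value starts with '#', so it only helps when the header line is marked that way. It also drops rows where date is NULL, because the comparison then evaluates to NULL rather than true.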
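To make the behaviour of that filter concrete, here is a minimal sketch that runs the same predicate against a small in-memory table. It uses Python's sqlite3 module as a stand-in for Athena, so it only illustrates how NOT LIKE '#%' treats header-like and NULL values; it does not reproduce Athena itself. The table name events, the sample rows, and the '#'-prefixed header are all made up for the illustration.

```python
import sqlite3


def rows_without_header(rows):
    """Load rows into an in-memory table and apply the NOT LIKE '#%' filter."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column table; "date" is quoted because it is also a type name.
    conn.execute('CREATE TABLE events (customer_id TEXT, "date" TEXT)')
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer_id, \"date\" FROM events "
        "WHERE \"date\" NOT LIKE '#%' ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return result


sample = [
    ("#customer_id", "#date"),  # header line, assumed to be marked with '#'
    ("c1", "2017-08-03"),
    ("c2", None),               # NULL date: NULL NOT LIKE '#%' is NULL, so the row is dropped
    ("c3", "2017-08-04"),
]

filtered = rows_without_header(sample)
print(filtered)
```

Only the c1 and c3 rows survive. A header such as event_type_id|customer_id|date|email, which does not start with '#', would not be removed by this predicate.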
Answer 3:
As of today (2019-11-18), the query from the OP seems to work, i.e. skip.header.line.count is honored and the first line is indeed skipped.
Source: https://stackoverflow.com/questions/45488792/how-to-skip-headers-when-we-are-reading-data-from-a-csv-file-in-s3-and-creating