Vertica performance degradation when loading Parquet files compared with delimited files from S3 into Vertica

Submitted by 瘦欲@ on 2019-12-11 14:10:16

Question


I have Parquet files containing 2 billion records with GZIP compression, and the same data with Snappy compression. I also have delimited files for the same 2 billion records. We have 72 Vertica nodes in AWS production, and when moving data from S3 to Vertica with the COPY command we see a large increase in load time for the Parquet files compared with the delimited files. Parquet takes 7x longer than delimited, even though the delimited files are 50x larger than the Parquet files.

Below are the stats for the test we conducted.

Total file sizes are

Parquet GZIP - 6 GB

Parquet Snappy - 9.2 GB

Delimited - 450GB

Below are the COPY commands used for the Parquet and delimited files. We saw an improvement of about 2 minutes when we removed NO COMMIT from the COPY query.

Delimited files

COPY schema.table1 ( col1,col2,col3,col4,col5 ) FROM 's3://path_to_delimited_s3/*' DELIMITER E'\001' NULL AS '\N' NO ESCAPE ABORT ON ERROR DIRECT NO COMMIT;

Parquet files

COPY schema.table2 (col1,col2,col3,col4,col5 ) FROM 's3://path_to_parquet_s3/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT;

We are surprised to see this slowdown with Parquet files. Is this expected for a Parquet COPY? Any pointers or thoughts would be really helpful.

Thanks


Answer 1:


Without knowing anything more, it's difficult to answer. You should, again, monitor LOAD_STREAMS to find out what's going on.
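As a rough sketch of what monitoring LOAD_STREAMS could look like, the query below lists the loads that are currently executing together with their parse and sort progress. It assumes the system table is reachable as v_monitor.load_streams and that the listed progress columns exist in your Vertica version; check the column names against your version's documentation before relying on them.

```sql
-- Sketch: show progress of all loads that are currently running.
-- Assumption: column names match the LOAD_STREAMS system table in your version.
SELECT
    transaction_id,
    statement_id,
    table_name,
    load_start,
    accepted_row_count,
    rejected_row_count,
    parse_complete_percent,
    sort_complete_percent
FROM v_monitor.load_streams
WHERE is_executing
ORDER BY load_start;
```

Running this a few times during the Parquet load and again during the delimited load should show whether the time goes into the parse phase or the sort phase.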

One reason could be that the various files in s3://path_to_delimited_s3/* for the CSV version are optimally distributed among the nodes taking part in your load, which considerably enhances parallelism.

To count the number of parsing threads while your load is running, find your running load in LOAD_STREAMS (WHERE is_executing ...), and then join it with LOAD_SOURCES USING (transaction_id, statement_id).
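The join described above might be written roughly as follows. This is only a sketch: it counts the load sources per node as a proxy for how widely the parse is spread, and it assumes that LOAD_SOURCES exposes a node_name column and that both tables live in the v_monitor schema. Thread-level detail, if your version exposes it, would come from additional LOAD_SOURCES columns not shown here.

```sql
-- Sketch: for every running load, count its sources per node.
-- Assumption: node_name is available in LOAD_SOURCES in your version.
SELECT
    transaction_id,
    statement_id,
    src.node_name,
    COUNT(*) AS source_count
FROM v_monitor.load_streams AS ls
JOIN v_monitor.load_sources AS src
    USING (transaction_id, statement_id)
WHERE ls.is_executing
GROUP BY transaction_id, statement_id, src.node_name
ORDER BY transaction_id, statement_id, src.node_name;
```

If the delimited load shows sources on most of the 72 nodes while the Parquet load shows them on only a few, uneven distribution of the files would be a plausible explanation for the difference in load time.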



Source: https://stackoverflow.com/questions/58630873/vertica-performance-degradation-while-loading-parquet-files-over-delimited-files
