Delta/Incremental Load in Hive


Updates are a notoriously difficult problem for any Hive-based system.

One typical approach is a two-step process:

  1. Insert any data that has changed into one table. As you said, this will result in duplicates when rows are updated.
  2. Periodically overwrite a second table with "de-duplicated" data from the first table.

The second step is potentially painful, but there's really no way around it. At some level, you have to be overwriting, since Hive doesn't do in-place updating. Depending on your data, you may be able to partition the tables cleverly enough to avoid doing full overwrites, though. For example, if step 1 only inserts into a handful of partitions, then only those partitions need to be overwritten into the second table.
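
A minimal HiveQL sketch of that two-step pattern, using the window functions available in Hive 0.11+. The table and column names (staging_events, base_events, id, payload, updated_at) are hypothetical stand-ins:

-- Step 1: append new and changed rows; duplicates per key are allowed here.
INSERT INTO TABLE staging_events
SELECT id, payload, updated_at FROM incoming_batch;

-- Step 2: periodically rebuild the de-duplicated table, keeping only the
-- latest version of each key. INSERT OVERWRITE rewrites the whole table
-- (or just the affected partitions, if the tables are partitioned cleverly).
INSERT OVERWRITE TABLE base_events
SELECT id, payload, updated_at
FROM (
  SELECT id, payload, updated_at,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM staging_events
) ranked
WHERE rn = 1;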

Also, depending on the access pattern, it can make sense to just have the second "de-duplicated" table be a view and not materialize it at all. Usually this just delays the pain to query time, though.
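
If you go the view route, the de-duplication logic simply moves into the view definition (same hypothetical names as the sketch above); every query against it then pays the window-function cost:

CREATE VIEW base_events_dedup AS
SELECT id, payload, updated_at
FROM (
  SELECT id, payload, updated_at,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM staging_events
) ranked
WHERE rn = 1;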

The only other way round this I've seen is using a very custom input and output format. Rather than explain it all, you can read about it here: http://pkghosh.wordpress.com/2012/07/08/making-hive-squawk-like-a-real-database/

Owen O'Malley has also been working on adding a version of this idea to standard Hive, but that's still in development: https://issues.apache.org/jira/browse/HIVE-5317
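
For readers on later Hive releases where that work has since shipped (transactional tables stored as ORC), the resulting syntax looks roughly like the sketch below; treat it as illustrative rather than a drop-in recipe, since it also needs the Hive transaction manager configured:

-- A bucketed ORC table marked transactional can be updated in place.
CREATE TABLE events_acid (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

UPDATE events_acid SET payload = 'corrected' WHERE id = 42;
DELETE FROM events_acid WHERE id = 7;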

You can use a direct MapReduce approach for bulk insert, update, and delete; it's essentially a merge-and-compact operation. A secondary sort is performed on a timestamp or sequence field, either stored in the record or encoded in the HDFS file names, and the last version of each record from the reduce-side join is emitted as output. Details are here:

https://pkghosh.wordpress.com/2015/04/26/bulk-insert-update-and-delete-in-hadoop-data-lake/
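
The same merge-and-compact idea can also be expressed in HiveQL rather than hand-written MapReduce; this is a sketch of the equivalent logic, not the implementation from the linked post (events_base, events_delta, events_compacted, and seq are hypothetical names):

-- Union the existing data with the incoming delta, then keep only the
-- record with the highest sequence/timestamp value per key.
INSERT OVERWRITE TABLE events_compacted
SELECT id, payload, seq
FROM (
  SELECT id, payload, seq,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
  FROM (
    SELECT id, payload, seq FROM events_base
    UNION ALL
    SELECT id, payload, seq FROM events_delta
  ) unioned
) ranked
WHERE rn = 1;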

We had quite similar problems inserting bulk data into our data lake. Since we were not in control of the data, we had a tough time keeping the lake free of duplicates. Note that this is not about updating records in Hive, but about avoiding repeated insertion of the same record.

I created a Pig script for this task:

-- Co-group the existing lake data and the (already de-duplicated) daily batch on the business key.
CODATA = COGROUP HISTORICAL_DATA BY (key_col_1, key_col_2, ...),
                 DAILY_DATA_DISTINCT BY (key_col_1, key_col_2, ...);
-- Keep only groups with no matching historical record, i.e. keys we have not seen before.
CODATA_FILTERED = FILTER CODATA BY IsEmpty(HISTORICAL_DATA);
-- Flatten the daily-side bag ($2) back into plain records.
SET_DIFFERENCE = FOREACH CODATA_FILTERED GENERATE FLATTEN($2);
-- Empty relation that carries the daily schema; UNIONing with it keeps the schema on the result.
DUMMY = FILTER DAILY_DATA_DISTINCT BY $0=='';
DAILY_NEW_DATA = UNION DUMMY, SET_DIFFERENCE;

It builds the set difference, i.e. the daily records whose keys are not yet present in the historical data. Apache DataFu's SetDifference does the same, but we were not able to use it in-house.

I put together a solution for delta load built around a shell script; you just schedule the job and it incrementally appends new rows into your Hive database. For the complete solution, follow this link:

https://bigdata-analytix.blogspot.com/2018/10/hive-incrementaldelta-load.html
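
The linked solution is shell-driven, but the heart of any such job is an incremental query keyed off the last successful load. A minimal sketch, assuming a modified_ts column and a timestamp handed in by the scheduler (both hypothetical):

-- ${last_load_ts} would be substituted by the wrapper script,
-- e.g. hive --hivevar last_load_ts='2019-11-01 00:00:00' -f incremental.sql
INSERT INTO TABLE hive_target
SELECT id, payload, modified_ts
FROM source_staging
WHERE modified_ts > '${last_load_ts}';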
