What happens when a Hive insert fails halfway?

给你一囗甜甜゛ submitted on 2021-02-15 03:13:20

Question


Suppose an insert is expected to load 100 records into a Hive table, 40 records have been inserted, and the insert then fails for some reason. Will the transaction roll back completely, undoing the 40 records that were inserted? Or will we see 40 records in the Hive table even though the insert query failed?


Answer 1:


The operation is atomic (even for a non-ACID table): when you insert or rewrite data using HiveQL, Hive writes the data to a temporary location, and only if the command succeeds are the files moved to the table location (old files are deleted in the case of INSERT OVERWRITE). If the SQL statement fails, the data remains as it was before the statement executed.
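As a concrete sketch of this behavior (the table and column names here are hypothetical, purely for illustration):

```sql
-- Hypothetical target and staging tables.
CREATE TABLE sales (id INT, amount DECIMAL(10,2)) STORED AS ORC;

-- Hive writes the result of the SELECT to a scratch directory first;
-- the files are moved into the table's location only after the whole
-- query succeeds.
INSERT INTO TABLE sales
SELECT id, amount FROM staging_sales;

-- If the query fails partway (e.g. a task error after some rows were
-- produced), no files are moved, so `sales` still contains exactly
-- what it held before the statement ran.
```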

Note about S3 direct writes: the direct-writes-to-S3 feature should be disabled so that Hive writes to a temporary location and rewrites the target folder only if the operation succeeds:

-- Disable AWS S3 direct writes:
set hive.allow.move.on.s3=true; 

Also read this documentation for more details on which ACID features are supported in concurrency mode, and their limitations: What is ACID and why should you use it?

Up until Hive 0.13, atomicity, consistency, and durability were provided at the partition level. Isolation could be provided by turning on one of the available locking mechanisms (ZooKeeper or in memory). With the addition of transactions in Hive 0.13 it is now possible to provide full ACID semantics at the row level, so that one application can add rows while another reads from the same partition without interfering with each other.
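For reference, a minimal sketch of enabling row-level ACID semantics (Hive 0.14+). The session settings and table properties follow the Hive transactions documentation; the table name and columns are hypothetical:

```sql
-- Session-level settings required for ACID transactions.
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be bucketed, stored as ORC, and marked transactional.
CREATE TABLE events (id INT, payload STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
```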

Also read this about Hive locks with ACID enabled (transactional and non-transactional tables).

Important addition about S3 eventual consistency. On S3, files are immediately consistent after create but only eventually consistent after delete or overwrite. You can greatly reduce the probability of an eventual-consistency issue by using timestamp-based partition folders or filenames, or GUID-prefixed filenames. Qubole provides additional configuration parameters for prefixing files with a GUID; this helps eliminate the issue because each write produces new files with a new GUID prefix, and files with a different GUID are removed:

set hive.qubole.dynpart.use.prefix=true;
set hive.qubole.dynpart.bulk.delete=true;

If you do not use Qubole, you can create partitions whose location contains a timestamp. If you drop a partition in Hive and create a new one pointing at a new timestamped location, you can completely eliminate the eventual-consistency problem: you never rewrite files, because each write goes to a different location, and when you drop the previous location it does not matter when the delete becomes consistent, since that location is no longer mounted in Hive. This does require additional partition manipulation; for small tables you can ignore the issue. Also keep the number of files per partition low, which reduces the time it takes for data to become consistent.
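The partition-swap technique described above might look like this (the table, partition key, bucket name, and timestamp suffix are all hypothetical):

```sql
-- Drop the partition pointing at the old location, then re-add it
-- pointing at a fresh, timestamped location written by the new load.
ALTER TABLE sales DROP IF EXISTS PARTITION (dt='2021-02-15');
ALTER TABLE sales ADD PARTITION (dt='2021-02-15')
LOCATION 's3://my-bucket/sales/dt=2021-02-15/ts=20210215T031320/';

-- Readers only ever see the new location; the old files can be cleaned
-- up later, so S3's eventual consistency on delete never affects queries.
```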

See also these related answers about eventual consistency in S3:

https://stackoverflow.com/a/58706140/2700344

https://stackoverflow.com/a/56192799/2700344

https://stackoverflow.com/a/42677748/2700344



Source: https://stackoverflow.com/questions/63355982/what-happens-when-a-hive-insert-is-failed-halfway
