I have installed Hadoop, Hive, and Hive JDBC, and they are running fine for me. But I still have a problem: how do I delete or update a single record using Hive? UPDATE and DELETE of a record aren't allowed in Hive, but INSERT INTO is acceptable.
A snippet from Hadoop: The Definitive Guide (3rd edition):
Updates, transactions, and indexes are mainstays of traditional databases. Yet, until recently, these features have not been considered a part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table. For a data warehousing application that runs over large portions of the dataset, this works well.
Hive doesn't support updates (or deletes), but it does support INSERT INTO, so it is possible to add new rows to an existing table.
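For reference, the INSERT INTO pattern the book refers to only appends rows; it never touches existing ones. A minimal sketch (the employee / employee_staging tables and their columns are made-up examples, not from the book):
-- Appends the selected rows to employee; existing rows are left untouched.
INSERT INTO TABLE employee
SELECT id, name, dept FROM employee_staging;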
Yes, rightly said: Hive does not support the UPDATE operation. But the following alternative can be used to achieve the same result:
Update records in a partitioned Hive table:
Join the two tables (the main and staging tables) using a LEFT OUTER JOIN operation, as below:
INSERT OVERWRITE TABLE main_table PARTITION (c, d)
SELECT t2.a, t2.b, t2.c, t2.d
FROM staging_table t2
LEFT OUTER JOIN main_table t1 ON t1.a = t2.a;
In the above example, main_table and staging_table are partitioned on the (c, d) keys. The tables are joined via a LEFT OUTER JOIN and the result is used to OVERWRITE the partitions in main_table.
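One practical note: a dynamic-partition INSERT OVERWRITE like the one above usually requires dynamic partitioning to be enabled first. These are standard Hive settings, though they are not mentioned in the original answer:
-- Enable dynamic-partition inserts before running the statement above.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;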
A similar approach can be used for UPDATE operations on an un-partitioned Hive table too; see the sketch below.
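For an un-partitioned table the whole table has to be rewritten in a single INSERT OVERWRITE. A minimal sketch, assuming the same main_table / staging_table layout as above with a as the join key (the FULL OUTER JOIN / COALESCE pattern here is one common way to do it and is my addition, not part of the original answer):
-- Rewrite main_table, preferring the staging row wherever the key matches
-- and keeping the existing row otherwise.
INSERT OVERWRITE TABLE main_table
SELECT
  COALESCE(t2.a, t1.a) AS a,
  COALESCE(t2.b, t1.b) AS b,
  COALESCE(t2.c, t1.c) AS c,
  COALESCE(t2.d, t1.d) AS d
FROM main_table t1
FULL OUTER JOIN staging_table t2 ON t1.a = t2.a;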
If you want to delete all records, then as a workaround you can load an empty file into the table in OVERWRITE mode:
hive> LOAD DATA LOCAL INPATH '/root/hadoop/textfiles/empty.txt' OVERWRITE INTO TABLE employee;
Loading data to table default.employee
Table default.employee stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.19 seconds
hive> SELECT * FROM employee;
OK
Time taken: 0.052 seconds
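If you only need to remove particular rows rather than the whole table, the same OVERWRITE idea works by re-selecting everything except the rows you want gone. A sketch, assuming a hypothetical id column on the employee table:
-- Rewrite the table, keeping every row except the one being "deleted".
INSERT OVERWRITE TABLE employee
SELECT * FROM employee WHERE id <> 1001;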