I have installed Hadoop, Hive, and the Hive JDBC driver, which are running fine for me. But I still have a problem: how can I delete or update a single record using Hive, since the usual DELETE and UPDATE commands do not work for me?
DELETE has recently been added in Hive version 0.14. Deletes can only be performed on tables that support ACID. Below is the link from Apache:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
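For reference, here is a minimal sketch of what that looks like. The table name, columns, and bucket count are made up for illustration, and the session settings assume the standard Hive 0.14+ transaction manager setup (your cluster may need additional compactor/metastore settings):

-- ACID operations require these settings (often configured in hive-site.xml)
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- The table must be bucketed, stored as ORC, and flagged as transactional
CREATE TABLE student_acid (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level delete, supported from Hive 0.14 on ACID tables only
DELETE FROM student_acid WHERE id = 1;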
Upcoming version of Hive is going to allow SET based update/delete handling which is of utmost importance when trying to do CRUD operations on a 'bunch' of rows instead of taking one row at a time.
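To illustrate, the SET-based syntax that ships with the ACID work looks like the sketch below; the table and column names are hypothetical, and the table has to be a bucketed, ORC-backed, transactional table as described above:

-- Set-based update: changes every row matching the WHERE clause in one statement
UPDATE student_acid
SET name = 'unknown'
WHERE id = 1;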
In the interim, I have tried a dynamic-partition-based approach documented here: http://linkd.in/1Fq3wdb .
Please see if it suits your need.
You should not think of Hive as a regular RDBMS; Hive is better suited for batch processing over very large sets of immutable data.
The following applies to versions prior to Hive 0.14, see the answer by ashtonium for later versions.
There is no operation supported for deletion or update of a particular record or particular set of records, and to me this is more a sign of a poor schema.
Here is what you can find in the official documentation:
Hadoop is a batch processing system and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when data sets involved are very small (say a few hundred megabytes). As a result it cannot be compared with systems such as Oracle where analyses are conducted on a significantly smaller amount of data but the analyses proceed much more iteratively with the response times between iterations being less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs).
A way to work around this limitation is to use partitions. I don't know what your id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id; then you would be able to easily drop partitions for the ids you want to get rid of, as in the sketch below.
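Here is a hedged sketch of that redesign; the table name, columns, and partition column are assumptions, not your actual schema:

-- Partition the table by the column you later want to delete by
CREATE TABLE student_part (name STRING, grade STRING)
PARTITIONED BY (id INT)
STORED AS ORC;

-- Deleting a whole batch of rows then becomes a cheap metadata operation
ALTER TABLE student_part DROP IF EXISTS PARTITION (id = 1);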
Good news: inserts, updates, and deletes are now possible on Hive/Impala using Kudu.
You need to use Impala/Kudu to maintain the tables and perform insert/update/delete on records. Details with examples can be found here: insert-update-delete-on-hadoop
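As a rough sketch of the Impala SQL involved (the table definition, primary key, and partitioning scheme here are assumptions for illustration, not a definitive setup):

-- Kudu tables need a primary key and are created from Impala with STORED AS KUDU
CREATE TABLE student_kudu (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Row-level update and delete are then regular statements
UPDATE student_kudu SET name = 'updated' WHERE id = 1;
DELETE FROM student_kudu WHERE id = 1;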
Please share the news if you are excited.
-MIK
To achieve your current need, you need to run the query below:
insert overwrite table student
select * from student
where id <> 1;
This overwrites the table's contents in place, keeping all rows except the ones you want to exclude/delete.
I tried this on Hive 1.2.1
You can delete rows from a table using a workaround, in which you overwrite the table with the dataset you want left in it after the operation.
insert overwrite table your_table
select * from your_table
where id <> 1
;
The workaround is useful mostly for bulk deletions of easily identifiable rows. Also, doing this can obviously muck up your data, so a backup of the table is advised, as is care when planning the "deletion" rule.
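A simple way to take such a backup before running the overwrite is a CTAS copy; the table names here are placeholders:

-- Snapshot the current data before the destructive overwrite
CREATE TABLE your_table_backup AS
SELECT * FROM your_table;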