How to delete and update a record in Hive

梦如初夏 2020-11-28 19:26

I have installed Hadoop, Hive, and Hive JDBC, which are all running fine for me. But I still have a problem: how do I delete or update a single record using Hive, because delete or upd

15 answers
  • 2020-11-28 19:46

    Delete was recently added in Hive version 0.14. Deletes can only be performed on tables that support ACID. Below is the link from the Apache documentation:

    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
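
    As a rough illustration of what "a table that supports ACID" means in practice, here is a minimal sketch for the Hive 0.14 to 2.x line. The table `student_acid`, its columns, and the bucket count are hypothetical, and the `SET` lines assume the cluster has not already configured transactions in hive-site.xml.

    ```sql
    -- Session settings that ACID operations usually need
    -- (assumption: not already set cluster-wide in hive-site.xml).
    SET hive.support.concurrency=true;
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.enforce.bucketing=true;  -- only needed before Hive 2.0

    -- Hypothetical table: in these versions an ACID table must be bucketed,
    -- stored as ORC, and flagged as transactional.
    CREATE TABLE student_acid (
      id   INT,
      name STRING
    )
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    INSERT INTO TABLE student_acid VALUES (1, 'alice'), (2, 'bob');

    -- Row-level changes. The bucketing column (id) cannot be updated.
    UPDATE student_acid SET name = 'alicia' WHERE id = 1;
    DELETE FROM student_acid WHERE id = 2;
    ```

    Each UPDATE or DELETE writes delta files that the metastore compactor merges later, so this is still meant for occasional corrections rather than OLTP-style traffic.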

  • 2020-11-28 19:48

    An upcoming version of Hive is going to allow set-based update/delete handling, which is of utmost importance when doing CRUD operations on a bunch of rows instead of one row at a time.

    In the interim, I have tried a dynamic-partition-based approach, documented here: http://linkd.in/1Fq3wdb

    Please see if it suits your needs.
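
    One way a dynamic-partition approach can look is sketched below. This is an assumption about the general technique, not necessarily what the linked write-up does; the table `events`, its columns `id` and `payload`, and the partition column `dt` are all hypothetical.

    ```sql
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    -- Rewrite only the partition that holds the row to change.
    -- Partitions that the SELECT does not produce are left untouched.
    INSERT OVERWRITE TABLE events PARTITION (dt)
    SELECT id,
           CASE WHEN id = 1 THEN 'corrected' ELSE payload END AS payload,
           dt                      -- dynamic partition column goes last
    FROM events
    WHERE dt = '2015-01-01';
    ```

    Compared with overwriting the whole table, this limits the rewrite to the affected partitions, which matters once the table is large.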

  • 2020-11-28 19:52

    You should not think about Hive as a regular RDBMS; Hive is better suited for batch processing over very large sets of immutable data.

    The following applies to versions prior to Hive 0.14, see the answer by ashtonium for later versions.

    There is no supported operation for deleting or updating a particular record or a particular set of records, and to me needing one is more a sign of a poor schema.

    Here is what you can find in the official documentation:

    Hadoop is a batch processing system and Hadoop jobs tend to have high latency and
    incur substantial overheads in job submission and scheduling. As a result -
    latency for Hive queries is generally very high (minutes) even when data sets
    involved are very small (say a few hundred megabytes). As a result it cannot be
    compared with systems such as Oracle where analyses are conducted on a
    significantly smaller amount of data but the analyses proceed much more
    iteratively with the response times between iterations being less than a few
    minutes. Hive aims to provide acceptable (but not optimal) latency for
    interactive data browsing, queries over small data sets or test queries.
    
    Hive is not designed for online transaction processing and does not offer
    real-time queries and row level updates. It is best used for batch jobs over
    large sets of immutable data (like web logs).
    

    A way to work around this limitation is to use partitions: I don't know what your id corresponds to, but if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop the partitions for the ids you want to get rid of.
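
    A minimal sketch of that redesign, assuming the ids arrive in batches: the table `student_by_batch`, the staging table `student_staging`, and the partition column `batch_id` are hypothetical names. Partitioning by a batch key rather than by every single id keeps the number of partitions manageable.

    ```sql
    -- Hypothetical table partitioned by the batch the rows arrived in.
    CREATE TABLE student_by_batch (
      id   INT,
      name STRING
    )
    PARTITIONED BY (batch_id INT);

    -- Load one batch into its own partition (student_staging is assumed to exist).
    INSERT OVERWRITE TABLE student_by_batch PARTITION (batch_id = 42)
    SELECT id, name
    FROM student_staging
    WHERE batch_id = 42;

    -- "Delete" the whole batch by dropping its partition.
    ALTER TABLE student_by_batch DROP IF EXISTS PARTITION (batch_id = 42);
    ```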

  • 2020-11-28 19:55

    Good news: inserts, updates, and deletes are now possible on Hive/Impala using Kudu.

    You need to use Impala/Kudu to maintain the tables and to insert, update, and delete records. Details with examples can be found here: insert-update-delete-on-hadoop
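
    For orientation, a minimal sketch in Impala SQL (not HiveQL), assuming Impala 2.8 or later with a Kudu cluster attached. The table `student_kudu` and its columns are hypothetical.

    ```sql
    -- Run in impala-shell, not in the Hive CLI.
    -- Kudu tables need a primary key and a partitioning scheme.
    CREATE TABLE student_kudu (
      id   BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 4
    STORED AS KUDU;

    INSERT INTO student_kudu VALUES (1, 'alice'), (2, 'bob');

    UPDATE student_kudu SET name = 'alicia' WHERE id = 1;
    DELETE FROM student_kudu WHERE id = 2;

    -- UPSERT inserts the row, or updates it if the primary key already exists.
    UPSERT INTO student_kudu VALUES (3, 'carol');
    ```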

    Please share the news if you are excited.

    -MIK

  • 2020-11-28 19:58

    To achieve this, run the query below:

    insert overwrite table student
    select * from student
    where id <> 1;
    

    This overwrites the table's contents with all rows except the ones you want to exclude/delete; the table itself is not dropped and recreated.

    I tried this on Hive 1.2.1.
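
    The same overwrite trick can stand in for an update. A minimal sketch, assuming `student` has just the two hypothetical columns `id` and `name`:

    ```sql
    -- "Update" one row by rewriting the whole table and changing
    -- the value only where the condition matches.
    INSERT OVERWRITE TABLE student
    SELECT id,
           CASE WHEN id = 1 THEN 'new name' ELSE name END AS name
    FROM student;
    ```

    The SELECT list must produce every column of the table in its declared order, and the full table is rewritten even when a single row changes.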

  • 2020-11-28 20:02

    You can delete rows from a table using a workaround in which you overwrite the table with the rows you want it to contain after the operation.

    insert overwrite table your_table 
        select * from your_table 
        where id <> 1
    ;
    

    The workaround is useful mostly for bulk deletions of easily identifiable rows. Also, doing this can obviously muck up your data, so backing up the table first is advised, as is care when planning the "deletion" rule.
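
    A cautious version of the workaround might look like the sketch below. The backup table name `your_table_backup` is hypothetical, and the extra `IS NULL` branch is an addition to the original rule: a bare `id <> 1` also drops rows whose id is NULL, because the comparison evaluates to NULL rather than true.

    ```sql
    -- Keep a copy before overwriting anything.
    CREATE TABLE your_table_backup AS
    SELECT * FROM your_table;

    -- Overwrite with everything except id = 1, keeping rows with a NULL id.
    INSERT OVERWRITE TABLE your_table
    SELECT *
    FROM your_table
    WHERE id <> 1 OR id IS NULL;

    -- Sanity check: compare row counts before dropping the backup.
    SELECT COUNT(*) FROM your_table_backup;
    SELECT COUNT(*) FROM your_table;
    ```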
