Sqoop Incremental Import

别那么骄傲 2021-01-30 15:27

Need advice on Sqoop incremental imports. Say I have a customer with Policy 1 on Day 1; I imported those records into HDFS on Day 1 and can see them in the part files.
On Day 2,

8 Answers
  •  没有蜡笔的小新
    2021-01-30 15:51

    Let's take an example. You have a customer table with two columns, custid and policy; custid is the primary key, and you want to import data from custid 100 onward.

    Scenario 1: append new data based on the custid field

    Phase 1:

    The following 3 records were recently inserted into the customer table, and we want to import them into HDFS:

    | custid | Policy |
    |--------|--------|
    | 101    | 1      |
    | 102    | 2      |
    | 103    | 3      |
    

    Here is the Sqoop command for that:

    sqoop import \
    --connect jdbc:mysql://localhost:3306/db \
    --username root -P \
    --table customer \
    --target-dir /user/hive/warehouse// \
    --append \
    --check-column custid \
    --incremental append \
    --last-value 100
    

    Phase 2: the following 4 records were recently inserted into the customer table, and we want to import them into HDFS:

    | custid | Policy |
    |--------|--------|
    | 104    | 4      |
    | 105    | 5      |
    | 106    | 6      |
    | 107    | 7      |
    

    Here is the Sqoop command for that:

    sqoop import \
    --connect jdbc:mysql://localhost:3306/db \
    --username root -P \
    --table customer \
    --target-dir /user/hive/warehouse// \
    --append \
    --check-column custid \
    --incremental append \
    --last-value 103

    So these are the four properties we have to consider for inserting new records:

    --append \
    --check-column  \
    --incremental append \
    --last-value 
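    Rather than tracking `--last-value` by hand between runs, a Sqoop saved job can store and update it automatically in Sqoop's metastore. A minimal sketch (the job name `customer_incr_import` is an illustrative choice; the connection details are the same assumptions as above):

    ```shell
    # Create a saved job; after each run, Sqoop records the highest
    # imported custid so --last-value is managed automatically.
    sqoop job --create customer_incr_import -- import \
      --connect jdbc:mysql://localhost:3306/db \
      --username root -P \
      --table customer \
      --target-dir /user/hive/warehouse// \
      --incremental append \
      --check-column custid \
      --last-value 100

    # Each execution imports only rows newer than the stored last value.
    sqoop job --exec customer_incr_import
    ```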
    

    Scenario 2: append new data + update existing data

    One new record with custid 108 was inserted, and custid 101 and 102 were recently updated in the customer table; we want to bring all of these changes into HDFS:

    | custid | Policy |
    |--------|--------|
    | 108    | 8      |
    | 101    | 11     |
    | 102    | 12     |

    For updates, `--incremental lastmodified` requires a date/timestamp check column rather than the primary key: Sqoop re-imports every row whose timestamp is newer than `--last-value`, and `--merge-key` tells it to merge the new version of each row with the old one instead of appending a duplicate. Assuming the table also carries a last-modified timestamp column (say updated_at, maintained on every insert/update), the command becomes:

    sqoop import \
    --connect jdbc:mysql://localhost:3306/db \
    --username root -P \
    --table customer \
    --target-dir /user/hive/warehouse// \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "2021-01-30 00:00:00" \
    --merge-key custid

    (Here `--last-value` is the timestamp of the previous import.)

    So these are the four properties we have to consider for insert/update records in the same command:

    --merge-key \
    --check-column \
    --incremental lastmodified \
    --last-value
    

    I specifically mention the primary key because, if the table does not have one, a few more properties need to be considered.

    By default, Sqoop runs the job with multiple mappers, and the mappers need the data split on some key, so:

    either we explicitly pass -m 1 to say that a single mapper should perform the whole import,

    or we specify some other column (via the Sqoop option --split-by) whose values can be used to divide the data across mappers.
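    A sketch of both options for a table without a primary key (same illustrative connection details as above; splitting on custid here merely assumes that column exists and is reasonably evenly distributed):

    ```shell
    # Option 1: tell Sqoop which column to partition work on.
    sqoop import \
      --connect jdbc:mysql://localhost:3306/db \
      --username root -P \
      --table customer \
      --target-dir /user/hive/warehouse// \
      --split-by custid

    # Option 2: fall back to a single mapper (no split column needed).
    sqoop import \
      --connect jdbc:mysql://localhost:3306/db \
      --username root -P \
      --table customer \
      --target-dir /user/hive/warehouse// \
      -m 1
    ```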
