I need advice on Sqoop incremental imports.
Say I have a customer with policy 1 on day 1. I imported those records into HDFS that day and can see them in the part files.
On Day 2,
Let's take an example. Suppose you have a customer table with two columns, custid and policy, where custid is the primary key, and you only want to import the records that come after custid 100.
Scenario 1: append new data based on the custid column
Phase 1:
The 3 records below were recently inserted into the customer table and need to be imported into HDFS:
| custid | Policy |
| 101 | 1 |
| 102 | 2 |
| 103 | 3 |
Here is the Sqoop command for that:
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse// \
--append \
--check-column custid \
--incremental append \
--last-value 100
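To see which rows this command picks up, the selection rule of append mode (import rows whose check column is greater than --last-value) can be mimicked locally. This is only a sketch: the row with custid 100 is a made-up stand-in for previously imported data, and select_new_rows and next_last_value are hypothetical helper names, not Sqoop features.

```shell
#!/bin/sh
# Simulated "custid policy" rows. Row 100 is a made-up stand-in for data
# that was already imported; 101-103 are the new rows from phase 1.
rows='100 0
101 1
102 2
103 3'

# Mimic --incremental append: keep rows whose check column (field 1)
# is strictly greater than the given last value.
select_new_rows() {
  awk -v last="$1" '$1 > last { print $1, $2 }'
}

# The largest custid seen becomes the --last-value for the next run.
next_last_value() {
  awk -v max="$1" '$1 > max { max = $1 } END { print max }'
}

new_rows=$(printf '%s\n' "$rows" | select_new_rows 100)
next_last=$(printf '%s\n' "$new_rows" | next_last_value 100)

printf '%s\n' "$new_rows"
echo "next --last-value: $next_last"
```

The value computed at the end is the one to pass as --last-value on the following run.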
Phase 2:
The 4 records below were recently inserted into the customer table and need to be imported into HDFS:
| custid | Policy |
| 104 | 4 |
| 105 | 5 |
| 106 | 6 |
| 107 | 7 |
Here is the Sqoop command for that:
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse// \
--append \
--check-column custid \
--incremental append \
--last-value 103
So these four options need to be considered when importing newly inserted records:
--append
--check-column
--incremental append
--last-value
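Because only --last-value changes from one phase to the next, a saved Sqoop job can keep track of it instead of you editing the command by hand. A minimal sketch, assuming a job name of customer_append (hypothetical) and the same connection details as above. The run wrapper is an illustration helper, not part of Sqoop: it only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: a saved Sqoop job keeps track of --last-value between runs.
# JOB_NAME is a hypothetical name; the connection details mirror the post.
JOB_NAME="customer_append"

# Print the command by default; set RUN=1 to really invoke it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# One-time definition. Everything after the bare "--" is the import itself.
create_job() {
  run sqoop job --create "$JOB_NAME" \
    -- import \
    --connect jdbc:mysql://localhost:3306/db \
    --username root -P \
    --table customer \
    --target-dir /user/hive/warehouse// \
    --check-column custid \
    --incremental append \
    --last-value 100
}

# Run once per import; Sqoop updates the stored last value after a
# successful run, so the next execution continues where this one stopped.
exec_job() {
  run sqoop job --exec "$JOB_NAME"
}

create_job
exec_job
```

With this setup, phase 2 would just be another exec_job call rather than a second hand-written command with --last-value 103.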
Scenario 2: append new data and update existing data based on the custid column
Below, 1 new record with custid 108 was inserted and the records with custid 101 and 102 were recently updated in the customer table, and we want to import these changes into HDFS:
| custid | Policy |
| 108 | 8 |
| 101 | 11 |
| 102 | 12 |
sqoop import \
--connect jdbc:mysql://localhost:3306/db \
--username root -P \
--table customer \
--target-dir /user/hive/warehouse// \
--append \
--check-column custid \
--incremental lastmodified \
--last-value 107
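So these four options need to be considered when handling inserts and updates in the same command:
--append
--check-column
--incremental lastmodified
--last-value
Note that lastmodified mode compares timestamps: --check-column has to be a date or timestamp column that is set whenever a row is inserted or updated, and --last-value has to be the time of the previous import, so custid and 107 would need to be replaced by such a column and its value. With --append the updated rows 101 and 102 are added alongside their old copies; use --merge-key custid instead if the new versions should replace the old ones.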
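A sketch of what a timestamp-based variant of the insert/update import might look like. It assumes the customer table has a timestamp column, called last_updated here (a hypothetical name, since the example table only has custid and policy), and that the timestamp of the previous import is known. The function only prints the command, and the timestamp in the usage line is made up.

```shell
#!/bin/sh
# Sketch: incremental import that also picks up updated rows.
# Assumption: the customer table has a timestamp column (called last_updated
# here, a hypothetical name) that is set whenever a row is inserted or updated.

# $1 = timestamp of the previous import (the value used below is made up)
lastmodified_import_cmd() {
  printf 'sqoop import'
  printf ' %s' \
    --connect jdbc:mysql://localhost:3306/db \
    --username root -P \
    --table customer \
    --target-dir /user/hive/warehouse// \
    --check-column last_updated \
    --incremental lastmodified
  # Rows whose last_updated is newer than this timestamp are imported.
  printf ' --last-value "%s"' "$1"
  # Merge on the primary key so updated rows replace their old copies.
  printf ' --merge-key custid\n'
}

lastmodified_import_cmd "2024-01-02 00:00:00"
```

Using --merge-key rather than --append is a design choice: appending keeps both the old and the new version of custid 101 and 102 in HDFS, while merging leaves one row per custid.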
I specifically mention the primary key because, if the table does not have one, a few more options need to be considered:
By default Sqoop runs the import with multiple mappers, and the mappers need the data to be split on some key, so
either specify -m 1 (or --num-mappers 1) so that only one mapper performs the import,
or specify another column with the --split-by option, one that uniquely identifies the rows, which Sqoop can then use to split the data.
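The two alternatives for a table without a primary key can be sketched as follows. The helper names are hypothetical and the functions only print the commands; custid is used as the split column purely for illustration.

```shell
#!/bin/sh
# Sketch: two ways to import a table that has no primary key.

# Common part of the import command, mirroring the post's connection details.
base_cmd() {
  printf '%s' "sqoop import --connect jdbc:mysql://localhost:3306/db --username root -P --table customer --target-dir /user/hive/warehouse//"
}

# Option 1: a single mapper, so no split column is needed.
single_mapper_cmd() {
  echo "$(base_cmd) -m 1"
}

# Option 2: keep parallel mappers and name the split column explicitly.
# $1 = column to split on, ideally one whose values are spread evenly.
split_by_cmd() {
  echo "$(base_cmd) --split-by $1"
}

single_mapper_cmd
split_by_cmd custid
```

A single mapper is the simpler choice for small tables; --split-by keeps the import parallel for larger ones.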