Sqoop incremental import: how does it handle data when a row is deleted from the database?

你离开我真会死。 posted on 2019-12-11 03:54:31

Question


Suppose I have an employee table with columns (emp_id, emp_name, emp_age, emp_update_ts), where the emp_update_ts field is automatically set to the current timestamp every time a row is updated. Now my question is:

When I update or insert a row in the table and run an incremental Sqoop import with lastmodified, my database and HDFS stay in sync. But what if a row is deleted from the table and I then run the Sqoop incremental import with the lastmodified option? Will it also delete the data from HDFS to keep them in sync, or is there another way to handle deletes?


Answer 1:


Sqoop incremental import does not support "DELETE" per se unless the records are marked as "deleted" in the source system, as a very interesting document produced by Hortonworks explains.
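To see why deletes slip through, here is a minimal Python sketch of the selection logic, assuming lastmodified mode only picks up rows whose check column is newer than the last imported value and then merges them by key. The function names and the sample rows are hypothetical; this is not Sqoop's code, just an illustration of the behavior described above.

```python
from datetime import datetime


def select_lastmodified(source_rows, last_value):
    """Keep only rows whose timestamp is newer than last_value.

    Assumption: this mirrors the row filter of a lastmodified import.
    A deleted row no longer exists in source_rows, so it can never match.
    """
    return [row for row in source_rows if row["emp_update_ts"] > last_value]


def merge_by_key(target_rows, imported_rows, key="emp_id"):
    """Upsert imported rows into the existing copy; nothing is ever removed."""
    merged = {row[key]: row for row in target_rows}
    for row in imported_rows:
        merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


last_value = datetime(2019, 1, 1)

# Copy already sitting in HDFS after the previous import (hypothetical data).
hdfs_copy = [
    {"emp_id": 1, "emp_name": "Ann", "emp_age": 30, "emp_update_ts": datetime(2018, 6, 1)},
    {"emp_id": 2, "emp_name": "Bob", "emp_age": 41, "emp_update_ts": datetime(2018, 7, 1)},
    {"emp_id": 3, "emp_name": "Cy", "emp_age": 25, "emp_update_ts": datetime(2018, 8, 1)},
]

# Source table now: emp_id 2 was deleted, emp_id 3 was updated.
source_table = [
    {"emp_id": 1, "emp_name": "Ann", "emp_age": 30, "emp_update_ts": datetime(2018, 6, 1)},
    {"emp_id": 3, "emp_name": "Cy", "emp_age": 26, "emp_update_ts": datetime(2019, 2, 1)},
]

imported = select_lastmodified(source_table, last_value)
hdfs_after = merge_by_key(hdfs_copy, imported)

# Keys still present in HDFS that no longer exist in the source.
stale_ids = sorted(
    {row["emp_id"] for row in hdfs_after} - {row["emp_id"] for row in source_table}
)
print("imported ids:", [row["emp_id"] for row in imported])
print("stale ids left in HDFS:", stale_ids)
```

The update to emp_id 3 is picked up, but the deleted emp_id 2 stays in the HDFS copy because no row exists in the source to signal its removal.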

So I use incremental import for any tables that never have records deleted. You would be surprised how many of these exist in most transactional databases.

When I do have tables that might have deleted rows (and I have no deletion indicator), I import the entire dataset into the staging area and then run a "diff" query in Hive to identify the deleted records.

See my post on the HQL query that does this "diff".
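The HQL from that post is not reproduced here. As a rough sketch of the same idea in Python, the "diff" boils down to an anti-join on the primary key: any key present in the existing HDFS copy but missing from the freshly imported full snapshot was deleted at the source. The function names and sample keys below are hypothetical.

```python
def find_deleted_keys(existing_keys, staging_keys):
    """Return keys that exist in the current copy but not in the fresh full import."""
    return sorted(set(existing_keys) - set(staging_keys))


def drop_deleted(existing_rows, deleted_keys, key="emp_id"):
    """Return the existing rows with the deleted keys filtered out."""
    deleted = set(deleted_keys)
    return [row for row in existing_rows if row[key] not in deleted]


# Hypothetical data: the current HDFS copy and a full re-import into staging.
existing_rows = [
    {"emp_id": 1, "emp_name": "Ann"},
    {"emp_id": 2, "emp_name": "Bob"},
    {"emp_id": 3, "emp_name": "Cy"},
]
staging_rows = [
    {"emp_id": 1, "emp_name": "Ann"},
    {"emp_id": 3, "emp_name": "Cy"},
]

deleted_keys = find_deleted_keys(
    [row["emp_id"] for row in existing_rows],
    [row["emp_id"] for row in staging_rows],
)
cleaned_rows = drop_deleted(existing_rows, deleted_keys)
print("deleted at source:", deleted_keys)
print("remaining ids:", [row["emp_id"] for row in cleaned_rows])
```

In Hive the same comparison would typically be written as a left outer join from the existing table to the staging table, keeping rows where the staging key is NULL.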




Answer 2:


Sqoop is just an import/export tool. It does not check for deleted data. Once your import/export process has completed, you have to compare the two datasets yourself by writing a query.

Or, if your dataset is small, you can import it as CSV through Impala and check.




Answer 3:


If you are using MySQL, you can try parsing the MySQL binlog to detect deleted rows, then combine the binlog with Sqoop to extract data incrementally.



Source: https://stackoverflow.com/questions/41380740/sqoop-incremental-import-how-it-handles-the-data-when-a-row-is-deleted-from-the
