Sqoop import : composite primary key and textual primary key

Posted by 夙愿已清 on 2019-11-27 01:48:14

Specify the split column manually with --split-by. The split column does not have to be the PK: you can have a composite PK and split on some other int column. You can use any integer column, or even a simple function of a column (a row-level function such as substring or cast, not an aggregate or analytic function). Preferably, the split column should be an evenly distributed integer.
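To make the "evenly distributed" advice concrete, here is a minimal Python sketch. It assumes that integer splitting cuts the [min, max] range of the split column into equal-width intervals, one per mapper, which is a simplification of what Sqoop's integer splitter does with the result of its MIN/MAX boundary query. The function names and the sample data are made up for illustration and are not part of Sqoop.

```python
from bisect import bisect_left


def split_points(min_val, max_val, num_mappers):
    """Equal-width boundaries over [min_val, max_val] (simplified model)."""
    width = (max_val - min_val) / num_mappers
    points = [min_val + i * width for i in range(num_mappers)]
    points.append(max_val)
    return points


def rows_per_mapper(values, num_mappers):
    """Count how many rows land in each equal-width interval."""
    values = sorted(values)
    points = split_points(values[0], values[-1], num_mappers)
    counts = []
    for i in range(num_mappers):
        lo = bisect_left(values, points[i])
        if i == num_mappers - 1:
            # The last interval is closed on the right so the max is included.
            hi = len(values)
        else:
            hi = bisect_left(values, points[i + 1])
        counts.append(hi - lo)
    return counts


# Illustrative data: 3 sentinel rows with -1 plus 10,000 rows packed into
# the top eighth of the value range, versus 10,000 evenly spread rows.
skewed = [-1] * 3 + list(range(70_000_000, 80_000_000, 1_000))
even = list(range(0, 80_000_000, 8_000))

print(rows_per_mapper(skewed, 8))
print(rows_per_mapper(even, 8))
```

In this model the split depends only on the minimum and maximum values, not on where the rows actually sit, so a handful of outliers is enough to leave most mappers idle.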

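For example, if your split column contains a few rows with the value -1 and 10M rows with values in the range 70,000,000 - 80,000,000, and num-mappers=8, then Sqoop divides the range between the minimum and maximum values into equal-width intervals, so the dataset is split between the mappers unevenly: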

  • the 1st mapper will get the few rows with -1,
  • the 2nd through 7th mappers will get 0 rows,
  • the 8th mapper will get almost 10M rows.

This results in data skew, and the 8th mapper will run for a very long time or even fail. I have also gotten duplicates when using a non-integer split column with MS-SQL. So use an integer split column. In your case, with a table that has only two varchar columns, you can either

(1) add a surrogate int PK and use it as the split column as well, or

(2) split your data manually using a custom query with a WHERE clause and run Sqoop several times with num-mappers=1, or

(3) apply some deterministic, non-aggregate integer function to your varchar column, for example cast(substr(...) as int), second(timestamp_col) or datepart(second, date), and use the result as the split column.
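As a rough sketch of option (2), the Python snippet below builds one single-mapper `sqoop import` command per hand-picked slice of a varchar column. It only prints the commands and does not run them. The connection string, table name, column name, slice boundaries and target directories are all hypothetical. The flags used (`--connect`, `--query`, `--num-mappers`, `--target-dir`) and the `$CONDITIONS` placeholder are standard Sqoop 1 import options, but check them against your Sqoop version.

```python
import shlex

# Hypothetical names, for illustration only.
CONNECT = "jdbc:mysql://db.example.com/sales"
TABLE = "customers"
KEY_COL = "customer_code"  # varchar column used to slice the data by hand


def manual_slice_commands(bounds, target_root):
    """Build one `sqoop import` argv per half-open slice [lo, hi) of KEY_COL."""
    edges = [None] + list(bounds) + [None]  # None means "unbounded"
    commands = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        preds = []
        if lo is not None:
            preds.append(f"{KEY_COL} >= '{lo}'")
        if hi is not None:
            preds.append(f"{KEY_COL} < '{hi}'")
        # A free-form query must contain the literal $CONDITIONS token.
        preds.append("$CONDITIONS")
        query = f"SELECT * FROM {TABLE} WHERE " + " AND ".join(preds)
        commands.append([
            "sqoop", "import",
            "--connect", CONNECT,
            "--query", query,
            "--num-mappers", "1",
            # Each run writes to its own directory so the runs do not collide.
            "--target-dir", f"{target_root}/part_{i}",
        ])
    return commands


for cmd in manual_slice_commands(["h", "p"], "/data/customers"):
    print(shlex.join(cmd))
```

Because every run uses a single mapper, no split column is needed at all. The price is that you choose the slice boundaries yourself and have to make sure they cover the whole table without overlapping.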
