Sqoop - Data splitting

小鲜肉 2020-12-02 01:14

Sqoop is able to import data from multiple tables using the --query clause, but it is not clear whether it can import the query below.

Select deptid

2 Answers
  •  一个人的身影
    2020-12-02 01:46

    There is some gap in your understanding.

    First of all, the degree of parallelism is controlled by -m or --num-mappers. By default, the value of --num-mappers is 4.

    Second, --split-by names the column on which Sqoop splits the work.

    Third, $CONDITIONS is a placeholder that Sqoop uses internally to carry out this splitting.

    For example, suppose you run this command:

    sqoop import --connect jdbc:mysql://myserver:1202/ --username u1 --password p1 --query 'select * from emp where $CONDITIONS' --split-by empId --target-dir /temp/emp -m 4

    Say empId is uniformly distributed from 1 to 100.

    Now, Sqoop will take the --split-by column and find its min and max values using this query:

    SELECT MIN(empId), MAX(empId) FROM (Select * From emp WHERE (1 = 1) ) t1

    Note that it replaced $CONDITIONS with (1 = 1).

    In our case, the min and max values are 1 and 100.
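
    As a rough illustration of that step, here is a minimal Python sketch of how the boundary query could be assembled from the free-form query. The function name bounding_query is hypothetical and this is not Sqoop's actual code; it only mirrors the substitution described above.

```python
# Hypothetical helper (not part of Sqoop): wraps a free-form query so that
# the range of the split column can be looked up.
def bounding_query(query, split_by):
    # Replace the placeholder with an always-true condition so that the
    # whole result set is considered when computing MIN and MAX.
    inner = query.replace("$CONDITIONS", "(1 = 1)")
    return f"SELECT MIN({split_by}), MAX({split_by}) FROM ({inner}) t1"


bounds_sql = bounding_query("select * from emp where $CONDITIONS", "empId")
print(bounds_sql)
```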

    As the number of mappers is 4, Sqoop will divide the query into 4 parts:

    Creating input split with lower bound 'empId >= 1' and upper bound 'empId < 25'

    Creating input split with lower bound 'empId >= 25' and upper bound 'empId < 50'

    Creating input split with lower bound 'empId >= 50' and upper bound 'empId < 75'

    Creating input split with lower bound 'empId >= 75' and upper bound 'empId <= 100'
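
    The idea behind these splits can be sketched in Python as follows. This is a simplified approximation, assuming an integer split column and a range at least as wide as the number of mappers; the function names are made up for illustration, and the boundaries it computes need not match the round numbers in the log lines above or the output of any particular Sqoop version.

```python
# Simplified sketch of range splitting (an assumption, not Sqoop's real splitter).
def integer_splits(lo, hi, num_mappers):
    """Return (lower, upper) pairs that cover lo..hi in num_mappers pieces."""
    step, remainder = divmod(hi - lo, num_mappers)
    points = [lo]
    for i in range(num_mappers):
        # Hand out the leftover units one at a time to the first ranges.
        nxt = points[-1] + step + (1 if i < remainder else 0)
        points.append(min(nxt, hi))
    return list(zip(points[:-1], points[1:]))


def range_predicates(column, ranges):
    """Turn (lower, upper) pairs into SQL range conditions."""
    predicates = []
    for i, (low, high) in enumerate(ranges):
        # Only the last range includes its upper bound, so no row is lost.
        op = "<=" if i == len(ranges) - 1 else "<"
        predicates.append(f"{column} >= {low} AND {column} {op} {high}")
    return predicates


ranges = integer_splits(1, 100, 4)
predicates = range_predicates("empId", ranges)
for p in predicates:
    print(p)
```

    Each value from 1 to 100 falls into exactly one of the generated ranges, which is the property that lets the mappers work independently.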

    Now $CONDITIONS comes into the picture again: it is replaced by the range conditions above.

    The first mapper will fire a query like this:

    Select * From emp WHERE empId >= 1 AND empId < 25

    and so on for the other 3 mappers.
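
    To make the substitution concrete, the following Python sketch expands the example query once per mapper. It is only an illustration of plain placeholder replacement, using the four ranges from the split log lines above; the exact SQL text that Sqoop generates may be formatted differently.

```python
# The free-form query from the example command.
QUERY = "select * from emp where $CONDITIONS"

# Range conditions copied from the four input splits listed earlier.
SPLITS = [
    "empId >= 1 AND empId < 25",
    "empId >= 25 AND empId < 50",
    "empId >= 50 AND empId < 75",
    "empId >= 75 AND empId <= 100",
]

# One query per mapper: the placeholder is swapped for that mapper's range.
mapper_queries = [QUERY.replace("$CONDITIONS", cond) for cond in SPLITS]

for number, sql in enumerate(mapper_queries, start=1):
    print(f"mapper {number}: {sql}")
```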

    The results from all the mappers are then collected in the target HDFS directory, with each mapper writing its own output file.

    Regarding your query:

    select deptid, avg(salary) from emp group by deptid

    you would specify

    --query 'select deptid, avg(salary) from emp group by deptid where $CONDITIONS'

    It will first be converted to

    select deptid, avg(salary) from emp group by deptid where (1 = 0)

    to fetch column metadata.

    I believe this query won't run in an RDBMS, because a WHERE clause cannot follow GROUP BY. Try the query above (with WHERE (1 = 0)) directly in MySQL.
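
    If no MySQL server is at hand, the same check can be approximated with Python's built-in sqlite3 module. This is only a stand-in: SQLite is not MySQL, and the emp table and its rows below are made up for the test. It does show that a WHERE placed after GROUP BY is rejected as a syntax error, while the usual clause order parses; it says nothing about how Sqoop would split such a query.

```python
import sqlite3


def try_query(conn, sql):
    """Return None if the statement runs, otherwise the error message."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)


# In-memory database with a made-up emp table (an assumption for this test).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empId INTEGER, deptid INTEGER, salary REAL)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 200.0), (3, 20, 300.0)],
)

# The metadata query as generated from the --query string above.
where_after_group_by = (
    "select deptid, avg(salary) from emp group by deptid where (1 = 0)"
)
# The same clauses in standard SQL order, for comparison only.
where_before_group_by = (
    "select deptid, avg(salary) from emp where (1 = 0) group by deptid"
)

bad_error = try_query(conn, where_after_group_by)
good_error = try_query(conn, where_before_group_by)
print("WHERE after GROUP BY :", bad_error)
print("WHERE before GROUP BY:", good_error)
```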

    So you will not be able to use this query to fetch data using Sqoop.

    Sqoop is used for simpler SQL queries.
