What is the difference between --split-by and --boundary-query in SQOOP?

Question

Assuming we don't have a column where values are equally distributed, let's say we have a command like this:

sqoop import \
...
--boundary-query "SELECT min(id), max(id) from some_table" \
--split-by id \
...

What's the point of using --boundary-query here when --split-by does the same thing? Is there another way to use --boundary-query? Or is there a more efficient way to split the data when there is no unique key column?

Answer 1:

--split-by id splits the range of id values evenly across the mappers (4 by default).

By default, the boundary query looks something like this:

--boundary-query "SELECT min(id), max(id) from some_table"

But if you already know that id starts at val1 and ends at val2, there is no point in computing min() and max(). Skipping them makes the sqoop command run faster.

You can specify any arbitrary query returning val1 and val2.
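As a rough sketch of that idea, the boundary query can be a constant SELECT that returns the two known values. The connection string, the table name, and the range 1 to 200 are hypothetical placeholders. A SELECT without FROM is valid on MySQL, but other databases may need a different form. The command is only printed here, not executed, so the snippet runs without a sqoop install.

```shell
# Hypothetical known bounds of the id column (assumed, not measured).
LOW=1
HIGH=200

# Constant boundary query: returns the two values without scanning the table.
BOUNDARY_QUERY="SELECT ${LOW}, ${HIGH}"

# Assemble the command as text; connection string and table are placeholders.
SQOOP_CMD="sqoop import \
--connect jdbc:mysql://localhost/dbName \
--table some_table \
--split-by id \
--boundary-query \"${BOUNDARY_QUERY}\" \
-m 4"

# Print the command for inspection instead of running it.
echo "$SQOOP_CMD"
```

If the hardcoded range is narrower than the real data, rows outside it may not be imported, so this only makes sense when the bounds are actually known.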

Edit:

Right now (1.4.7) there is no way in sqoop to specify uneven partitions for splitting.

For example, you have data like:

1,2,3,51,52,191,192,193,194,195,196,197,198,199,200

If you define 4 mappers in the command, sqoop checks the min and max, which are 1 and 200 in our case.

Then it splits the range into 4 parts:

1-50
51-100
101-150
151-200

Yes, in this case the 3rd mapper (101-150) gets nothing from the RDBMS table.
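The even split described above can be modelled with a few lines of shell arithmetic. This is a simplified illustration of the idea, not sqoop's actual splitter code, and it assumes the range divides evenly by the number of mappers (as 1 to 200 does with 4). It counts how many of the sample ids land in each range.

```shell
# Sample ids from the example above.
IDS="1 2 3 51 52 191 192 193 194 195 196 197 198 199 200"
MIN=1
MAX=200
MAPPERS=4

# Width of each range (assumes the range divides evenly, as it does here).
SIZE=$(( (MAX - MIN + 1) / MAPPERS ))

RESULT=""
i=0
while [ "$i" -lt "$MAPPERS" ]; do
  lo=$(( MIN + i * SIZE ))
  hi=$(( lo + SIZE - 1 ))

  # Count the sample ids that fall inside this mapper's range.
  count=0
  for id in $IDS; do
    if [ "$id" -ge "$lo" ] && [ "$id" -le "$hi" ]; then
      count=$(( count + 1 ))
    fi
  done

  line="mapper $(( i + 1 )): ${lo}-${hi} -> ${count} rows"
  echo "$line"
  RESULT="${RESULT}${line};"
  i=$(( i + 1 ))
done
```

With this data the ranges are equal in width but not in row count, which is the skew the example is pointing at.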

But there is no way to define custom partitions like:

1-10
51-60
190-200

For large data (billions of rows), it is not practical to find exact values like these, or to use another tool to discover the data pattern first and then prepare custom partitions.

Answer 2:

--split-by: For free-form query imports, you need to specify --split-by. When you import the result of a query, sqoop needs to know which column to use to create splits. When importing a table, if --split-by is not specified, sqoop uses the table's primary key to create splits. If your primary key values are unevenly distributed, you can specify another column with --split-by.
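A minimal sketch of a free-form query import with --split-by follows. The connection string, table, and target directory are hypothetical placeholders. The query must contain the literal token $CONDITIONS, which sqoop replaces with each mapper's range predicate, so it is kept in single quotes to stop the shell from expanding it. A free-form query import also needs --target-dir. The command is only printed here, not executed.

```shell
# Single quotes keep $CONDITIONS literal; sqoop substitutes it per mapper.
QUERY='SELECT * FROM some_table WHERE $CONDITIONS'

# Assemble the command as text; all names and paths are placeholders.
SQOOP_CMD="sqoop import \
--connect jdbc:mysql://localhost/dbName \
--query '${QUERY}' \
--split-by id \
--target-dir /user/hypothetical/some_table \
-m 4"

# Print the command for inspection instead of running it.
echo "$SQOOP_CMD"
```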

--boundary-query: During the import, sqoop uses this query to calculate the boundaries for creating splits: select min(split-by), max(split-by) from table_name.

In some cases this query is not optimal, so you can specify any arbitrary query that returns two numeric columns with the --boundary-query argument. This avoids the min(split-by) and max(split-by) operations and is therefore more efficient.

Answer 3:

I did not find what I was expecting in the other answers.

--split-by:

I would say --split-by is mostly used when a table has no primary key; sqoop will normally emit an error message if the table has no primary key. --split-by names another column to use for computing the min() and max() in the absence of a primary key. Some requirements are:

The column should have numeric values
The column should not contain nulls
etc.

Use --split-by only on indexed columns, for performance reasons. If you have to import data from multiple tables, it is naturally hard to inspect all of them to see which have primary keys and which don't. In that case, use --autoreset-to-one-mapper alongside the mapper count given with -m #of_mappers, so your command will look like this:

sqoop-import --connect jdbc:mysql://localhost/dbName --table sometable --username uname --warehouse-dir whdir --autoreset-to-one-mapper -m 5

Tables without a primary key will use one mapper (sequentially), and those with primary keys will use 5 mappers as specified. You cannot use both --autoreset-to-one-mapper and --split-by in one command.

--boundary-query:

If you know the min and max values of a table, you can skip the default computation that fetches them and simply hardcode them as the argument to --boundary-query. @burakongun explained this well.

Source: https://stackoverflow.com/questions/40838036/what-is-the-difference-between-split-by-and-boundary-query-in-sqoop

Tags

split

sqoop

boundary