FetchMode Join vs SubSelect

问题

I have two tables Employee and Department following are the entity classes for both of them

Department.java
@Entity
@Table(name = "DEPARTMENT")
public class Department {
    @Id
    @Column(name = "DEPARTMENT_ID")
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Integer departmentId;
    @Column(name = "DEPARTMENT_NAME")
    private String departmentName;
    @Column(name = "LOCATION")
    private String location;

    @OneToMany(cascade = CascadeType.ALL, mappedBy = "department", orphanRemoval = true)
    @Fetch(FetchMode.SUBSELECT)
    //@Fetch(FetchMode.JOIN)
    private List<Employee> employees = new ArrayList<>();
}


Employee.java
@Entity
@Table(name = "EMPLOYEE")
public class Employee {
    @Id
    @SequenceGenerator(name = "emp_seq", sequenceName = "seq_employee")
    @GeneratedValue(generator = "emp_seq")
    @Column(name = "EMPLOYEE_ID")
    private Integer employeeId;
    @Column(name = "EMPLOYEE_NAME")
    private String employeeName;

    @ManyToOne
    @JoinColumn(name = "DEPARTMENT_ID")
    private Department department;
}

Below are the queries fired when I did em.find(Department.class, 1);

-- fetch mode = fetchmode.join

    SELECT department0_.DEPARTMENT_ID AS DEPARTMENT_ID1_0_0_,
      department0_.DEPARTMENT_NAME    AS DEPARTMENT_NAME2_0_0_,
      department0_.LOCATION           AS LOCATION3_0_0_,
      employees1_.DEPARTMENT_ID       AS DEPARTMENT_ID3_1_1_,
      employees1_.EMPLOYEE_ID         AS EMPLOYEE_ID1_1_1_,
      employees1_.EMPLOYEE_ID         AS EMPLOYEE_ID1_1_2_,
      employees1_.DEPARTMENT_ID       AS DEPARTMENT_ID3_1_2_,
      employees1_.EMPLOYEE_NAME       AS EMPLOYEE_NAME2_1_2_
    FROM DEPARTMENT department0_
    LEFT OUTER JOIN EMPLOYEE employees1_
    ON department0_.DEPARTMENT_ID   =employees1_.DEPARTMENT_ID
    WHERE department0_.DEPARTMENT_ID=?

-- fetch mode = fetchmode.subselect

    SELECT department0_.DEPARTMENT_ID AS DEPARTMENT_ID1_0_0_,
      department0_.DEPARTMENT_NAME    AS DEPARTMENT_NAME2_0_0_,
      department0_.LOCATION           AS LOCATION3_0_0_
    FROM DEPARTMENT department0_
    WHERE department0_.DEPARTMENT_ID=?

    SELECT employees0_.DEPARTMENT_ID AS DEPARTMENT_ID3_1_0_,
      employees0_.EMPLOYEE_ID        AS EMPLOYEE_ID1_1_0_,
      employees0_.EMPLOYEE_ID        AS EMPLOYEE_ID1_1_1_,
      employees0_.DEPARTMENT_ID      AS DEPARTMENT_ID3_1_1_,
      employees0_.EMPLOYEE_NAME      AS EMPLOYEE_NAME2_1_1_
    FROM EMPLOYEE employees0_
    WHERE employees0_.DEPARTMENT_ID=?

I just wanted to know which one should we prefer FetchMode.JOIN or FetchMode.SUBSELECT? which one should we opt in which scenario?

回答1:

The SUBQUERY strategy that Marmite refers to is related to FetchMode.SELECT, not SUBSELECT.

The console output that you've posted about fetchmode.subselect is curious because this is not the way that is supposed to work.

The FetchMode.SUBSELECT

use a subselect query to load the additional collections

Hibernate docs:

If one lazy collection or single-valued proxy has to be fetched, Hibernate will load all of them, re-running the original query in a subselect. This works in the same way as batch-fetching but without the piecemeal loading.

FetchMode.SUBSELECT should look something like this:

SELECT <employees columns>
FROM EMPLOYEE employees0_
WHERE employees0_.DEPARTMENT_ID IN
(SELECT department0_.DEPARTMENT_ID FROM DEPARTMENT department0_)

You can see that this second query will bring to memory all the employees that belongs to some departament (i.e. employee.department_id is not null), it doesn't matter if it is not the department that you retrieve in your first query. So this is potentially a major issue if the table of employees is large because it may be accidentially loading a whole database into memory.

However, FetchMode.SUBSELECT reduces significatly the number of queries because takes only two queries in comparisson to the N+1 queries of the FecthMode.SELECT.

You may be thinking that FetchMode.JOIN makes even less queries, just 1, so why use SUBSELECT at all? Well, it's true but at the cost of duplicated data and a heavier response.

If a single-valued proxy has to be fetched with JOIN, the query may retrieve:

+---------------+---------+-----------+
| DEPARTMENT_ID | BOSS_ID | BOSS_NAME |
+---------------+---------+-----------+
|             1 |       1 | GABRIEL   |
|             2 |       1 | GABRIEL   |
|             3 |       2 | ALEJANDRO |
+---------------+---------+-----------+

The employee data of the boss is duplicated if he directs more than one department and it has a cost in bandwith.

If a lazy collection has to be fetched with JOIN, the query may retrieve:

+---------------+---------------+-------------+
| DEPARTMENT_ID | DEPARTMENT_ID | EMPLOYEE_ID |
+---------------+---------------+-------------+
|             1 | Sales         | GABRIEL     |
|             1 | Sales         | ALEJANDRO   |
|             2 | RRHH          | DANILO      |
+---------------+---------------+-------------+

The department data is duplicated if it contains more than one employee (the natural case). We don't only suffer a cost in bandwidth but also we get duplicate duplicated Department objects and we must use a SET or DISTINCT_ROOT_ENTITY to de-duplicate.

However, duplicate data in pos of a lower latency is a good trade off in many cases, like Markus Winand says.

An SQL join is still more efficient than the nested selects approach—even though it performs the same index lookups—because it avoids a lot of network communication. It is even faster if the total amount of transferred data is bigger because of the duplication of employee attributes for each sale. That is because of the two dimensions of performance: response time and throughput; in computer networks we call them latency and bandwidth. Bandwidth has only a minor impact on the response time but latencies have a huge impact. That means that the number of database round trips is more important for the response time than the amount of data transferred.

So, the main issue about using SUBSELECT is that is hard to control and may be loading a whole graph of entities into memory. With Batch fetching you fetch the associated entity in a separate query as SUBSELECT (so you don't suffer duplicates), gradually and most important you query only related entities (so you don't suffer from potentially load a huge graph) because the IN subquery is filtered by the IDs retrieved by the outter query).

Hibernate: 
    select ...
    from mkyong.stock stock0_

Hibernate: 
    select ...
    from mkyong.stock_daily_record stockdaily0_ 
    where
        stockdaily0_.STOCK_ID in (
            ?, ?, ?, ?, ?, ?, ?, ?, ?, ?
        )

(It may be interesting test if Batch fetching with a very high batch size would act like a SUBSELECT but without the issue of load the whole table)

A couple of posts showing the different fetching strategies and the SQL logs (very important):

Hibernate – fetching strategies examples
Hibernate FetchMode explained by example
Investigating Hibernate fetch strategies – A tutorial

Summary:

JOIN: avoids the major issue of N+1 queries but it may retrieve data duplicated.
SUBSELECT: avoids N+1 too and doesn't duplicate data but it loads all the entities of the associated type into memory.

The tables were built using ascii-tables.

回答2:

I'd say it depends...

Let assume you have N employees in a department, that contains D bytes of information and an average employee consist of E bytes. (Bytes are sum of the attribute length with some overhead).

Using the join strategy you perform 1 query and transfers N * (D + E) data.

Using the subquery strategy you perform 1 + N queries, but transfers only D + N*E data.

Typically the N+1 query is the NO GO if the N is large, so the JOIN is preferred.

But actually you must check your mileage between number of queries and data transfer.

Note that I'm not considering other aspects as Hibernate caching.

Additional subtle aspect could be valid if the employee table is large and partitioned - partition pruning on the index access comes to the consideration as well.

回答3:

A customer (financial services) of mine had a similar problem, and he wanted to "acquire the data in a single query". Well, I explained that it is better to have more than one query, because of the following:

For FetchMode.JOIN the department would be transferred from the database to the application once per employee, because the join operation results in multiplying the department per employee. If you have 10 departments with 100 employees each, every of these 10 departments would be transferred 100 times within one query, simple SQL. So each department, in this case, is transferred 99 times more often than necessary, causing a data-transfer-overhead for the department.

For Fetchmode SUBSELECT two queries are fired to the database. One would be used to get the data of the 1000 employes, one to get the 10 departments. This, for me, sounds much more efficient. For sure you would make sure that indices are in place so that data can be retrieved immediately.

I would prefer FetchMode.SUBSELECT.

It would be another case if each department has only one employee, but, as the name "department" suggests, this would very unlikely be the case.

I suggest measuring of the access times to support this theory. For my customer I did measurements for different types of accesses, and the "department" table for my customer had many more fields (I did not design it, though). So it was soon clearly evident that the FetchMode.SUBSELECT was much faster.

回答4:

Planky said

(1) This is grossly misleading. (2) The subselect will not fetch your whole database into memory. The linked article is about a quirk where subselect (3) ignores paging commands from the parent, (4) but it is still a subselect.

After your comment I investigated again about FetchMode.SUBSELECT and I found out that my answer is not entirely correct.
This was an hypothetical situation where the hydration of each entity that was being entirely loaded into memory (Employee in this case) will ending hydrating many other entities. The true problem is loading the entire table being sub-selected if that table contains thousands of rows (even if each one of those doesn't fetch eagerly other entities from other tables).
I don't know what you mean with paging commands from the parent.
Yes, it is still a subselect, but I don't know what you are trying to point out with this.

The console output that you've posted about fetchmode.subselect is curious because this is not the way that is supposed to work.

This is true but only when there is more than Department entity hidrated (what means more than one employees collection uninitialized), I've tested it with 3.6.10.Final and 4.3.8.Final In scenarios 2.2 (FetchMode.SUBSELECT hidrating 2 of 3 Departments) and 3.2 (FetchMode.SUBSELECT hidrating all Departments), SubselectFetch.toSubselectString returns the following (the links to Hibernate classes are taken from the 4.3.8.Final tag):

select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_

This subquery is after used to build the where clause by OneToManyJoinWalker.initStatementString ending with

employees0_.DEPARTMENT_ID in (select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_)

Then the where clause is added in CollectionJoinWalker.whereString ending with

select employees0_.DEPARTMENT_ID as DEPARTMENT3_2_1_, employees0_.EMPLOYEE_ID as EMPLOYEE1_1_, employees0_.EMPLOYEE_ID as EMPLOYEE1_3_0_, employees0_.DEPARTMENT_ID as DEPARTMENT3_3_0_, employees0_.EMPLOYEE_NAME as EMPLOYEE2_3_0_ from SUBSELECT_EMPLOYEE employees0_ where employees0_.DEPARTMENT_ID in (select this_.DEPARTMENT_ID from SUBSELECT_DEPARTMENT this_)

Whit this query, in both cases all Employees are being retrieved and hydrated. This is clearly an issue in scenario 2.2 because we are hydrating only Departments 1 and 2 but also hydrating all Employees even if they don't belong to those Departments (in this case Employees of Department 3).

If there is only one Department entity hydrated in the session with its employees collection uninitialized, then the query is like the one eatSleepCode wrote. Check scenario 1.2

select subselectd0_.department_id as departme1_2_0_, subselectd0_.department_name as departme2_2_0_, subselectd0_.location as location3_2_0_ from subselect_department subselectd0_ where subselectd0_.department_id=?

From FetchStyle

    /**
     * Performs a separate SQL select to load the indicated data.  This can either be eager (the second select is
     * issued immediately) or lazy (the second select is delayed until the data is needed).
     */
    SELECT,
    /**
     * Inherently an eager style of fetching.  The data to be fetched is obtained as part of an SQL join.
     */
    JOIN,
    /**
     * Initializes a number of indicated data items (entities or collections) in a series of grouped sql selects
     * using an in-style sql restriction to define the batch size.  Again, can be either eager or lazy.
     */
    BATCH,
    /**
     * Performs fetching of associated data (currently limited to only collections) based on the sql restriction
     * used to load the owner.  Again, can be either eager or lazy.
     */
    SUBSELECT

~~Until now, I couldn't resolve what this Javadoc means with:~~

based on the sql restriction used to load the owner

UPDATE Planky said:

Instead, it's just going to load the table at the worst, and even then, only if your initial query didn't have a where clause. So I would say that using subselect queries can unexpectedly load the whole table if you're LIMITing the results and you don't have any WHERE criteria.

This is true and it is a very important detail that I've tested in the new scenario 4.2

The query generated to fetch employees is

select employees0_.department_id as departme3_4_1_, employees0_.employee_id as employee1_5_1_, employees0_.employee_id as employee1_5_0_, employees0_.department_id as departme3_5_0_, employees0_.employee_name as employee2_5_0_ from subselect_employee employees0_ where employees0_.department_id in (select this_.department_id from subselect_department this_ where this_.department_name>=?)

The subquery inside the where clause contains the original restriction this_.department_name>=?, avoiding the load of all Employees. This is what the javadoc means with

based on the sql restriction used to load the owner

All what I've said about FetchMode.JOIN and the differences with FetchMode.SUBSELECT remains true (and also applies for FetchMode.SELECT).

来源：https://stackoverflow.com/questions/32984799/fetchmode-join-vs-subselect

标签

hibernate

jpa

join

sql-subselect