Optimize a query that groups results by a field from the joined table

Submitted by 孤人 on 2019-12-11 14:51:08

Question


I've got a very simple query that has to group the results by a field from the joined table:

SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p ON s.id = p.sales_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
GROUP BY p.name

Table ycs_products is actually sales_products; it lists the products in each sale. I want to see the share of each product sold over a period of time.

The current query takes 2 seconds, which is too slow for user interaction. I need to make this query run fast. Is there a way to get rid of Using temporary without denormalization?

The join order is critically important: there is a lot of data in both tables, and limiting the number of records by date is an unquestionable prerequisite.

Here is the EXPLAIN result:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: range
possible_keys: PRIMARY,dtm
          key: dtm
      key_len: 6
          ref: NULL
         rows: 1164728
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: p
         type: ref
possible_keys: sales_id
          key: sales_id
      key_len: 5
          ref: test.s.id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

and the same in JSON:

EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "filesort": {
      "sort_key": "p.`name`",
      "temporary_table": {
        "table": {
          "table_name": "s",
          "access_type": "range",
          "possible_keys": ["PRIMARY", "dtm"],
          "key": "dtm",
          "key_length": "6",
          "used_key_parts": ["dtm"],
          "rows": 1164728,
          "filtered": 100,
          "attached_condition": "s.dtm between '2018-02-16 00:00:00' and '2018-02-22 23:59:59'",
          "using_index": true
        },
        "table": {
          "table_name": "p",
          "access_type": "ref",
          "possible_keys": ["sales_id"],
          "key": "sales_id",
          "key_length": "5",
          "used_key_parts": ["sales_id"],
          "ref": ["test.s.id"],
          "rows": 1,
          "filtered": 100
        }
      }
    }
  }
}

as well as the CREATE TABLE statements, though I find them unnecessary:

    CREATE TABLE `ycs_sales` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `dtm` datetime DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `dtm` (`dtm`)
    ) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1
    CREATE TABLE `ycs_products` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `sales_id` int(11) DEFAULT NULL,
      `name` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `sales_id` (`sales_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=2332802 DEFAULT CHARSET=latin1

And also some PHP code to replicate the test environment:

// assumes an existing PDO connection, e.g. $pdo = new PDO("mysql:host=localhost;dbname=test", $user, $pass);
#$pdo->query("set global innodb_flush_log_at_trx_commit = 2");
$pdo->query("create table ycs_sales (id int auto_increment primary key, dtm datetime)");
$stmt = $pdo->prepare("insert into ycs_sales values (null, ?)");
// one row per second for February 2018 (~2.3M rows)
foreach (range(mktime(0,0,0,2,1,2018), mktime(0,0,0,2,28,2018)) as $stamp){
    $stmt->execute([date("Y-m-d H:i:s", $stamp)]);
}
$max_id = $pdo->lastInsertId();
$pdo->query("alter table ycs_sales add key(dtm)");

$pdo->query("create table ycs_products (id int auto_increment primary key, sales_id int, name varchar(255))");
$stmt = $pdo->prepare("insert into ycs_products values (null, ?, ?)");
$products = ['food', 'drink', 'vape'];
foreach (range(1, $max_id) as $id){
    $stmt->execute([$id, $products[rand(0,2)]]);
}
$pdo->query("alter table ycs_products add key(sales_id)");

Answer 1:


The problem is that grouping by name makes you lose the sales_id information, therefore MySQL is forced to use a temporary table.

Although it's not the cleanest of solutions, and one of my least favorite approaches, you could add a new index on both the name and the sales_id columns:

ALTER TABLE `yourdb`.`ycs_products` 
ADD INDEX `name_sales_id_idx` (`name` ASC, `sales_id` ASC);

and force the query to use this index, with either FORCE INDEX or USE INDEX:

SELECT SQL_NO_CACHE p.name, COUNT(1) FROM ycs_sales s
INNER JOIN ycs_products p use index(name_sales_id_idx) ON s.id = p.sales_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
GROUP BY p.name;

My execution reported only "Using where; Using index" on table p and "Using where" on table s.

Anyway, I strongly suggest you rethink your schema, because you will probably find a better design for these two tables. On the other hand, if this is not a critical part of your application, you can live with the "forced" index.

EDIT

Since it's quite clear that the problem is in the design, I suggest modelling the relationship as a many-to-many. If you have a chance to verify it in your testing environment, here's what I would do:

1) Create a temporary table to store just the name and id of each product:

create temporary table tmp_prods
select min(id) id, name
from ycs_products
group by name;

2) Starting from the temporary table, create a replacement for ycs_products:

create table ycs_products_new
select * from tmp_prods;

ALTER TABLE `poc`.`ycs_products_new` 
CHANGE COLUMN `id` `id` INT(11) NOT NULL ,
ADD PRIMARY KEY (`id`);

3) Create the join table:

CREATE TABLE `prod_sale` (
`prod_id` INT(11) NOT NULL,
`sale_id` INT(11) NOT NULL,
PRIMARY KEY (`prod_id`, `sale_id`),
INDEX `sale_fk_idx` (`sale_id` ASC),
CONSTRAINT `prod_fk`
  FOREIGN KEY (`prod_id`)
  REFERENCES ycs_products_new (`id`)
  ON DELETE NO ACTION
  ON UPDATE NO ACTION,
CONSTRAINT `sale_fk`
  FOREIGN KEY (`sale_id`)
  REFERENCES ycs_sales (`id`)
  ON DELETE NO ACTION
  ON UPDATE NO ACTION);

and fill it with the existing values:

insert into prod_sale (prod_id, sale_id)
select tmp_prods.id, sales_id from ycs_sales s
inner join ycs_products p
on p.sales_id=s.id
inner join tmp_prods on tmp_prods.name=p.name;

Finally, the join query:

select name, count(name) from ycs_products_new p
inner join prod_sale ps on ps.prod_id=p.id
inner join ycs_sales s on s.id=ps.sale_id 
WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59'
group by p.id;

Please note that the GROUP BY is on the primary key, not on the name.
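The equivalence of the two shapes is easy to sanity-check on a toy dataset. The sketch below uses Python with SQLite purely for portability (the original is MySQL/InnoDB); table names follow the answer, and the sample rows are invented. It builds both schemas and confirms the normalized query returns the same per-product counts as the original join:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Original denormalized schema: one product-name row per sale.
cur.execute("CREATE TABLE ycs_sales (id INTEGER PRIMARY KEY, dtm TEXT)")
cur.execute("CREATE TABLE ycs_products (id INTEGER PRIMARY KEY, sales_id INT, name TEXT)")

sales = [(1, '2018-02-16 10:00:00'), (2, '2018-02-17 11:00:00'),
         (3, '2018-02-20 12:00:00'), (4, '2018-03-01 09:00:00')]  # last one outside the range
prods = [(1, 1, 'food'), (2, 2, 'food'), (3, 3, 'drink'), (4, 4, 'vape')]
cur.executemany("INSERT INTO ycs_sales VALUES (?, ?)", sales)
cur.executemany("INSERT INTO ycs_products VALUES (?, ?, ?)", prods)

# Normalized many-to-many schema from the answer.
cur.execute("CREATE TABLE ycs_products_new (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE prod_sale (prod_id INT, sale_id INT,
               PRIMARY KEY (prod_id, sale_id))""")
cur.execute("""INSERT INTO ycs_products_new
               SELECT MIN(id), name FROM ycs_products GROUP BY name""")
cur.execute("""INSERT INTO prod_sale
               SELECT n.id, p.sales_id FROM ycs_products p
               JOIN ycs_products_new n ON n.name = p.name""")

where = "WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'"
old = cur.execute(f"""SELECT p.name, COUNT(1) FROM ycs_sales s
                      JOIN ycs_products p ON s.id = p.sales_id {where}
                      GROUP BY p.name""").fetchall()
new = cur.execute(f"""SELECT p.name, COUNT(p.name) FROM ycs_products_new p
                      JOIN prod_sale ps ON ps.prod_id = p.id
                      JOIN ycs_sales s ON s.id = ps.sale_id {where}
                      GROUP BY p.id""").fetchall()
print(sorted(old), sorted(new))  # both: [('drink', 1), ('food', 2)]
```

The only behavioral difference is that grouping now happens on an integer primary key instead of a varchar from the fat join, which is exactly what lets the optimizer avoid the temporary table.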

Explain output:

explain select name, count(name) from ycs_products_new p inner join prod_sale ps on ps.prod_id=p.id inner join ycs_sales s on s.id=ps.sale_id  WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND  '2018-02-22 23:59:59' group by p.id;
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
| id   | select_type | table | type   | possible_keys       | key     | key_len | ref             | rows | Extra       |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+
|    1 | SIMPLE      | p     | index  | PRIMARY             | PRIMARY | 4       | NULL            |    3 |             |
|    1 | SIMPLE      | ps    | ref    | PRIMARY,sale_fk_idx | PRIMARY | 4       | test.p.id       |    1 | Using index |
|    1 | SIMPLE      | s     | eq_ref | PRIMARY,dtm         | PRIMARY | 4       | test.ps.sale_id |    1 | Using where |
+------+-------------+-------+--------+---------------------+---------+---------+-----------------+------+-------------+



Answer 2:


Why have a separate id for ycs_products? It seems like sales_id should be the PRIMARY KEY of that table.

If that is possible, it eliminates the performance problem by getting rid of the issues brought up by senape.

If, instead, there are multiple rows for each sales_id, then changing the secondary index to this would help:

INDEX(sales_id, name)
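The column order matters here: the join probes on sales_id, and carrying name in the same index makes it a covering index, so the query never has to touch the table rows of p at all. A small illustration, again using SQLite through Python as a stand-in for MySQL (SQLite reports the same effect as a "COVERING INDEX" in its query plan):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ycs_sales (id INTEGER PRIMARY KEY, dtm TEXT)")
cur.execute("CREATE TABLE ycs_products (id INTEGER PRIMARY KEY, sales_id INT, name TEXT)")
# Composite index suggested above: sales_id first (for the join probe),
# name second (so the query is satisfied from the index alone).
cur.execute("CREATE INDEX prod_sale_name ON ycs_products (sales_id, name)")

plan = cur.execute("""EXPLAIN QUERY PLAN
    SELECT p.name, COUNT(1) FROM ycs_sales s
    JOIN ycs_products p ON s.id = p.sales_id
    WHERE s.dtm BETWEEN '2018-02-16 00:00:00' AND '2018-02-22 23:59:59'
    GROUP BY p.name""").fetchall()
details = " ".join(row[-1] for row in plan)
print(details)
covering = "COVERING INDEX" in details  # p is read from the index only
print(covering)
```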

Another thing to check is innodb_buffer_pool_size. It should be about 70% of available RAM. This improves the cacheability of data and indexes.

Are there really 1.1 million rows in that one week?



Source: https://stackoverflow.com/questions/48985125/optimize-a-query-that-group-results-by-a-field-from-the-joined-table
