Question
I would like to know a few things regarding MySQL architecture.
1. How does MySQL process insert, delete, and update operations on an indexed table?
2. It is said that changes are only made in the change buffer when the index page is not in the buffer pool. So if changes are made after the buffer pool loads the concerned index page, then the same page has to be altered on disk as well, right? So does an operation have to be done in three different places?
3. How are NULL values indexed? Where would they be stored in a B+Tree?
4. If we update data that is part of the clustered index, when will it be updated on disk?
5. What happens during bulk loading?
Answer 1:
How to process insert/update/delete...
- Fetch (and cache) index block(s) needed for locating the row(s) to be updated/deleted, or the blocks where new row(s) will be inserted.
- Fetch the data block(s). Note that every secondary index includes the PRIMARY KEY columns, and the PRIMARY KEY is clustered with the data.
- Modify the data block(s) to reflect the changes. Also save the old data, in case of an eventual ROLLBACK.
- Update unique index blocks (that includes the PK).
- Store non-unique index changes in the change buffer. (As you noted.)
The change buffer (CB) is designed to be 'transparent' with respect to the actual index blocks.
- A lookup by an index will always 'do the right thing', whether the entry is in the CB or not.
- Folding of CB entries back into actual index blocks is done in the 'background' and/or when running out of room. (By default the CB may use up to 25% of the buffer pool; this is controlled by innodb_change_buffer_max_size.)
- Sufficient information is stored in the transaction log, such that a crash will not cause the loss of pending index updates.
- Clearly the CB was invented for performance. An index update can be delayed, and meanwhile, takes a lot less space (often only a few dozen bytes) than the index block (16KB) that needs updating. Multiple changes (usually) can be applied to a single index block -- This is the main savings. But note, because of randomness, UUIDs, MD5, etc, cannot make good use of the CB. A non-unique index on the current datetime/timestamp is a case where the CB's buffering really shines.
(Sorry, my knowledge of the CB is a bit vague for the level at which you are asking. I suggest you read the code.)
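To make the "multiple changes per index block" point concrete, here is a minimal toy model in Python. It is not InnoDB's actual data structure: the class `ToyChangeBuffer`, the helper `page_of`, and the figure of 100 entries per page are all assumptions made for illustration. It only counts how many distinct index pages a batch of buffered inserts would eventually have to touch.

```python
from collections import defaultdict
import random

ENTRIES_PER_PAGE = 100  # hypothetical: index entries that fit in one leaf page


def page_of(key: int) -> int:
    """Map an index key to the leaf page that would hold it (toy layout)."""
    return key // ENTRIES_PER_PAGE


class ToyChangeBuffer:
    """Holds pending secondary-index changes, grouped by target page."""

    def __init__(self) -> None:
        self.pending = defaultdict(list)

    def buffer_insert(self, key: int) -> None:
        # No index page is read here; the change is only remembered.
        self.pending[page_of(key)].append(key)

    def merge(self) -> int:
        """Apply all pending changes; return how many pages had to be read."""
        pages_read = len(self.pending)
        self.pending.clear()
        return pages_read


def pages_touched(keys) -> int:
    """Buffer every key, then merge once and report the pages read."""
    cb = ToyChangeBuffer()
    for key in keys:
        cb.buffer_insert(key)
    return cb.merge()


# 1,000 timestamp-like keys arrive in increasing order and cluster together.
sequential_keys = range(1_000_000, 1_001_000)

# 1,000 UUID-like keys are scattered over a key space of 10 million entries.
rng = random.Random(42)
random_keys = [rng.randrange(10_000_000) for _ in range(1000)]

sequential_pages = pages_touched(sequential_keys)
random_pages = pages_touched(random_keys)
print(sequential_pages, random_pages)
```

In this model the sequential keys land on only 10 pages, so each page read absorbs about 100 changes, while the random keys land on nearly as many pages as there are changes, which is why buffering buys little for UUID-like values.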
NULL... I believe that is treated as a separate value that sorts before all non-null values in the B+Tree. But to confuse the issue, there is a flag determining whether nulls are treated as equal to each other. And there are restrictions on PRIMARY/UNIQUE keys: a PRIMARY KEY column cannot be NULL, whereas a UNIQUE index permits multiple NULLs.
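As a rough illustration of NULL ordering and uniqueness, the sketch below uses SQLite through Python's built-in `sqlite3` module, purely as a stand-in that runs without a server. It is not MySQL, and its internals differ from InnoDB's; the table `t` and index `idx_code` are made-up names. The two behaviours shown (NULLs sort first in ascending order, and a unique index accepts several NULLs) are ones MySQL shares.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, code INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx_code ON t (code)")

# Two NULLs are accepted by the unique index: NULLs do not collide.
conn.executemany(
    "INSERT INTO t (id, code) VALUES (?, ?)",
    [(1, 30), (2, None), (3, 10), (4, None)],
)

# Ascending order places the NULL entries ahead of every non-null value.
codes_in_index_order = [
    row[0] for row in conn.execute("SELECT code FROM t ORDER BY code")
]
print(codes_in_index_order)  # [None, None, 10, 30]

# A duplicate non-null value, by contrast, violates the unique index.
try:
    conn.execute("INSERT INTO t (id, code) VALUES (5, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```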
Related to NULL... When doing PARTITION BY RANGE on some variant/function of DATE or DATETIME, invalid dates turn into NULL, which is explicitly stored in the 'first' partition. Newbies are often puzzled as to why partition pruning does not seem to work. (Recommended partial workaround: have a 'first' partition that is otherwise empty.)
Clustered and UNIQUE indexes... All(?) write operations must check all unique indexes, hence the CB is not involved with such. Note: In InnoDB, the PRIMARY KEY is always clustered and unique and cannot have NULLs.
Bulk loading... I find that a 100-row INSERT will run 10 times as fast as 100 individual INSERTs. (This is due to parsing, etc.) But at the low level, a batch insert or LOAD DATA is just a bunch of individual inserts. So, the above discussion applies.
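For reference, a single multi-row INSERT can be built along these lines. This is a minimal sketch: the helper `build_multirow_insert` and the table `people` are hypothetical, and SQLite (via Python's `sqlite3`) stands in for MySQL so the example runs without a server. With a MySQL driver the placeholder is typically `%s` rather than `?`. No timing is measured here; the point is only the shape of the statement.

```python
import sqlite3


def build_multirow_insert(table: str, columns: list[str], n_rows: int) -> str:
    """Return one INSERT statement carrying n_rows placeholder tuples.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    col_list = ", ".join(columns)
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    all_rows = ", ".join([one_row] * n_rows)
    return f"INSERT INTO {table} ({col_list}) VALUES {all_rows}"


rows = [(i, f"name-{i}") for i in range(100)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")

# One statement is parsed once and carries all 100 rows.
sql = build_multirow_insert("people", ["id", "name"], len(rows))
flat_params = [value for row in rows for value in row]
conn.execute(sql, flat_params)
conn.commit()

inserted = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(inserted)  # 100
```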
Bonus answers...
"IODKU" (INSERT ... ON DUPLICATE KEY UPDATE) pretty much follows the five steps above. In locating the row to update, it discovers whether to update or insert, then proceeds accordingly.
REPLACE is really a shorthand for DELETE plus INSERT. But note this anomaly... If there are two unique keys on the table, a one-row REPLACE might delete 2 rows before inserting the 1 row.
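The two-row deletion can be reproduced with a small experiment. The sketch below again uses SQLite through Python's `sqlite3` as a stand-in rather than MySQL; SQLite's REPLACE likewise removes every row that conflicts on a unique key before inserting. The `users` table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, email, name) VALUES (?, ?, ?)",
    [(1, "a@example.com", "Ann"), (2, "b@example.com", "Bob")],
)

# The new row collides with row 1 on id and with row 2 on email,
# so both existing rows are deleted before the single insert.
conn.execute(
    "REPLACE INTO users (id, email, name) VALUES (?, ?, ?)",
    (1, "b@example.com", "Cat"),
)

remaining = conn.execute(
    "SELECT id, email, name FROM users ORDER BY id"
).fetchall()
print(remaining)  # [(1, 'b@example.com', 'Cat')]
```

Two rows went in, one REPLACE ran, and one row is left: Bob's row disappeared even though the statement named id 1.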
Source: https://stackoverflow.com/questions/42367493/what-happens-during-the-insertion-deletion-and-update-in-sql