MySQL Debezium connector for RDS in production caused deadlocks

Submitted by 北城以北 on 2020-03-05 04:12:09

Question


We are creating a data pipeline from MySQL in RDS to Elasticsearch to build search indexes, using Debezium CDC with its MySQL source connector and the Elasticsearch sink connector.

Since the MySQL instance is in RDS, we had to grant the MySQL user the LOCK TABLES privilege for the two tables we wanted CDC on, as mentioned in the docs.
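For context, a sketch of the grants the Debezium MySQL connector documentation calls for (user and schema names here are hypothetical). On hosted platforms such as RDS the connector cannot take a global read lock, so it falls back to table-level locks during the snapshot, which is why the LOCK TABLES privilege comes into play:

```sql
-- Privileges for reading the binlog and taking the initial snapshot.
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
    ON *.* TO 'debezium'@'%';

-- Table-level locking for the snapshot; LOCK TABLES is grantable at the
-- database level, so one grant covers both CDC tables.
GRANT LOCK TABLES ON inventory.* TO 'debezium'@'%';
```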

We also have various other MySQL users performing transactions that may touch either of those two tables.

As soon as we connected the MySQL connector to our production database, locks started piling up and our whole system went down. Once we realised this we stopped Kafka and removed the connector, but the number of waiting locks kept increasing. The situation only resolved after we stopped all new queries by halting our production code and manually killing the remaining processes.

What could be the potential cause of this, and how can we prevent it?


Answer 1:


Use a read replica so the LOCK TABLES statement never runs against production. And why does Debezium need to lock tables at all? All CDC tools fetch the change events from the binlogs.




Answer 2:


I'm only guessing because I don't know your query traffic, but I would assume the locks you saw increasing were the backlog of queries waiting for the table locks to be released.

The following sequence is what I believe happened:

  1. Debezium acquires table locks on your two tables.
  2. The application is still working and keeps trying to execute queries that access those locked tables. The queries begin waiting for the locks to be released, and they will wait for up to one year, the default lock_wait_timeout value (see the SQL sketch after this list).
  3. As you spend some minutes trying to figure out why your site is not responding, a large number of blocked queries accumulate, potentially as many as max_connections. Once all allowed connections are occupied by blocked queries, the application cannot connect to MySQL at all.
  4. Finally you stop the Debezium process that is trying to read its initial snapshot of data. It releases its table locks.
  5. Immediately when the table locks are released, the waiting queries can proceed.

    • But many of them do need to acquire locks too, if they are INSERT/UPDATE/DELETE/REPLACE or if they are SELECT ... FOR UPDATE or other locking statements.
    • Since there are so many of these queries queued up, it's more likely for them to be requesting locks that overlap, which means they have to wait for each other to finish and release their locks.
    • Also, because there are hundreds of queries executing at the same time, they overtax system resources like CPU, causing high system load, which slows them all down further. Queries therefore take longer to complete, and when they block each other they have to wait longer still.
  6. Meanwhile the application is still trying to accept requests, and therefore keeps adding more queries to execute. These are subject to the same queueing and resource exhaustion.

  7. Eventually you stop the application, which at least allows the queue of waiting queries to gradually drain. As the system load goes down, MySQL is able to process the remaining queries more efficiently and finishes them all fairly soon.
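A sketch of how this failure mode can be confirmed and bounded on the MySQL side (the quoted defaults are stock MySQL, so check your RDS parameter group):

```sql
-- The two variables that shape this failure mode.
SHOW VARIABLES LIKE 'lock_wait_timeout';  -- stock default: 31536000 s (1 year)
SHOW VARIABLES LIKE 'max_connections';    -- stock default: 151

-- During the incident: what is every connection waiting on?
SHOW FULL PROCESSLIST;

-- MySQL 5.7+: list sessions stuck behind a table/metadata lock
-- (needs the wait/lock/metadata/sql/mdl instrument, which is on by
-- default in 8.0 but must be enabled explicitly in 5.7).
SELECT object_schema, object_name, lock_type, lock_status, owner_thread_id
FROM performance_schema.metadata_locks
WHERE lock_status = 'PENDING';

-- A much shorter timeout turns a site-wide pile-up into individual
-- query errors that the application can retry.
SET GLOBAL lock_wait_timeout = 30;  -- seconds; tune to your workload
```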

The suggestion in the other answer to use a read replica for your Debezium snapshot is a good one. If your application can read from the master MySQL instance for a while, no query will be blocked on the replica while Debezium holds its locks. Eventually Debezium finishes reading all the data, releases the locks, and goes on to read only the binlog. The app can then resume using the replica as a read instance.
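A sketch of the connector registration with the snapshot source pointed at the replica; all hostnames and names below are hypothetical, and the only difference from a normal setup is that database.hostname refers to the read replica rather than the master:

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mydb-replica.abc123.us-east-1.rds.amazonaws.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "********",
    "database.server.id": "184054",
    "database.server.name": "prod-mysql",
    "table.whitelist": "inventory.orders,inventory.customers",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```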

If your binlog uses GTIDs, you should be able to make a CDC tool like Debezium read the snapshot from the replica and then, once that's done, switch to the master to read the binlog. If you don't use GTIDs, it's a little more tricky: the tool would have to know the binlog position on the master corresponding to the snapshot on the replica.
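You can check which case applies directly in MySQL (a sketch; SHOW MASTER STATUS works on a replica too and reports that server's own binlog coordinates):

```sql
-- ON means positions can be expressed as server-independent GTID sets,
-- so a snapshot taken on the replica maps cleanly to the master's binlog.
SELECT @@gtid_mode;

-- Current binlog file/position and, with GTIDs on, the executed GTID set.
SHOW MASTER STATUS;
```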




Answer 3:


If the locking is a problem and you can afford to trade some snapshot consistency for it, take a look at the snapshot.locking.mode config option.
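For illustration, a fragment of the connector config with the mode set (other required keys omitted). The documented values are minimal (the default, locks held only while the schema is read), extended (locks held for the whole snapshot), and none (no locks at all, which yields a consistent snapshot only if no schema changes happen while it runs):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "snapshot.locking.mode": "none"
  }
}
```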



Source: https://stackoverflow.com/questions/58232910/mysql-debezium-connector-for-rds-in-production-caused-deadlocks
