How do I lock read/write to MySQL tables so that I can select and then insert without other programs reading/writing to the database?

梦毁少年i 2020-12-14 06:15

I am running many instances of a webcrawler in parallel.

Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then starts crawling that domain.

5 Answers
  • 2020-12-14 06:48

    I got some inspiration from @Eljakim's answer and started this new thread where I figured out a great trick. It doesn't involve locking anything and is very simple.

    INSERT INTO crawlLog (companyId, timeStartCrawling)
    SELECT id, now()
    FROM companies
    WHERE id NOT IN
    (
        SELECT companyId
        FROM crawlLog AS crawlLogAlias
    )
    LIMIT 1
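
    If the crawler then needs to know which company it just claimed, one option, assuming crawlLog has an AUTO_INCREMENT primary key named id (an assumption, not stated in the question), is to read the row back via LAST_INSERT_ID():

    -- Returns the companyId this connection just inserted
    -- (LAST_INSERT_ID() is per-connection and the INSERT ... SELECT above adds at most one row)
    SELECT companyId FROM crawlLog WHERE id = LAST_INSERT_ID();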
    
  • 2020-12-14 06:49

    You probably don't want to lock the table. If you do that you'll have to worry about trapping errors when the other crawlers try to write to the database - which is what you were thinking when you said "...terribly complex and relies on many other things."

    Instead you should probably wrap the group of queries in a MySQL transaction (see http://dev.mysql.com/doc/refman/5.0/en/commit.html) like this:

    START TRANSACTION;
    SELECT @URL:=url FROM tablewiththeurls WHERE uncrawled=1 ORDER BY somecriterion LIMIT 1;
    INSERT INTO loggingtable SET url=@URL;
    COMMIT;
    

    Or something close to that.

    [edit] I just realized - you could probably do everything you need in a single query and not even have to worry about transactions. Something like this:

    INSERT INTO loggingtable (url)
    SELECT u.url
    FROM tablewithurls u
    LEFT JOIN loggingtable l ON l.url = u.url
    WHERE {some criterion used to pick the url to work on} AND l.url IS NULL;
    
  • 2020-12-14 06:52

    You can lock tables using the MySQL LOCK TABLES command like this:

    LOCK TABLES tablename WRITE;
    
    # Do other queries here
    
    UNLOCK TABLES;
    

    See:

    http://dev.mysql.com/doc/refman/5.5/en/lock-tables.html
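
    Applied to this question, a minimal sketch might look like the following (the companies/crawlLog names mirror another answer here and are assumptions). Note that a session holding locks from LOCK TABLES can only touch the tables it locked, so every table used between LOCK and UNLOCK must appear in the LOCK TABLES list:

    LOCK TABLES companies READ, crawlLog WRITE;

    -- Pick an unclaimed company; no other session can write crawlLog right now
    -- (a real crawler would also handle the case where no unclaimed company is found)
    SELECT id INTO @companyId
    FROM companies
    WHERE id NOT IN (SELECT companyId FROM crawlLog)
    LIMIT 1;

    INSERT INTO crawlLog (companyId, timeStartCrawling)
    VALUES (@companyId, NOW());

    UNLOCK TABLES;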

  • 2020-12-14 06:59

    Well, table locks are one way to deal with that, but they make parallel requests impossible. If the table is InnoDB you could force a row lock instead, using SELECT ... FOR UPDATE within a transaction.

    BEGIN;
    
    SELECT ... FROM your_table WHERE domainname = ... FOR UPDATE;
    
    # do whatever you have to do
    
    COMMIT;
    

    Please note that you will need an index on domainname (or whatever column you use in the WHERE clause) for this to work efficiently; without one, InnoDB may lock far more rows than intended. Such an index makes sense in general, and I assume you will have one anyway.
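
    For this question, a claim-then-log transaction might look like the sketch below (the domains table, its uncrawled flag, and the crawlLog columns are assumptions, not from the question). A second crawler that scans the same row blocks on the FOR UPDATE until the first transaction commits:

    BEGIN;

    -- Lock one unclaimed domain row; concurrent crawlers wait here
    SELECT id INTO @domainId
    FROM domains
    WHERE uncrawled = 1
    ORDER BY id
    LIMIT 1
    FOR UPDATE;

    -- Log the start time, then mark the domain as claimed
    INSERT INTO crawlLog (companyId, timeStartCrawling) VALUES (@domainId, NOW());
    UPDATE domains SET uncrawled = 0 WHERE id = @domainId;

    COMMIT;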

  • 2020-12-14 07:04

    I wouldn't use locking, or transactions.

    The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.

    Assume you have tblcrawels (cra_id) that is filled with your crawlers and tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) for your logfile.

    You would run the following query if crawler 1 wants to start crawling url 2:

    INSERT INTO tbllogging (log_cra_id, log_url_id) 
    SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id = log_url_id
    WHERE url_id=2 AND log_url_id IS NULL;
    

    The next step is to check whether this record has been inserted.

    SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1
    

    If you get a result, crawler 1 can crawl this url. If you get no result, another crawler has already inserted that row and is crawling the url.
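
    A possible shortcut (my assumption, not part of the original answer): instead of re-querying the table, check how many rows the INSERT affected, for example with ROW_COUNT() right after it on the same connection:

    -- Run immediately after the INSERT above, on the same connection
    SELECT ROW_COUNT();  -- 1 = this crawler claimed url 2; 0 = another crawler got there first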
