I am running many instances of a webcrawler in parallel.
Each crawler selects a domain from a table, inserts that url and a start time into a log table, and then sta
I wouldn't use locking or transactions.
The easiest way to go is to INSERT a record in the logging table if it's not yet present, and then check for that record.
Assume you have a table tblcrawels (cra_id) that is filled with your crawlers, a table tblurl (url_id) that is filled with the URLs, and a table tbllogging (log_cra_id, log_url_id) that serves as your log.
You would run the following query if crawler 1 wants to start crawling url 2:
INSERT INTO tbllogging (log_cra_id, log_url_id)
SELECT 1, url_id FROM tblurl LEFT JOIN tbllogging ON url_id=log_url_id
WHERE url_id=2 AND log_url_id IS NULL;
The next step is to check whether this record has been inserted:
SELECT * FROM tbllogging WHERE log_url_id=2 AND log_cra_id=1;
If you get a result, crawler 1 can crawl this URL. If you don't get any results, another crawler has already inserted a record for the same URL and is crawling it.
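To make the two steps concrete, here is a minimal sketch of the insert-then-check pattern using Python's built-in sqlite3 module and an in-memory database. The function name try_claim, the column types, and the UNIQUE constraint on log_url_id are my own assumptions and not part of the schema above. The constraint is only a safety net in case two inserts race on your database, so treat this as an illustration rather than a finished implementation.

```python
import sqlite3


def try_claim(conn, crawler_id, url_id):
    """Return True if crawler_id holds the log record for url_id (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back on error
            # Step 1: insert a log record only if none exists yet for this URL.
            conn.execute(
                """
                INSERT INTO tbllogging (log_cra_id, log_url_id)
                SELECT ?, url_id
                FROM tblurl
                LEFT JOIN tbllogging ON url_id = log_url_id
                WHERE url_id = ? AND log_url_id IS NULL
                """,
                (crawler_id, url_id),
            )
    except sqlite3.IntegrityError:
        # Assumed UNIQUE constraint fired: another crawler inserted first.
        return False
    # Step 2: check whether the record for this crawler and URL is present.
    row = conn.execute(
        "SELECT 1 FROM tbllogging WHERE log_url_id = ? AND log_cra_id = ?",
        (url_id, crawler_id),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblcrawels (cra_id INTEGER PRIMARY KEY);
    CREATE TABLE tblurl (url_id INTEGER PRIMARY KEY);
    -- UNIQUE is an assumption added for this sketch, not part of the original schema.
    CREATE TABLE tbllogging (log_cra_id INTEGER, log_url_id INTEGER UNIQUE);
    INSERT INTO tblcrawels VALUES (1), (2);
    INSERT INTO tblurl VALUES (2);
    """
)

first = try_claim(conn, 1, 2)   # crawler 1 tries to claim url 2
second = try_claim(conn, 2, 2)  # crawler 2 tries the same url afterwards
```

In this sketch the second crawler's insert selects no rows, because the LEFT JOIN already finds a log record for url 2, so its check comes back empty and it should move on to another URL.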