I\'ve got a table of URLs and I don\'t want any duplicate URLs. How do I check to see if a given URL is already in the table using PHP/MySQL?
In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.
There are at least two definitions:
%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).www.' in one URL's authority can not simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you can not conclude that the referenced resources are the same.www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a character collation (all character collations are case-insensitive in MySQL), but the columns for the login and path need to use a binary, case-insensitive collation. Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.
EDIT: Here are example table definitions:
CREATE TABLE `urls1` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`scheme` VARCHAR(20) NOT NULL,
`canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci', /* the "ci" stands for case-insensitive. Also, we want 'utf8mb4_unicode_ci'
rather than 'utf8mb4_general_ci' because 'utf8mb4_general_ci' treats accented characters as equivalent. */
`port` INT UNSIGNED,
`canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
PRIMARY KEY (`id`),
INDEX (`canonical_host`(10), `scheme`)
) ENGINE = 'InnoDB';
CREATE TABLE `urls2` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`canonical_scheme` VARCHAR(20) NOT NULL,
`canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`port` INT UNSIGNED,
`canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
`orig_scheme` VARCHAR(20) NOT NULL,
`orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
`orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
PRIMARY KEY (`id`),
INDEX (`canonical_host`(10), `canonical_scheme`),
INDEX (`orig_host`(10), `orig_scheme`)
) ENGINE = 'InnoDB';
Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.
Unfortunately you will not be able to specify a UNIQUE constraint on the tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`) as MySQL limits the length of InnoDB keys to 767 bytes.