How to check if a value already exists to avoid duplicates?

小鲜肉 2020-12-02 23:13

I've got a table of URLs and I don't want any duplicate URLs. How do I check whether a given URL is already in the table using PHP/MySQL?

17 Answers
  •  情书的邮戳
    2020-12-02 23:29

    In considering a solution to this problem, you need to first define what a "duplicate URL" means for your project. This will determine how to canonicalize the URLs before adding them to the database.

    There are at least two definitions:

    1. Two URLs are considered duplicates if they identify the same resource, judged without any knowledge of the web service that generates the content. Some considerations include:
      • The scheme and domain name portion of the URLs are case-insensitive, so HTTP://WWW.STACKOVERFLOW.COM/ is the same as http://www.stackoverflow.com/.
      • If one URL specifies a port, but it is the conventional port for the scheme and they are otherwise equivalent, then they are the same (http://www.stackoverflow.com/ and http://www.stackoverflow.com:80/).
      • If the parameters in the query string are simple rearrangements and the parameter names are all different, then they are the same; e.g. http://authority/?a=test&b=test and http://authority/?b=test&a=test. Note that http://authority/?a%5B%5D=test1&a%5B%5D=test2 is not the same, by this first definition of sameness, as http://authority/?a%5B%5D=test2&a%5B%5D=test1.
      • If the scheme is HTTP or HTTPS, then the hash portions of the URLs can be removed, as this portion of the URL is not sent to the web server.
      • A shortened IPv6 address can be expanded.
      • If the path is empty, append a forward slash after the authority (http://authority becomes http://authority/).
      • Unicode canonicalization changes the referenced resource; e.g. you can't conclude that http://google.com/?q=%C3%84 (%C3%84 represents 'Ä' in UTF-8) is the same as http://google.com/?q=A%CC%88 (%CC%88 represents U+0308, COMBINING DIAERESIS).
      • If the scheme is HTTP or HTTPS, 'www.' in one URL's authority cannot simply be removed if the two URLs are otherwise equivalent, as the text of the domain name is sent as the value of the Host HTTP header, and some web servers use virtual hosts to send back different content based on this header. More generally, even if the domain names resolve to the same IP address, you cannot conclude that the referenced resources are the same.
    2. Two URLs are considered duplicates if they are the same after you apply basic URL canonicalization (e.g. lowercase the scheme and domain name, supply the default port, stable sort query parameters by parameter name, remove the hash portion in the case of HTTP and HTTPS, ...) and also take into account knowledge of the web service. Maybe you will assume that all web services are smart enough to canonicalize Unicode input (Wikipedia is, for example), so you can apply Unicode Normalization Form Canonical Composition (NFC). You would strip 'www.' from all Stack Overflow URLs. You could use PostRank's postrank-uri code, ported to PHP, to remove all sorts of pieces of the URLs that are unnecessary (e.g. &utm_source=...).
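
    The service-independent rules of definition 1 can be sketched roughly as follows. This is only an illustration in Python (the same steps map onto PHP's parse_url); the function name `canonicalize` and the `DEFAULT_PORTS` table are assumptions, not part of the original answer. It drops a default port rather than supplying it (either direction works as long as it is applied consistently), and it does not expand shortened IPv6 addresses.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only HTTP and HTTPS are handled specially in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Apply the service-independent rules of definition 1 (sketch)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()            # scheme is case-insensitive
    host = (parts.hostname or "").lower()    # host is case-insensitive
    port = parts.port
    if port == DEFAULT_PORTS.get(scheme):
        port = None                          # conventional port adds nothing

    # Rebuild the authority; the login (userinfo) is kept byte-for-byte.
    netloc = f"[{host}]" if ":" in host else host   # IPv6 literals need brackets
    if port is not None:
        netloc += f":{port}"
    userinfo, at, _ = parts.netloc.rpartition("@")
    if at:
        netloc = f"{userinfo}@{netloc}"

    path = parts.path or "/"                 # empty path becomes "/"

    # Reorder query parameters only when every parameter name is distinct;
    # repeated names such as a[]=1&a[]=2 are order-sensitive and left alone.
    segments = parts.query.split("&")
    names = [s.split("=", 1)[0] for s in segments]
    if len(set(names)) == len(names):
        segments.sort(key=lambda s: s.split("=", 1)[0])
    query = "&".join(segments)

    # The fragment is never sent to the server for HTTP(S).
    fragment = "" if scheme in DEFAULT_PORTS else parts.fragment
    return urlunsplit((scheme, netloc, path, query, fragment))
```

    With this sketch, HTTP://WWW.STACKOVERFLOW.COM/ and http://www.stackoverflow.com:80/ both come out as http://www.stackoverflow.com/, while the two a%5B%5D examples above stay distinct. Note that 'www.' is deliberately left in place.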

    Definition 1 leads to a stable solution (i.e. there is no further canonicalization that can be performed and the canonicalization of a URL will not change). Definition 2, which I think is what a human considers the definition of URL canonicalization, leads to a canonicalization routine that can yield different results at different moments in time.
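
    To make the instability of definition 2 concrete, here is a small Python sketch in which the rule set grows over time. The rule functions and the names `RULES_V1` and `RULES_V2` are hypothetical, and the input is assumed to be already canonical under definition 1 (lowercase host, no login).

```python
from urllib.parse import urlsplit, urlunsplit


def strip_www_stackoverflow(url: str) -> str:
    """Service knowledge: Stack Overflow serves the same content without 'www.'."""
    parts = urlsplit(url)
    if parts.netloc == "www.stackoverflow.com":
        parts = parts._replace(netloc="stackoverflow.com")
    return urlunsplit(parts)


def drop_tracking_params(url: str) -> str:
    """Remove utm_* query parameters, which do not affect the resource."""
    parts = urlsplit(url)
    kept = [s for s in parts.query.split("&") if not s.startswith("utm_")]
    return urlunsplit(parts._replace(query="&".join(kept)))


# Hypothetical rule sets: V2 is what the project uses after a rule was added.
RULES_V1 = [strip_www_stackoverflow]
RULES_V2 = [strip_www_stackoverflow, drop_tracking_params]


def apply_rules(url: str, rules) -> str:
    """Run each rule in order over the URL."""
    for rule in rules:
        url = rule(url)
    return url
```

    The same input, http://www.stackoverflow.com/questions?utm_source=feed&tab=votes, canonicalizes to a URL that still carries utm_source under `RULES_V1` and to http://stackoverflow.com/questions?tab=votes under `RULES_V2`, so rows stored under the old rules would need to be recomputed from the original URL.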

    Whichever definition you choose, I suggest that you use separate columns for the scheme, login, host, port, and path portions. This will allow you to use indexes intelligently. The columns for scheme and host can use a case-insensitive character collation (in MySQL, the collations whose names end in `_ci`), but the columns for the login and path need to use a binary, case-sensitive collation (such as `utf8mb4_bin`). Also, if you use Definition 2, you need to preserve the original scheme, authority, and path portions, as certain canonicalization rules might be added or removed from time to time.

    EDIT: Here are example table definitions:

    CREATE TABLE `urls1` (
        `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `scheme` VARCHAR(20) NOT NULL,
        `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        /* The "ci" stands for case-insensitive. 'utf8mb4_unicode_ci' is used rather than
           'utf8mb4_general_ci' because it follows the Unicode Collation Algorithm more closely
           (e.g. it handles expansions such as 'ß' = 'ss'). Note that both collations compare
           accented and unaccented letters as equal. */
        `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
        `port` INT UNSIGNED,
        `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        PRIMARY KEY (`id`),
        INDEX (`canonical_host`(10), `scheme`)
    ) ENGINE = 'InnoDB';
    
    
    CREATE TABLE `urls2` (
        `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
        `canonical_scheme` VARCHAR(20) NOT NULL,
        `canonical_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        `canonical_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
        `port` INT UNSIGNED,
        `canonical_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        `orig_scheme` VARCHAR(20) NOT NULL, 
        `orig_login` VARCHAR(100) DEFAULT NULL COLLATE 'utf8mb4_bin',
        `orig_host` VARCHAR(100) NOT NULL COLLATE 'utf8mb4_unicode_ci',
        `orig_path` VARCHAR(4096) NOT NULL COLLATE 'utf8mb4_bin',
    
        PRIMARY KEY (`id`),
        INDEX (`canonical_host`(10), `canonical_scheme`),
        INDEX (`orig_host`(10), `orig_scheme`)
    ) ENGINE = 'InnoDB';
    

    Table `urls1` is for storing canonical URLs according to definition 1. Table `urls2` is for storing canonical URLs according to definition 2.
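
    As a rough illustration of how a canonical URL could be mapped onto the columns of `urls1` and then looked up, consider the Python sketch below. The names `UrlRow`, `to_row`, and `LOOKUP_SQL` are assumptions. So is the choice to store the query string inside `canonical_path`, since the tables have no separate query column. The `%s` placeholders follow the style of common Python MySQL drivers; in PHP the same statement would use PDO's `?` placeholders.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class UrlRow(NamedTuple):
    scheme: str
    canonical_login: Optional[str]   # NULL when the URL has no userinfo
    canonical_host: str
    port: Optional[int]              # NULL when no explicit port is present
    canonical_path: str              # assumption: path plus "?query" if any


def to_row(canonical_url: str) -> UrlRow:
    """Split an already-canonicalized URL into the urls1 columns."""
    parts = urlsplit(canonical_url)
    userinfo, at, _ = parts.netloc.rpartition("@")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return UrlRow(parts.scheme, userinfo if at else None,
                  parts.hostname or "", parts.port, path)


# Existence check. <=> is MySQL's NULL-safe equality operator, needed because
# canonical_login and port may be NULL. The host and scheme conditions can use
# the (canonical_host, scheme) index from the table definition.
LOOKUP_SQL = """
SELECT id FROM urls1
WHERE canonical_host = %s AND scheme = %s
  AND port <=> %s AND canonical_login <=> %s
  AND canonical_path = %s
LIMIT 1
"""
```

    The parameters would be passed in the order host, scheme, port, login, path. If the query returns no row, the URL is not yet in the table.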

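    Unfortunately, you will not be able to specify a UNIQUE constraint on the full tuple (`scheme`/`canonical_scheme`, `canonical_login`, `canonical_host`, `port`, `canonical_path`), as MySQL limits the length of InnoDB index keys: 767 bytes per column with the COMPACT and REDUNDANT row formats, and 3072 bytes with DYNAMIC and COMPRESSED. A `canonical_path` of up to 4096 utf8mb4 characters exceeds either limit, and a UNIQUE index on a column prefix would only enforce uniqueness of that prefix.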
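
    One common workaround, which is not part of the tables above, is to store a fixed-length digest of the canonical tuple in an extra column and put the UNIQUE index on that. The Python sketch below assumes a hypothetical `url_hash` BINARY(32) column and a hypothetical helper `url_key`. It also assumes the inputs are already canonical (in particular, a lowercased host).

```python
import hashlib
from typing import Optional


def url_key(scheme: str, login: Optional[str], host: str,
            port: Optional[int], path: str) -> bytes:
    """Return a 32-byte SHA-256 digest identifying the canonical tuple."""
    fields = [scheme, login, host, None if port is None else str(port), path]
    digest = hashlib.sha256()
    for field in fields:
        if field is None:
            digest.update(b"\x00")           # marker for NULL, distinct from ""
        else:
            data = field.encode("utf-8")
            # Length prefix keeps ("ab", "c") and ("a", "bc") apart.
            digest.update(b"\x01" + len(data).to_bytes(4, "big") + data)
    return digest.digest()
```

    With a UNIQUE index on such a column, inserting a second row with the same canonical tuple would fail with a duplicate-key error, which the application can treat as "already present". The digest is short enough to fit under the index key limits regardless of row format.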
