string sanitizer for filename

长情又很酷 2020-11-27 13:23

I'm looking for a PHP function that will sanitize a string and make it ready to use as a filename. Does anyone know of a handy one?

(I could write one, but I'm worri

17 answers
  • 2020-11-27 13:50

    The best option I know of today is the static method Strings::webalize from the Nette framework.

    It also transliterates diacritics to their base letters: š=>s, ü=>u, ß=>ss, etc.

    For filenames, you have to add the dot "." to the allowed-characters parameter ($charlist).

    /**
     * Converts to ASCII.
     * @param  string  UTF-8 encoding
     * @return string  ASCII
     */
    public static function toAscii($s)
    {
        static $transliterator = NULL;
        if ($transliterator === NULL && class_exists('Transliterator', FALSE)) {
            $transliterator = \Transliterator::create('Any-Latin; Latin-ASCII');
        }
    
        $s = preg_replace('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{2FF}\x{370}-\x{10FFFF}]#u', '', $s);
        $s = strtr($s, '`\'"^~?', "\x01\x02\x03\x04\x05\x06");
        $s = str_replace(
            array("\xE2\x80\x9E", "\xE2\x80\x9C", "\xE2\x80\x9D", "\xE2\x80\x9A", "\xE2\x80\x98", "\xE2\x80\x99", "\xC2\xB0"),
            array("\x03", "\x03", "\x03", "\x02", "\x02", "\x02", "\x04"), $s
        );
        if ($transliterator !== NULL) {
            $s = $transliterator->transliterate($s);
        }
        if (ICONV_IMPL === 'glibc') {
            $s = str_replace(
                array("\xC2\xBB", "\xC2\xAB", "\xE2\x80\xA6", "\xE2\x84\xA2", "\xC2\xA9", "\xC2\xAE"),
                array('>>', '<<', '...', 'TM', '(c)', '(R)'), $s
            );
            $s = @iconv('UTF-8', 'WINDOWS-1250//TRANSLIT//IGNORE', $s); // intentionally @
            $s = strtr($s, "\xa5\xa3\xbc\x8c\xa7\x8a\xaa\x8d\x8f\x8e\xaf\xb9\xb3\xbe\x9c\x9a\xba\x9d\x9f\x9e"
                . "\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3"
                . "\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8"
                . "\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
                . "\x96\xa0\x8b\x97\x9b\xa6\xad\xb7",
                'ALLSSSSTZZZallssstzzzRAAAALCCCEEEEIIDDNNOOOOxRUUUUYTsraaaalccceeeeiiddnnooooruuuuyt- <->|-.');
            $s = preg_replace('#[^\x00-\x7F]++#', '', $s);
        } else {
            $s = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $s); // intentionally @
        }
        $s = str_replace(array('`', "'", '"', '^', '~', '?'), '', $s);
        return strtr($s, "\x01\x02\x03\x04\x05\x06", '`\'"^~?');
    }
    
    
    /**
     * Converts to web safe characters [a-z0-9-] text.
     * @param  string  UTF-8 encoding
     * @param  string  allowed characters
     * @param  bool
     * @return string
     */
    public static function webalize($s, $charlist = NULL, $lower = TRUE)
    {
        $s = self::toAscii($s);
        if ($lower) {
            $s = strtolower($s);
        }
        $s = preg_replace('#[^a-z0-9' . preg_quote((string) $charlist, '#') . ']+#i', '-', $s);
        $s = trim($s, '-');
        return $s;
    }
    
  • 2020-11-27 13:52

    It seems this all hinges on one question: is it possible to create a filename that can be used to hack into a server (or do some other damage)? If not, the simple answer is to try creating the file wherever it will ultimately be used (since that will be the operating system of choice, no doubt). Let the operating system sort it out. If it complains, report that complaint back to the user as a validation error.

    This has the added benefit of being reliably portable, since all operating systems (I'm pretty sure) will complain if the filename is not properly formed for that OS.

    If it is possible to do nefarious things with a filename, perhaps there are measures that can be applied before testing the filename on the resident operating system: measures less complicated than a full "sanitization" of the filename.
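
    A minimal sketch of the "let the operating system decide" approach, assuming PHP 7 or later. The helper name tryCreateFile() is hypothetical, and the basename() call and the null-byte check are added precautions against path tricks, not part of the suggestion above.

```php
<?php
// Hypothetical helper: try to create $name inside $dir and let the OS
// accept or reject the name. Returns true if the file was created.
function tryCreateFile(string $dir, string $name): bool
{
    // A null byte is never valid in a path; PHP 8 throws on it, so reject early.
    if (strpos($name, "\0") !== false) {
        return false;
    }
    // basename() drops directory components, so "../x" cannot escape $dir.
    $name = basename($name);
    if ($name === '' || $name === '.' || $name === '..') {
        return false;
    }
    // Mode 'x' creates the file and fails if it already exists
    // or if the OS rejects the name.
    $handle = @fopen($dir . DIRECTORY_SEPARATOR . $name, 'x');
    if ($handle === false) {
        return false;
    }
    fclose($handle);
    return true;
}
```

    A false return value can then be reported to the user as a validation error.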

  • 2020-11-27 13:54

    Making a small adjustment to Tor Valamo's solution to fix the problem noticed by Dominic Rodger, you could use:

    // Remove anything which isn't a word, whitespace, number
    // or any of the following characters -_~,;[]().
    // If you don't need to handle multi-byte characters
    // you can use preg_replace rather than mb_ereg_replace
    // Thanks @Łukasz Rysiak!
    $file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
    // Remove any runs of periods (thanks falstro!)
    $file = mb_ereg_replace("([\.]{2,})", '', $file);
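
    As a usage sketch, the two replacements can be wrapped in a function and tried on a few inputs. The name sanitizeFilename() is hypothetical, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical wrapper around the two replacements above (needs ext-mbstring).
function sanitizeFilename(string $file): string
{
    // Keep word characters, whitespace, digits and -_~,;[]().
    $file = mb_ereg_replace("([^\w\s\d\-_~,;\[\]\(\).])", '', $file);
    // Drop any run of two or more periods.
    $file = mb_ereg_replace("([\.]{2,})", '', $file);
    return $file;
}

echo sanitizeFilename('../../etc/passwd'), "\n";   // etcpasswd
echo sanitizeFilename('my report (v2).txt'), "\n"; // my report (v2).txt
echo sanitizeFilename('a<b>:c?.png'), "\n";        // abc.png
```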
    
  • 2020-11-27 13:57

    Well, tempnam() will do it for you.

    http://us2.php.net/manual/en/function.tempnam.php

    but that creates an entirely new name.
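
    A short sketch of that approach, assuming the file can live in the system temp directory. The prefix 'upl' and the sample content are arbitrary choices for illustration.

```php
<?php
// Ask PHP for a unique, freshly created file in the system temp directory.
// The prefix 'upl' is an arbitrary choice for this example.
$path = tempnam(sys_get_temp_dir(), 'upl');
if ($path === false) {
    throw new RuntimeException('Could not create a temporary file.');
}
// The user-supplied name is never used on disk.
file_put_contents($path, 'uploaded data');
```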

    To sanitize an existing string, just restrict what your users can enter to letters, numbers, period, hyphen and underscore, then sanitize with a simple regex. Check which characters need to be escaped, or you could get false positives.

    $sanitized = preg_replace('/[^a-zA-Z0-9\-\._]/','', $filename);
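
    To see what this whitelist does and does not catch, here is a small sketch with sample inputs. Because periods are allowed, a bare ".." passes through unchanged; the final check is an added assumption, not part of the regex above.

```php
<?php
$pattern = '/[^a-zA-Z0-9\-\._]/';

$a = preg_replace($pattern, '', 'my file?.txt');     // "myfile.txt"
$b = preg_replace($pattern, '', '../../etc/passwd'); // "....etcpasswd"
$c = preg_replace($pattern, '', '..');               // ".." (unchanged)

// Assumed extra check: reject results that are empty or consist only of dots.
$isUsable = ($c !== '' && trim($c, '.') !== '');     // false for ".."
```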
    
  • 2020-11-27 14:00

    What about using rawurlencode()? http://www.php.net/manual/en/function.rawurlencode.php

    Here is a function that sanitizes even Chinese characters:

    public static function normalizeString ($str = '')
    {
        $str = strip_tags($str); 
        $str = preg_replace('/[\r\n\t ]+/', ' ', $str);
        $str = preg_replace('/[\"\*\/\:\<\>\?\'\|]+/', ' ', $str);
        $str = strtolower($str);
        $str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );
        $str = htmlentities($str, ENT_QUOTES, "utf-8");
        $str = preg_replace("/(&)([a-z])([a-z]+;)/i", '$2', $str);
        $str = str_replace(' ', '-', $str);
        $str = rawurlencode($str);
        $str = str_replace('%', '-', $str);
        return $str;
    }
    

    Here is the explanation:

    1. Strip HTML tags.
    2. Collapse line breaks, tabs and carriage returns into a single space.
    3. Replace characters that are illegal in folder and file names.
    4. Convert the string to lower case.
    5. Remove accents such as Éàû by converting them to HTML entities and then keeping only the base letter of each entity.
    6. Replace spaces with dashes.
    7. Encode any special characters that survived the previous steps and could conflict with filenames on the server, e.g. "中文百强网".
    8. Replace "%" with dashes so that the browser does not rewrite the file's link when requesting the file.
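
    Step 5 is the least obvious one, so here it is in isolation as a small sketch. The function name foldAccents() is hypothetical. Note that a literal "&" is encoded as "&amp;" and therefore collapses to "a".

```php
<?php
// Step 5 in isolation: turn accented letters into named HTML entities,
// then keep only the first letter of each entity name.
function foldAccents(string $str): string
{
    $str = htmlentities($str, ENT_QUOTES, 'UTF-8');
    return preg_replace('/(&)([a-z])([a-z]+;)/i', '$2', $str);
}

echo foldAccents('Éàû'), "\n";          // Eau
echo foldAccents('Crème brûlée'), "\n"; // Creme brulee
echo foldAccents('R&D'), "\n";          // RaD
```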

    OK, some filenames will not be meaningful, but in most cases it will work.

    ex. Original Name: "საბეჭდი-და-ტიპოგრაფიული.jpg"

    Output Name: "-E1-83-A1-E1-83-90-E1-83-91-E1-83-94-E1-83-AD-E1-83-93-E1-83-98--E1-83-93-E1-83-90--E1-83-A2-E1-83-98-E1-83-9E-E1-83-9D-E1-83-92-E1-83-A0-E1-83-90-E1-83-A4-E1-83-98-E1-83-A3-E1-83-9A-E1-83-98.jpg"

    That is better than a 404 error.

    Hope that was helpful.

    Carl.

  • 2020-11-27 14:00

    PHP provides a function that sanitizes text for different formats:

    filter.filters.sanitize

    How to use it:

    echo filter_var("Lorem Ipsum has been the industry's", FILTER_SANITIZE_URL);
    

    Output: LoremIpsumhasbeentheindustry's
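
    One thing worth checking before using this filter for filenames: FILTER_SANITIZE_URL keeps every character that is legal in a URL, which includes "/" and ".". A quick sketch with a sample input:

```php
<?php
// FILTER_SANITIZE_URL keeps characters that are legal in URLs,
// which includes "/" and ".".
$clean = filter_var('../../etc/passwd', FILTER_SANITIZE_URL);
echo $clean, "\n"; // ../../etc/passwd
```

    So a path-traversal string passes through unchanged, and a separate filename check is still needed.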
