string sanitizer for filename

长情又很酷 2020-11-27 13:23

I'm looking for a PHP function that will sanitize a string and make it ready to use as a filename. Does anyone know of a handy one?

(I could write one, but I'm worri

17 answers
  • 2020-11-27 14:07

    These may be a bit heavy, but they're flexible enough to sanitize any string into a "safe" English-style filename or folder name (or, if you bend them a little, even scrubbed slugs and similar things).

    1) Building a full filename (with fallback name in case input is totally truncated):

    str_file($raw_string, $word_separator, $file_extension, $fallback_name, $length);
    

    2) Or using just the filter util without building a full filename (strict mode true will not allow [] or () in filename):

    str_file_filter($string, $separator, $strict, $length);
    

    3) And here are those functions:

    // Returns filesystem-safe string after cleaning, filtering, and trimming input
    function str_file_filter(
        $str,
        $sep = '_',
        $strict = false,
        $trim = 248) {
    
        $str = strip_tags(htmlspecialchars_decode(strtolower($str))); // lowercase -> decode -> strip tags
        $str = str_replace("%20", ' ', $str); // convert rogue %20s into spaces
        $str = preg_replace("/%[a-z0-9]{1,2}/i", '', $str); // remove hexy things
        $str = str_replace("&nbsp;", ' ', $str); // convert all nbsp into space
        $str = preg_replace("/&#?[a-z0-9]{2,8};/i", '', $str); // remove the other non-tag things
        $str = preg_replace("/\s+/", $sep, $str); // filter multiple spaces
        $str = preg_replace("/\.+/", '.', $str); // filter multiple periods
        $str = preg_replace("/^\.+/", '', $str); // trim leading period
    
        if ($strict) {
            $str = preg_replace("/([^\w\d\\" . $sep . ".])/", '', $str); // only allow words and digits
        } else {
            $str = preg_replace("/([^\w\d\\" . $sep . "\[\]\(\).])/", '', $str); // allow words, digits, [], and ()
        }
    
        $str = preg_replace("/\\" . $sep . "+/", $sep, $str); // filter multiple separators
        $str = substr($str, 0, $trim); // trim filename to desired length, note 255 char limit on windows
    
        return $str;
    }
    
    
    // Returns full file name including fallback and extension
    function str_file(
        $str,
        $sep = '_',
        $ext = '',
        $default = '',
        $trim = 248) {
    
        // Run $str and/or $ext through filters to clean up strings
        $str = str_file_filter($str, $sep, false, $trim); // pass $trim so the length limit is applied
        $ext = str_file_filter($ext, '', true);
    
        // Default file name in case all chars are trimmed from $str, then ensure there is an id at tail
        if (empty($str) && empty($default)) {
            $str = 'no_name__' . date('Y-m-d_H-i_A') . '__' . uniqid(); // 'i' is minutes ('m' would repeat the month)
        } elseif (empty($str)) {
            $str = $default;
        }
    
        // Return completed string, appending the extension only if one survived filtering
        if ($ext !== '') {
            return $str . '.' . $ext;
        } else {
            return $str;
        }
    }
    

    So let's say some user input is: .....&lt;div&gt;&lt;/div&gt;<script></script>&amp; Weiß Göbel 中文百强网File name %20 %20 %21 %2C Décor \/. /. . z \... y \...... x ./ “This name” is & 462^^ not &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული

    Suppose we want to convert it into something friendlier to name a tar.gz, with a maximum filename length of 255 chars. Here is an example use. Note: this example deliberately uses a malformed tar.gz extension as a proof of concept; you should still check the extension against your whitelist(s) after the string is built.

    $raw_str = '.....&lt;div&gt;&lt;/div&gt;<script></script>&amp; Weiß Göbel 中文百强网File name  %20   %20 %21 %2C Décor  \/.  /. .  z \... y \...... x ./  “This name” is & 462^^ not &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = that grrrreat -][09]()1234747) საბეჭდი-და-ტიპოგრაფიული';
    $fallback_str = 'generated_' . date('Y-m-d_H-i_A'); // 'i' is minutes ('m' would repeat the month)
    $bad_extension = '....t&+++a()r.gz[]';
    
    echo str_file($raw_str, '_', $bad_extension, $fallback_str);
    

    The output would be: _wei_gbel_file_name_dcor_._._._z_._y_._x_._this_name_is_462_not_that_grrrreat_][09]()1234747)_.tar.gz

    You can play with it here: https://3v4l.org/iSgi8

    Or a Gist: https://gist.github.com/dhaupin/b109d3a8464239b7754a

    EDIT: updated script filter for &nbsp; instead of space, updated 3v4l link

  • 2020-11-27 14:08
    $file = preg_replace('/[^\w\s\d.\-_~,;:\[\]()]/', '', $file);
    

    Add/remove more valid characters depending on what is allowed for your system.

    Alternatively you can try to create the file and then return an error if it's bad.
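
    The "try to create it and report an error" approach from the last sentence could look roughly like the sketch below. The function name try_create_file() and the choice of the 'x' mode (fail if the file already exists) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: attempt to create $name inside $dir and report success.
// basename() drops any directory components so the input cannot escape $dir.
function try_create_file(string $dir, string $name): bool
{
    $base = basename($name);
    if ($base === '' || $base === '.' || $base === '..' || strpos($base, "\0") !== false) {
        return false; // nothing usable left, or a null byte that fopen() would reject
    }
    $path = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $base;

    // Mode 'x' creates the file and fails if it already exists;
    // @ suppresses the warning so the caller only sees the boolean result.
    $handle = @fopen($path, 'x');
    if ($handle === false) {
        return false; // the file system rejected the name (or the file exists)
    }
    fclose($handle);
    return true;
}
```

    A false return does not say why creation failed, so a real caller would probably want to distinguish "name rejected" from "file already exists" or "directory not writable".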

  • 2020-11-27 14:09

    This is how you can sanitize a string for a file system, as asked:

    function filter_filename($name) {
        // remove illegal file system characters https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
        $name = str_replace(array_merge(
            array_map('chr', range(0, 31)),
            array('<', '>', ':', '"', '/', '\\', '|', '?', '*')
        ), '', $name);
        // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
        $ext = pathinfo($name, PATHINFO_EXTENSION);
        $name= mb_strcut(pathinfo($name, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($name)) . ($ext ? '.' . $ext : '');
        return $name;
    }
    

    Everything else is allowed in a filesystem, so the question is perfectly answered...

    ... but it could be dangerous to allow, for example, single quotes ' in a filename if you later use it in an unsafe HTML context, because this perfectly legal filename:

     ' onerror= 'alert(document.cookie).jpg
    

    becomes an XSS hole:

    <img src='<? echo $image ?>' />
    // output:
    <img src=' ' onerror= 'alert(document.cookie).jpg' />
    

    Because of that, the popular CMS WordPress removes them, although it took several updates before all relevant characters were covered:

    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", "%", "+", chr(0));
    // ... a few lines later, whitespace is collapsed as well ...
    $filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );
    

    Their list now includes most of the characters that appear in the URI reserved characters and URL unsafe characters lists.

    Of course you could simply encode all these chars on HTML output, but most developers, me included, follow the idiom "Better safe than sorry" and delete them in advance.
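
    For reference, the output-encoding alternative mentioned above could be sketched as follows. The helper name img_tag() is hypothetical; htmlspecialchars() with ENT_QUOTES is the standard PHP call that escapes both single and double quotes.

```php
<?php
// Hypothetical helper: build an <img> tag with the filename escaped for an
// HTML attribute, so quotes in the name cannot close the attribute early.
function img_tag(string $filename): string
{
    // ENT_QUOTES escapes both ' and "; the default flags before PHP 8.1 leave ' alone.
    $safe = htmlspecialchars($filename, ENT_QUOTES, 'UTF-8');
    return "<img src='" . $safe . "' />";
}

echo img_tag(" ' onerror= 'alert(document.cookie).jpg"), "\n";
```

    With the malicious filename from the example, the quotes come out as &#039; and stay inside the src attribute instead of starting a new onerror attribute.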

    So in the end I would suggest using this:

    function filter_filename($filename, $beautify=true) {
        // sanitize filename
        $filename = preg_replace(
            '~
            [<>:"/\\\\|?*]|          # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
            [\x00-\x1F]|             # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
            [\x7F\xA0\xAD]|          # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
            [#\[\]@!$&\'()+,;=]|     # URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
            [{}^\~`]                 # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
            ~x',
            '-', $filename);
        // avoids ".", ".." or ".hiddenFiles"
        $filename = ltrim($filename, '.-');
        // optional beautification
        if ($beautify) $filename = beautify_filename($filename);
        // limit filename length to 255 bytes http://serverfault.com/a/9548/44086
        $ext = pathinfo($filename, PATHINFO_EXTENSION);
        $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - ($ext ? strlen($ext) + 1 : 0), mb_detect_encoding($filename)) . ($ext ? '.' . $ext : '');
        return $filename;
    }
    

    Everything else that does not cause problems with the file system should be part of an additional function:

    function beautify_filename($filename) {
        // reduce consecutive characters
        $filename = preg_replace(array(
            // "file   name.zip" becomes "file-name.zip"
            '/ +/',
            // "file___name.zip" becomes "file-name.zip"
            '/_+/',
            // "file---name.zip" becomes "file-name.zip"
            '/-+/'
        ), '-', $filename);
        $filename = preg_replace(array(
            // "file--.--.-.--name.zip" becomes "file.name.zip"
            '/-*\.-*/',
            // "file...name..zip" becomes "file.name.zip"
            '/\.{2,}/'
        ), '.', $filename);
        // lowercase for windows/unix interoperability http://support.microsoft.com/kb/100625
        $filename = mb_strtolower($filename, mb_detect_encoding($filename));
        // ".file-name.-" becomes "file-name"
        $filename = trim($filename, '.-');
        return $filename;
    }
    

    At this point you need to generate a filename if the result is empty, and you can decide whether you want to encode UTF-8 characters. You do not need to, though, as UTF-8 is allowed in all file systems that are used in web hosting contexts.

    The only thing you have to do is use urlencode() (as you hopefully do with all your URLs), so the filename საბეჭდი_მანქანა.jpg becomes this URL in your <img src> or <a href>: http://www.maxrev.de/html/img/%E1%83%A1%E1%83%90%E1%83%91%E1%83%94%E1%83%AD%E1%83%93%E1%83%98_%E1%83%9B%E1%83%90%E1%83%9C%E1%83%A5%E1%83%90%E1%83%9C%E1%83%90.jpg

    Stack Overflow does that, so I can post this link the way a user would:
    http://www.maxrev.de/html/img/საბეჭდი_მანქანა.jpg

    So this is a completely legal filename, and not the problem that @SequenceDigitale.com described in his answer.
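
    The two follow-up steps described above (generating a name when the sanitized result is empty, and URL-encoding the name for output) could be sketched like this. Both helper names and the 'file' prefix are assumptions for illustration; they are not part of filter_filename().

```php
<?php
// Hypothetical helper: return $sanitized unchanged, or a generated name when
// sanitizing removed every character.
function filename_or_fallback(string $sanitized, string $prefix = 'file'): string
{
    if ($sanitized !== '') {
        return $sanitized;
    }
    // random_bytes() gives an unpredictable suffix; 8 bytes -> 16 hex chars
    return $prefix . '-' . bin2hex(random_bytes(8));
}

// Hypothetical helper: build a link to the file below $baseUrl.
function file_url(string $baseUrl, string $filename): string
{
    // urlencode() percent-encodes the UTF-8 bytes of non-ASCII characters
    return rtrim($baseUrl, '/') . '/' . urlencode($filename);
}

$name = filename_or_fallback('საბეჭდი_მანქანა.jpg');
echo file_url('http://www.maxrev.de/html/img/', $name), "\n";
```

    Note that urlencode() turns a space into '+', which is the query-string convention; if names may contain spaces, rawurlencode() (which emits %20) is probably the better fit for a path segment.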

  • 2020-11-27 14:12

    One way:

    $bad='/[\/:*?"<>|]/';
    $string = 'fi?le*';
    
    function sanitize($str,$pat)
    {
        return preg_replace($pat,"",$str);
    
    }
    echo sanitize($string,$bad);
    
  • 2020-11-27 14:16

    SOLUTION 1 - simple and effective

    $file_name = preg_replace( '/[^a-z0-9]+/', '-', strtolower( $url ) );

    • strtolower() guarantees the filename is lowercase (useful on file systems such as NTFS, where names that differ only in case refer to the same file)
    • [^a-z0-9]+ ensures that the filename keeps only letters and numbers
    • Substituting invalid characters with '-' keeps the filename readable

    Example:

    URL:  http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename
    File: http-stackoverflow-com-questions-2021624-string-sanitizer-for-filename
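
    Wrapped into a function, Solution 1 might look like the sketch below. The name url_to_filename(), the trimming of leading and trailing dashes, and the $maxLength cap are additions made for illustration; only the preg_replace() call comes from the answer.

```php
<?php
// Hypothetical wrapper around the one-liner above; $maxLength is an assumed cap.
function url_to_filename(string $url, int $maxLength = 200): string
{
    // Replace every run of characters outside a-z and 0-9 with a single '-'
    $name = preg_replace('/[^a-z0-9]+/', '-', strtolower($url));
    // Drop the dash that a leading or trailing separator (e.g. a final "/") leaves behind
    $name = trim($name, '-');
    // Cut to the cap, then remove a dash the cut may have exposed at the end
    return rtrim(substr($name, 0, $maxLength), '-');
}

echo url_to_filename('http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename'), "\n";
```

    Different URLs can still collapse to the same name (for example "a/b" and "a-b"), so this suits readable names more than guaranteed-unique ones.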
    

    SOLUTION 2 - for very long URLs

    You want to cache the URL contents and just need to have unique filenames. I would use this function:

    $file_name = md5( strtolower( $url ) );

    This will create a filename with a fixed length. In most cases the MD5 hash is unique enough for this kind of usage.

    Example:

    URL:  https://www.amazon.com/Interstellar-Matthew-McConaughey/dp/B00TU9UFTS/ref=s9_nwrsa_gw_g318_i10_r?_encoding=UTF8&fpl=fresh&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-1&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_r=BS5M1H560SMAR2JDKYX3&pf_rd_t=36701&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_p=6822bacc-d4f0-466d-83a8-2c5e1d703f8e&pf_rd_i=desktop
    File: 51301f3edb513f6543779c3a5433b01c
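
    For the caching use case, the hash can be turned into a full path along these lines. This is only a sketch: cache_path(), the $cacheDir argument and the '.cache' suffix are placeholders, not something the answer specifies.

```php
<?php
// Hypothetical helper: map a URL to a fixed-length cache file path.
// $cacheDir and the '.cache' suffix are placeholders for this sketch.
function cache_path(string $url, string $cacheDir): string
{
    // md5() always returns 32 lowercase hex characters, regardless of URL length
    return rtrim($cacheDir, '/') . '/' . md5(strtolower($url)) . '.cache';
}

echo cache_path('http://stackoverflow.com/questions/2021624/string-sanitizer-for-filename', '/tmp/cache'), "\n";
```

    Because the URL is lowercased before hashing, two URLs that differ only in case share one cache entry, which may or may not be what you want.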
    