I want to create a sane/safe filename (i.e. somewhat readable, no "strange" characters, etc.) from some random Unicode string (which might contain just anything).
(It doesn't matter to me whether the function is Cocoa, ObjC, Python, etc.)
Of course, there might be infinitely many characters which might be strange. Thus, it is not really a solution to have a blacklist and to keep adding to it over time.
I could have a whitelist. However, I don't really know how to define it. [a-zA-Z0-9 .] is a start, but I also want to accept Unicode characters which can be displayed in a normal way.
Python:
"".join([c for c in filename if c.isalpha() or c.isdigit() or c==' ']).rstrip()
This accepts Unicode letters and digits but removes line breaks, punctuation, etc.
Example:
filename = u"ad\nbla'{-+\)(ç?"
gives: adblaç
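As a rough sketch, the one-liner above can be wrapped in a small function. The name `keep_letters_digits_spaces` and the `fallback` parameter are made up for illustration and are not part of the original answer; the fallback only covers the case where every character gets removed.

```python
def keep_letters_digits_spaces(filename, fallback="unnamed"):
    """Keep Unicode letters, digits and spaces; drop everything else."""
    cleaned = "".join(
        c for c in filename if c.isalpha() or c.isdigit() or c == " "
    ).rstrip()
    # Assumption: an empty string is not a usable filename, so use a default.
    return cleaned or fallback


print(keep_letters_digits_spaces("ad\nbla'{-+\\)(ç?"))  # adblaç
print(keep_letters_digits_spaces("\n?*"))               # unnamed
```

Whether an empty result should become a placeholder or raise an error depends on the application; the placeholder is just one option.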
Edit: str.isalnum() does the alphanumeric check in one step (comment from queueoverflow below). danodonovan suggested also keeping the dot.
keepcharacters = (' ','.','_')
"".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
My requirements were conservative (the generated filenames needed to be valid on multiple operating systems, including some ancient mobile OSs). I ended up with:
import re
"".join([c for c in text if re.match(r'\w', c)])
That whitelists the alphanumeric characters (a-z, A-Z, 0-9) and the underscore. Note that this holds for Python 2 byte strings; in Python 3, \w also matches Unicode letters and digits unless the re.ASCII flag is passed. The regular expression can be compiled and cached for efficiency if there are a lot of strings to be matched. For my case, it wouldn't have made any significant difference.
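A minimal sketch of the compiled-and-cached variant mentioned above, assuming Python 3. It swaps the per-character `re.match` for a single substitution with a precompiled pattern; the names `_NON_WORD` and `ascii_word_chars_only` are hypothetical.

```python
import re

# re.ASCII restricts \w (and \W) to [a-zA-Z0-9_]; without it, Python 3
# treats Unicode letters and digits as word characters for str patterns.
_NON_WORD = re.compile(r"\W", re.ASCII)


def ascii_word_chars_only(text):
    """Drop every character that is not a-z, A-Z, 0-9 or underscore."""
    return _NON_WORD.sub("", text)


print(ascii_word_chars_only("naïve file-name_01.txt"))  # navefilename_01txt
```

The pattern is compiled once at import time, so repeated calls only pay for the substitution itself.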
There are a few reasonable answers here, but in my case I want to take a string which might have spaces and punctuation and, rather than just removing those characters, replace them with underscores. Even though spaces are allowed in filenames on most OSes, they are problematic. Also, in my case, if the original string contained a period I didn't want it to pass through into the filename, or it would generate "extra extensions" that I might not want (I'm appending the extension myself).
def make_safe_filename(s):
    def safe_char(c):
        # keep letters and digits, replace everything else with an underscore
        if c.isalnum():
            return c
        else:
            return "_"
    return "".join(safe_char(c) for c in s).rstrip("_")
print(make_safe_filename( "hello you crazy $#^#& 2579 people!!! : die!!!" ) + ".gif")
prints (the trailing underscores are removed by rstrip("_")):
hello_you_crazy_______2579_people______die.gif
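If the long runs of underscores are unwanted, one possible variation (not part of the original answer) is to replace each run of non-alphanumeric characters with a single underscore. The function name `make_compact_filename` is hypothetical. Unlike the version above, this sketch also strips leading underscores.

```python
import re


def make_compact_filename(s):
    """Replace each run of non-alphanumeric characters with one underscore."""
    # [\W_]+ matches one or more characters that are neither letters nor
    # digits (\W does not cover the underscore, so it is listed explicitly).
    return re.sub(r"[\W_]+", "_", s).strip("_")


print(make_compact_filename("hello you crazy $#^#& 2579 people!!! : die!!!") + ".gif")
# hello_you_crazy_2579_people_die.gif
```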
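More or less what has been mentioned here with regexp, but in reverse (replace anything NOT listed). The output below is from Python 2; in Python 3, \w is Unicode-aware by default, so the ç would be kept: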
>>> import re
>>> filename = u"ad\nbla'{-+\)(ç1?"
>>> re.sub(r'[^\w\d-]','_',filename)
u'ad_bla__-_____1_'
No solutions here, only problems that you must consider:
- What is the smallest maximum filename length you have to support? (E.g. DOS supports only 8-11 characters; most OSes don't support more than 256 characters.)
- What filenames are forbidden in some context? (Windows still doesn't support saving a file as CON.TXT; see https://blogs.msdn.microsoft.com/oldnewthing/20031022-00/?p=42073)
- Remember that . and .. have specific meanings (current/parent directory) and are therefore unsafe.
- Is there a risk that filenames will collide, either due to removal of characters or the same filename being used multiple times?
Consider just hashing the data and using the hex digest of that as a filename?
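The points above could be handled roughly as follows. This is only a sketch under stated assumptions: the helper names `finalize_filename` and `hashed_filename` are made up, the 64-character limit is an arbitrary example value, and the reserved-name set covers only the well-known Windows device names.

```python
import hashlib

# Device names reserved on Windows, with or without an extension (e.g. CON.TXT).
WINDOWS_RESERVED = {"CON", "PRN", "AUX", "NUL"} | {
    f"{prefix}{n}" for prefix in ("COM", "LPT") for n in range(1, 10)
}


def finalize_filename(name, max_length=64):
    """Post-process an already sanitized name (hypothetical helper)."""
    stem = name.split(".", 1)[0].upper()
    # Prefix names that are empty, ".", "..", or a reserved device name.
    if name in ("", ".", "..") or stem in WINDOWS_RESERVED:
        name = "_" + name
    # Assumption: plain truncation is acceptable; it may cut off an extension.
    return name[:max_length]


def hashed_filename(data, extension=""):
    """Name a file after the SHA-256 hex digest of its content (bytes)."""
    return hashlib.sha256(data).hexdigest() + extension


print(finalize_filename("CON.TXT"))  # _CON.TXT
print(finalize_filename(".."))       # _..
print(hashed_filename(b"some file content", ".bin"))
```

Hashing sidesteps the character and reserved-name questions entirely and makes collisions depend on content rather than on the original name, at the cost of readability.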
Python:
for c in r'[]/\;,><&*:%=+@!#^()|?^':
    filename = filename.replace(c, '')
(just an example of characters you will want to remove)
The r in front of the string makes sure the string is interpreted in its raw format, allowing you to remove the backslash \ as well.
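Edit: regex solution in Python:
import re
# the character class needs a closing ], and the backslash must be written as \\ to be matched literally
filename = re.sub(r'[\[\]/\\;,><&*:%=+@!#^()|?]', '', filename)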
Source: https://stackoverflow.com/questions/7406102/create-sane-safe-filename-from-any-unsafe-string