Why would regex to separate filename from extension not work in ColdFusion?

Question

I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function: REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );

I would like this to return an array like ["Doe, John 8.15.2012","docx"], but instead I always get an array with one element, the entire filename: ["Doe, John 8.15.2012.docx"]

I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?

Does ColdFusion use a different syntax? Or am I doing something wrong?

Thanks.

Answer 1:

Why you're not getting expected results...

You are getting a one-item array containing the whole filename because your pattern matches the entire filename, and it matches exactly once.

It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.

How to solve the problem...

If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...

ListLast( filename , '.' )

...to get only the file extension. To get the name without the extension, you can do...

rematch( '.+(?=\.[^.]+$)' , filename )

This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
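Put together, the two calls might be used like this. This is only a sketch: the variable names are illustrative, and it assumes the filename has a real extension (the "simple files" case).

<cfscript>
    filename = "Doe, John 8.15.2012.docx";

    // Everything after the last dot.
    extension = listLast( filename, "." );

    // reMatch always returns an array of matches; it is empty when nothing matches
    // (for example, when the filename contains no dot at all).
    matches = reMatch( ".+(?=\.[^.]+$)", filename );

    baseName = "";
    if ( arrayLen( matches ) GT 0 ) {
        baseName = matches[ 1 ];
    }

    writeDump( [ baseName, extension ] );
</cfscript>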

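To deal with non-extensioned files (e.g. .htaccess or README) you can modify the above regex to ^.+?(?=(?:\.[^.]+)?$), which does the same thing except that the extension is optional. Note that the .+ has to become lazy (.+?) and be anchored with ^ here: once the extension is optional, a greedy .+ would consume the whole filename and the lookahead would still succeed at the end of the string. However, there isn't a trivial way to update the ListLast method for these (I guess you'd need to check len(extension) LT len(filename)-1 or similar).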
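The length check suggested above could be sketched as follows. getExtension is a hypothetical helper name, not a built-in function, and the behaviour for unusual names (trailing dots, several leading dots) has not been considered carefully.

<cfscript>
    // Hypothetical helper: returns the extension, or an empty string when there is none.
    function getExtension( required string filename ) {
        var extension = listLast( arguments.filename, "." );

        // For "README", listLast returns the whole name; for ".htaccess" it returns
        // everything except the leading dot. In both cases the candidate is too long
        // to be a real extension, so the check below rejects it.
        if ( len( extension ) LT len( arguments.filename ) - 1 ) {
            return extension;
        }

        return "";
    }

    writeDump( getExtension( "Doe, John 8.15.2012.docx" ) );
    writeDump( getExtension( ".htaccess" ) );
    writeDump( getExtension( "README" ) );
</cfscript>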

(optional) Accessing captured groups...

If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
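As a rough sketch of that native approach, using the pattern from the question (the variable names are illustrative):

<cfscript>
    filename = "Doe, John 8.15.2012.docx";

    // Fourth argument true: return a struct holding pos and len arrays
    // rather than just the position of the match.
    found = reFind( "(.+?)(\.[^.]*$|$)", filename, 1, true );

    // Element 1 describes the whole match; elements 2 onwards describe the captured groups.
    groups = [];
    for ( i = 2; i LTE arrayLen( found.pos ); i++ ) {
        // Only call mid() for non-empty groups, so an empty or unused group
        // never reaches it with an unusable position.
        if ( found.len[ i ] GT 0 ) {
            arrayAppend( groups, mid( filename, found.pos[ i ], found.len[ i ] ) );
        } else {
            arrayAppend( groups, "" );
        }
    }

    writeDump( groups );
</cfscript>

Note that with this pattern the second group includes the leading dot, because the \. sits inside the parentheses.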

For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).

If you wanted to use cfRegex, you can do so with your original pattern like so:

RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )

Or with named arguments:

RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )

This returns an array of matches, where each element is itself an array of the captured groups for that match.

If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.

If all you care about is getting the extension and/or the filename without the extension, then the earlier examples are sufficient.

Answer 2:

@Peter's response is great; however, the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.

<cfscript>
    param name="URL.filename";

    sRegex = "^.+?(?=(?:\.[^.]+?)?$)";

    aMatch = reMatch(sRegex, URL.filename);

    writeDump(aMatch);
</cfscript>

This works on the following filename patterns:

foo.bar
foo
.htaccess
John 8.15.2012.docx

Explanation of the regex:

^ From the beginning of the string

.+? One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.

(?=) Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.

(?: Group this stuff together, but don't remember it for a back reference.

\. A literal dot (escaped, because an unescaped . matches any character). This is the separator between file name and file extension.

[^.]+? One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.

? (This is the one after the (?:) group.) Zero or one of those groups, i.e. zero or one file extensions.

$ To the end of the string

I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to fine-tune it.
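For anyone wanting to repeat that check without passing each name on the URL, a small loop along these lines should do. It is only a sketch, and the variable names are illustrative.

<cfscript>
    sRegex = "^.+?(?=(?:\.[^.]+?)?$)";

    // The four file name patterns listed above.
    aNames = [ "foo.bar", "foo", ".htaccess", "John 8.15.2012.docx" ];

    // Run the regex against each name and keep the array that reMatch returns,
    // so the dump shows the match (if any) for every input.
    aResults = [];
    for ( i = 1; i LTE arrayLen( aNames ); i++ ) {
        arrayAppend( aResults, reMatch( sRegex, aNames[ i ] ) );
    }

    writeDump( aResults );
</cfscript>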

Answer 3:

A few more ways of achieving the same result. They all execute in roughly the same amount of time.

<cfscript>
str = 'Doe, John 8.15.2012.docx';

// sans regex
arr1 = [
    reverse( listRest( reverse( str ), '.' ) ),
    listLast( str, '.' )
];

// using Java String lastIndexOf(); assumes str contains a dot (lastIndexOf returns -1 otherwise, and substring( 0, -1 ) throws)
arr2 = [
    str.substring( 0, str.lastIndexOf( '.' ) ),
    str.substring( str.lastIndexOf( '.' ) + 1 )
];

// using listToArray with non-filename safe character replace
arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>

Source: https://stackoverflow.com/questions/11302267/why-would-regex-to-separate-filename-from-extension-not-work-in-coldfusion

Tags

regex

coldfusion

coldfusion-9