Question
I've been working on a regular expression to grab the TV show or movie name, the year it aired (if it exists), the season number, and the episode number from the file name of a video. I have a regular expression (below) that seems to work well for names with two years (one is part of the show/movie name, the other is the year it aired), for both movies and TV shows. For TV shows it is able to grab the season and episode numbers if the format is SXXEXX or XXX. I've been testing it in the regex101.com test engine. Where I'm struggling is that the expression won't return anything if the file name contains no year. Also, if the file name has a 4-digit number that's actually part of the show name (e.g. "The 4400"), it treats that number as the aired year. How can I modify this expression to handle the extra conditions I described?
The end goal is to put this into a Python script that queries a site like TheTVDB.com to find out whether the file is a movie or a TV show, so that I can sort my vast video library into TV Show and Movies folders.
(?P<ShowName>.*)[ (_.]#Show Name
(?=19[0-9]\d|20[0-4]\d|2050) #If after the show name is a year
(?P<ShowYear>\d{4,4}) # Get the show year
| # Else
(?=S\d{1,2}E\d{1,2})
S(?P<Season>\d{1,2})E(?P<Episode>\d{1,2}) #Get the season and Episode information
|
(\d{1})E(\d{1,2})
Here is the test data I'm using:
- archer.2009.S04E13
- space 1999 1975
- Space: 1999 (1975)
- Space.1999.1975.S01E01
- space 1999.(1975)
- The.4400.204.mkv
- space 1999 (1975) v.2009.S01E13.the.title.avi
- Teen.wolf.S04E12.HDTV.x264
- Se7en.(1995).avi
- How to train your dragon 2
The regular expression does not work properly with the following test data:
- The.4400.204.mkv
- Teen.wolf.S04E12.HDTV.x264
- How to train your dragon 2
Update: Here is the new expression based on the comments. It works much better but is still struggling with the file names listed below the expression.
(?P<ShowName>.*)#Show Name
(
[ (_.]
(
(?=\d{4,4}) #If after the show name is a year
(?P<ShowYear>\d{4}) # Get the show year
| # Else no year in the file name then just grab the name
(?P<otherShowName>.*) # Grab Show Name
(?=S\d{1,2}E\d{1,2}) # If the Season Episode patterns matches SX{1,2}EX{1,2}, Then
S(?P<Season>\d{1,2})E(?P<Episode>\d{1,2}) #Get the season and Episode information
| # Else
(?P<Alt_S_E>\d{3,4}) # Get the season and Episode that looks like 211
)
|$)
- Se7en
- 10,000BC (2010)
- v.2009.S01E13.the.title.avi
- archer.2009.S04E13
Answer 1:
I made some modifications to your regex, and it seems to work, if I understood you correctly.
^(
(?P<ShowNameA>.*[^ (_.]) # Show name
[ (_.]+
( # Year with possible Season and Episode
(?P<ShowYearA>\d{4})
([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
| # Season and Episode only
(?<!\d{4}[ (_.])
S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
| # Alternate format for episode
(?P<EpisodeC>\d{3})
)
|
# Show name with no other information
(?P<ShowNameB>.+)
)
See demo on regex101
EDIT: I've updated the regex to handle those last 3 situations you mentioned in comments.
One main problem was that you had no parens around the main alternation, so the alternation applied to the whole regex. I also had to add an alternation to allow for the case where none of the year/episode formats follow the name.
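To illustrate the missing-parentheses problem, here is a minimal Python sketch. The two patterns are simplified stand-ins invented for this example (they are not the expressions from the question), and the file name is one of the test names. Without a group, the top-level | splits the entire pattern, so the show name is only captured on the year branch.

```python
import re

# Simplified, hypothetical patterns used only to show alternation precedence.
# Ungrouped: "|" splits the WHOLE pattern into "name.year" OR "SxxExx".
ungrouped = re.compile(
    r"(?P<name>.*)\.(?P<year>\d{4})|S(?P<season>\d{2})E(?P<episode>\d{2})"
)

# Grouped: the alternation is confined to what follows the name.
grouped = re.compile(
    r"(?P<name>.*)\.(?:(?P<year>\d{4})|S(?P<season>\d{2})E(?P<episode>\d{2}))"
)

filename = "Teen.wolf.S04E12"

m_ungrouped = ungrouped.search(filename)
m_grouped = grouped.search(filename)

# The ungrouped pattern finds the episode but loses the name.
print(m_ungrouped.groupdict())
# The grouped pattern keeps the name alongside the episode.
print(m_grouped.groupdict())
```

With the grouped version the name is captured whichever branch matches, which is roughly what the outer parentheses in the regex above achieve.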
Because you have so many different possible layouts that possibly conflict with each other, the regex ended up being lots of alternation of different scenarios. For example, to match a title that has no year or episode information at all, I had to add an alternation around the whole regex that if it can't find any known pattern, just match the whole thing.
Note: now that you seem to have expanded show years to match any four digits, there's no need for the lookahead. In other words, (?=\d{4,4})(?P<ShowYear>\d{4}) is the same as (?P<ShowYear>\d{4}). This also means that your alternate format for episode must match 3 digits only, not 4. Otherwise, there's no way to distinguish a stand-alone 4-digit sequence as a year or episode.
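A quick way to convince yourself of the lookahead point is to run both forms over a few of the test names in Python. This is only a sketch; the helper name first_year and the use of re.search are assumptions made for the illustration.

```python
import re

# The form from the updated question and the simplified form from this note.
with_lookahead = re.compile(r"(?=\d{4})(?P<ShowYear>\d{4})")
without_lookahead = re.compile(r"(?P<ShowYear>\d{4})")

samples = [
    "archer.2009.S04E13",
    "The.4400.204.mkv",
    "Space.1999.1975.S01E01",
    "Se7en",
]

def first_year(pattern, text):
    """Return the first four-digit run the pattern finds, or None."""
    m = pattern.search(text)
    return m.group("ShowYear") if m else None

# Each pair holds (result with lookahead, result without lookahead).
results = [(first_year(with_lookahead, s), first_year(without_lookahead, s))
           for s in samples]

for sample, pair in zip(samples, results):
    print(sample, pair)
```

Both forms give the same answer on every sample. Note that a bare \d{4} also picks up 4400 and 1999, so telling a year apart from a number in the title has to come from the rest of the pattern.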
General pattern:
- [ (_.]+ : the delimiter used throughout
- (?P<ShowNameA>.*[^ (_.]) : the show name, greedy but not ending in a delimiter
- (?P<ShowNameB>.+) : the show name when it's the whole line
Format A (Year with possible Season and Episode):
(?P<ShowYearA>\d{4})
([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
Format B (Season and Episode only):
(?<!\d{4}[ (_.])
S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
Format C (Alternate format for episode):
(?P<EpisodeC>\d{3})
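Since the question says the expression is meant to end up in a Python script, the pieces above can be tried there directly. The following is a minimal sketch, not a definitive implementation: the names SHOW_RE and parse_filename are made up for this example, and splitting the three-digit form into one season digit plus two episode digits is an assumption based on the question's "211" example.

```python
import re

# The pattern from this answer, compiled in verbose mode so the
# layout and comments can stay as written.
SHOW_RE = re.compile(r"""
^(
  (?P<ShowNameA>.*[^ (_.])            # Show name
  [ (_.]+
  (                                   # Year with possible Season and Episode
    (?P<ShowYearA>\d{4})
    ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
  |                                   # Season and Episode only
    (?<!\d{4}[ (_.])
    S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
  |                                   # Alternate format for episode
    (?P<EpisodeC>\d{3})
  )
|
  (?P<ShowNameB>.+)                   # Show name with no other information
)
""", re.VERBOSE)


def parse_filename(name):
    """Merge the A/B/C groups into one dict (hypothetical helper)."""
    m = SHOW_RE.match(name)
    if m is None:
        return None
    g = m.groupdict()
    season = g["SeasonA"] or g["SeasonB"]
    episode = g["EpisodeA"] or g["EpisodeB"]
    if g["EpisodeC"]:
        # Assumption: "204" means season 2, episode 04.
        season, episode = g["EpisodeC"][0], g["EpisodeC"][1:]
    return {
        "name": g["ShowNameA"] or g["ShowNameB"],
        "year": g["ShowYearA"],
        "season": season,
        "episode": episode,
    }


for sample in ["archer.2009.S04E13", "Teen.wolf.S04E12.HDTV.x264",
               "The.4400.204.mkv", "Se7en.(1995).avi",
               "How to train your dragon 2"]:
    print(sample, "->", parse_filename(sample))
```

Because Python does not allow the same group name twice in one pattern, the A/B suffixes are kept and merged after matching. The captured name still contains the original delimiters (for example "Teen.wolf"), so it would need normalising before being sent to a lookup service.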
Answer 2:
If I may, I adapted Brian's regex to match something like:
SHOW.NAME.201X.SXXEXX.XSUB.VOSTFR.720p.HDTV.x264-ADDiCTiON.mkv
Here it is (PHP PCRE):
/^(
(?P<ShowNameA>.*[^ (_.]) # Show name
[ (_.]+
( # Year with possible Season and Episode
(?P<ShowYearA>\d{4})
([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
| # Season and Episode only
(?<!\d{4}[ (_.])
S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
)
|
# Show name with no other information
(?P<ShowNameB>.+)
)/mx
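For readers following the question's Python goal, the same pattern can be written with the re module, where the PCRE /m and /x modifiers correspond to re.MULTILINE and re.VERBOSE. This is a sketch under assumptions: RELEASE_RE and the two-line listing are invented for the example, and the first file name is a made-up concrete instance of the placeholder shown above.

```python
import re

# Python equivalent of the /mx pattern: MULTILINE lets ^ match at the
# start of every line, VERBOSE allows the layout and comments.
RELEASE_RE = re.compile(r"""
^(
  (?P<ShowNameA>.*[^ (_.])   # Show name
  [ (_.]+
  (                          # Year with possible Season and Episode
    (?P<ShowYearA>\d{4})
    ([ (_.]+S(?P<SeasonA>\d{1,2})E(?P<EpisodeA>\d{1,2}))?
  |                          # Season and Episode only
    (?<!\d{4}[ (_.])
    S(?P<SeasonB>\d{1,2})E(?P<EpisodeB>\d{1,2})
  )
|
  (?P<ShowNameB>.+)          # Show name with no other information
)
""", re.MULTILINE | re.VERBOSE)

# Hypothetical listing with one file name per line.
listing = (
    "Show.Name.2014.S01E02.VOSTFR.720p.HDTV.x264-ADDiCTiON.mkv\n"
    "Teen.wolf.S04E12.HDTV.x264"
)

# One (name, year, season, episode) tuple per line of the listing.
parsed = [
    (
        m.group("ShowNameA") or m.group("ShowNameB"),
        m.group("ShowYearA"),
        m.group("SeasonA") or m.group("SeasonB"),
        m.group("EpisodeA") or m.group("EpisodeB"),
    )
    for m in RELEASE_RE.finditer(listing)
]

for row in parsed:
    print(row)
```

One visible difference from the first answer is that the three-digit episode alternative is gone, so a token such as 720p is not picked up as an episode number.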
Source: https://stackoverflow.com/questions/25807795/matching-tv-and-movie-file-names-with-regex