Regex - Extracting volume and chapter numbers from book titles

Submitted by 六眼飞鱼酱① on 2020-01-15 06:33:10

Question


Hey,
I'm trying to import some legacy data into a brand-new system. It's almost done, but there's a huge problem! Assume data like this:

Blabla Vol.1 chapter 2
ABCD in the era of XYZ volume 2 First Chapter  
A really useless book Eighth vol  
Blala Sixth Vol Chapter 5  
Lablah V6C7 2002  
FooBar Vol6 C3 by Dr. Foo Bar
Regex: A tool in Hell V1 Eleventh Chapter

I'm confused! I tried to write a regex to extract the volume and chapter numbers, but, you know, it's regex! Can anyone please guide me through this?


Answer 1:


Here is a regular expression that will match your examples:

/^.+?(?|(?:\bVol\.?|\bvolume[ ]+|V)(\d+)|[ ]+([a-z]+)[ ]+vol\b).*?(?:(?|(?:C|chapter[ ]+)(\d+)|[ ]+([a-z]+)[ ]+Chapter\b).*?)?$/im

You can live-edit the regex and/or add tests here.

In this link:

  • element [0] of the result array holds the full matches
  • element [1] holds the volumes
  • element [2] holds the chapters

  • I assumed that the volume always comes before the chapter, as in your examples.
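
For readers who are not on PCRE, here is a rough Python port of that pattern. It is only a sketch under a few assumptions: Python's `re` module has no branch-reset group `(?|...)`, so each alternative gets its own capture group and the two are merged afterwards; the pattern is written with `.*?` between the volume and chapter parts and an escaped dot after `Vol`; and the names `TITLE_RE` and `extract` are made up for illustration.

```python
import re

# Python's `re` has no branch-reset group, so groups 1/2 both mean "volume"
# and groups 3/4 both mean "chapter"; extract() merges each pair.
TITLE_RE = re.compile(
    r"^.+?"
    r"(?:(?:\bVol\.?|\bvolume[ ]+|V)(\d+)|[ ]+([a-z]+)[ ]+vol\b)"
    r".*?"
    r"(?:(?:(?:C|chapter[ ]+)(\d+)|[ ]+([a-z]+)[ ]+Chapter\b).*?)?$",
    re.IGNORECASE,
)


def extract(title):
    """Return (volume, chapter) as raw strings, or None if nothing matches.

    The chapter is None when the title only carries a volume.
    """
    m = TITLE_RE.match(title.strip())
    if m is None:
        return None
    volume = m.group(1) or m.group(2)
    chapter = m.group(3) or m.group(4)
    return volume, chapter


TITLES = [
    "Blabla Vol.1 chapter 2",
    "ABCD in the era of XYZ volume 2 First Chapter",
    "A really useless book Eighth vol",
    "Blala Sixth Vol Chapter 5",
    "Lablah V6C7 2002",
    "FooBar Vol6 C3 by Dr. Foo Bar",
    "Regex: A tool in Hell V1 Eleventh Chapter",
]

for title in TITLES:
    print(title, "->", extract(title))
```

Ordinal words such as "Eighth" come back as text, so they still need a separate conversion step if the new system wants integers.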


Answer 2:


In my opinion, it is always best to break this into separate steps. In the first pass, you might convert the titles matching the pattern "/Vol\.[0-9]+\s+chapter\s[0-9]+$/i". In the second pass, you might convert the titles matching the pattern "/[a-z]+(th|nd|st|rd)\svol/i". And so on.

Trying to write one regular expression to capture all of these cases usually does not end well, and the result is almost always buggy. Here's an interesting article I found the other day detailing the perils of overly complex regexing.
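
A minimal sketch of that multi-pass idea in Python, assuming the seven titles from the question are representative. The pattern lists and the `first_match` helper are hypothetical names, and the individual patterns are my own guesses at reasonable passes rather than the exact ones quoted above.

```python
import re

# Each list is tried in order; the first pattern that hits wins.
VOLUME_PATTERNS = [
    re.compile(r"\bvol(?:ume)?\.?\s*(\d+)", re.I),                   # "Vol.1", "Vol6", "volume 2"
    re.compile(r"\bV(\d+)", re.I),                                   # "V6C7", "V1"
    re.compile(r"\b([a-z]+(?:th|nd|st|rd))\s+vol(?:ume)?\b", re.I),  # "Eighth vol"
]
CHAPTER_PATTERNS = [
    re.compile(r"\bchapter\s*(\d+)", re.I),                          # "chapter 2"
    re.compile(r"(?<![a-z])C(\d+)", re.I),                           # "C3", and the "C7" in "V6C7"
    re.compile(r"\b([a-z]+(?:th|nd|st|rd))\s+chapter\b", re.I),      # "First Chapter"
]


def first_match(patterns, title):
    """Return group 1 of the first pattern that matches, else None."""
    for pattern in patterns:
        m = pattern.search(title)
        if m:
            return m.group(1)
    return None


def extract(title):
    """Run the volume passes and the chapter passes independently."""
    return first_match(VOLUME_PATTERNS, title), first_match(CHAPTER_PATTERNS, title)


print(extract("Lablah V6C7 2002"))
print(extract("Regex: A tool in Hell V1 Eleventh Chapter"))
```

Because each pass is small, a new title format can be handled by appending one pattern and one test case instead of reworking a single large expression.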




Answer 3:


As these expressions are not "regular" at all, doing this with a single regular expression will be difficult. If there is a finite set of "ways" in which the chapter and volume are displayed, then you could use multiple regular expressions to attempt to extract that information.

Or if you can define some rules such as "the chapter number is always in the format [chapter #]", then that would also help!
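
One way to picture such a rule: once a volume and chapter token have been pulled out by whatever means, they can be normalized into a single fixed format that a strict pattern can parse back. The sketch below is an assumption-laden illustration; the `[volume N] [chapter M]` format, the `ORDINALS` table (first through twelfth only), and the function names are all invented here.

```python
import re

# Hypothetical lookup table; extend it if the legacy data goes beyond "twelfth".
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4,
    "fifth": 5, "sixth": 6, "seventh": 7, "eighth": 8,
    "ninth": 9, "tenth": 10, "eleventh": 11, "twelfth": 12,
}


def to_number(token):
    """Convert '7' or 'Seventh' to 7; raise ValueError for anything else."""
    token = token.strip().lower()
    if token.isdigit():
        return int(token)
    if token in ORDINALS:
        return ORDINALS[token]
    raise ValueError(f"unrecognized volume/chapter token: {token!r}")


def canonical(volume, chapter=None):
    """Render tokens in one fixed format, e.g. '[volume 6] [chapter 5]'."""
    parts = [f"[volume {to_number(volume)}]"]
    if chapter is not None:
        parts.append(f"[chapter {to_number(chapter)}]")
    return " ".join(parts)


# With a fixed format, the parsing side needs only one simple, strict pattern.
CANONICAL_RE = re.compile(r"\[volume (\d+)\](?: \[chapter (\d+)\])?")

print(canonical("Sixth", "5"))
print(CANONICAL_RE.fullmatch(canonical("1", "Eleventh")).groups())
```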




Answer 4:


If the output always has the same things on the same lines, the first thing I would do is explode("\n", $data) and work with the correct line. If the format is consistent, you could then match for

'/ (.*) Vol Chapter ([0-9]*)/'

or something.
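
The same split-then-match idea, sketched in Python for comparison: `str.splitlines()` plays the role of `explode("\n", $data)`. The sample `data` string and the `match_lines` helper are assumptions for illustration; the pattern is the one quoted above, so it only hits lines of the exact shape "<word> Vol Chapter <digits>".

```python
import re

# The pattern from this answer, without the PHP delimiters.
LINE_RE = re.compile(r" (.*) Vol Chapter ([0-9]*)")


def match_lines(data):
    """Split the input into lines and try the pattern on each one.

    Returns one entry per line: a (volume, chapter) tuple or None.
    """
    results = []
    for line in data.splitlines():
        m = LINE_RE.search(line)
        results.append(m.groups() if m else None)
    return results


# Hypothetical input: two of the question's titles, one per line.
data = "Blala Sixth Vol Chapter 5\nFooBar Vol6 C3 by Dr. Foo Bar\n"
print(match_lines(data))
```

Lines that come back as None are the ones that need a different pattern.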

BTW, this page has always helped me with regex testing: http://www.quanetic.com/Regex



Source: https://stackoverflow.com/questions/5373513/regex-extracting-volume-and-chapter-numbers-from-book-titles
