RegEx for parsing chemical formulas

生来不讨喜 2020-12-03 16:05

I need a way to separate a chemical formula into its components. The result should look like this:

   Ag3PO4 -> [         


        
4 Answers
  • 2020-12-03 16:33

    (PO4)2 really stands apart from the rest.

    Let's start simple and match the items without parentheses:

    [A-Z][a-z]?\d*
    

    Using the regex above, we can successfully parse Ag3PO4, H2O, and CH3OOH.
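
    A quick way to check this is a minimal Python sketch. The choice of Python and the helper name split_simple are my own assumptions, not part of the answer; only the pattern comes from it.

```python
import re

# Pattern from the answer: one uppercase letter, an optional lowercase
# letter, then an optional count.
ELEMENT = re.compile(r"[A-Z][a-z]?\d*")


def split_simple(formula):
    """Return the element tokens of a formula without parentheses.

    split_simple is a hypothetical helper name used for illustration.
    """
    return ELEMENT.findall(formula)


print(split_simple("Ag3PO4"))  # ['Ag3', 'P', 'O4']
print(split_simple("H2O"))     # ['H2', 'O']
print(split_simple("CH3OOH"))  # ['C', 'H3', 'O', 'O', 'H']
```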

    Then we need to add an expression for a parenthesized group. A group by itself can be matched using:

    \(.*?\)\d+
    

    So we combine the two with an "or" condition (alternation):

    [A-Z][a-z]?\d*|\(.*?\)\d+
    

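    This works for the given cases, but maybe you have more samples.

    Note: it will have problems with nested parentheses, e.g. Co3(Fe(CN)6)2. The lazy .*? stops at the first closing parenthesis that is followed by a digit, so the group is cut off after (Fe(CN)6.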
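
    To see both the working case and the nested-parenthesis problem, here is a small Python sketch. Ca3(PO4)2 is my own sample formula, built around the (PO4)2 group mentioned above, and the name FLAT is hypothetical.

```python
import re

# Combined pattern from the answer: a plain element, or a lazily
# matched parenthesised group followed by a multiplier.
FLAT = re.compile(r"[A-Z][a-z]?\d*|\(.*?\)\d+")

# A single, non-nested group is kept together with its multiplier.
print(FLAT.findall("Ca3(PO4)2"))      # ['Ca3', '(PO4)2']

# With nesting, the group ends at the first ")" followed by a digit,
# so the outer ")2" is lost.
print(FLAT.findall("Co3(Fe(CN)6)2"))  # ['Co3', '(Fe(CN)6']
```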

    If you want to handle that case, you can use the following regex. It relies on a variable-length lookbehind, which not every regex engine supports:

    [A-Z][a-z]?\d*|(?<!\([^)]*)\(.*\)\d+(?![^(]*\))
    

    For Objective-C you can use the expression without lookarounds:

    [A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+
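
    The lookaround-free expression is also the one to reach for in Python, because the built-in re module rejects lookbehinds that are not fixed-width. The sketch below is only an illustration under that assumption of language; the names LOOKAROUND, ONE_LEVEL and compiles are hypothetical.

```python
import re

# Both patterns are copied from the answer.
LOOKAROUND = r"[A-Z][a-z]?\d*|(?<!\([^)]*)\(.*\)\d+(?![^(]*\))"
ONE_LEVEL = r"[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+"


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(LOOKAROUND))  # False: the lookbehind is not fixed-width
print(compiles(ONE_LEVEL))   # True

# The nested group is now returned as one token.
print(re.findall(ONE_LEVEL, "Co3(Fe(CN)6)2"))  # ['Co3', '(Fe(CN)6)2']
```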
    

    Or a regex with repetition (I don't know of any such formulas, but in case there is something like A(B(CD)3E(FG)4)5, with multiple parenthesized blocks inside one):

    [A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+
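
    As a rough check of the repetition variant, this Python sketch runs it on the made-up formula from the text and on the earlier nested example. The name NESTED is hypothetical.

```python
import re

# Pattern from the answer: the inner "text, optional nested block, text"
# unit may repeat inside the outer parentheses.
NESTED = re.compile(r"[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+")

print(NESTED.findall("A(B(CD)3E(FG)4)5"))  # ['A', '(B(CD)3E(FG)4)5']
print(NESTED.findall("Co3(Fe(CN)6)2"))     # ['Co3', '(Fe(CN)6)2']
```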
    

  • 2020-12-03 16:36

    This should just about work:

    /(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)/g
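
    The /.../g form above is a JavaScript-style regex literal. To show what the six capture groups yield, here is a Python sketch in which findall plays the role of the g flag. Ca3(PO4)2 is my own sample input and the name TOKEN is hypothetical.

```python
import re

# Same pattern as above, without the JavaScript delimiters and flag.
TOKEN = re.compile(r"(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)")

# Each tuple is: opening parenthesis, uppercase letter, lowercase
# letters, element count, closing parenthesis, group multiplier.
# Groups that did not take part in a match come back as ''.
rows = TOKEN.findall("Ca3(PO4)2")
for row in rows:
    print(row)
# ('', 'C', 'a', '3', '', '')
# ('(', 'P', '', '', '', '')
# ('', 'O', '', '4', ')', '2')
```

    Note that this only flags where a parenthesis opens or closes on an element; pairing the two and applying the multiplier to the whole group is left to the calling code.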
    

    Play around with it here: http://refiddle.com/

  • 2020-12-03 16:43

    This pattern should work, depending on your regex engine (it relies on the recursion construct (?R), which PCRE supports):
    ([A-Z][a-z]*\d*)|(\((?:[^()]+|(?R))*\)\d*) with the g and m options
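
    Python's built-in re module has no recursion construct, so the pattern above cannot be used there as written. As a substitute technique (not the recursive regex itself), here is a sketch of a hand-written scanner that counts parenthesis depth to find the end of each top-level group. The function name split_top_level and the error messages are my own assumptions.

```python
import re

ELEMENT = re.compile(r"[A-Z][a-z]*\d*")  # element part of the pattern above
COUNT = re.compile(r"\d*")               # optional multiplier after a group


def split_top_level(formula):
    """Split a formula into top-level elements and parenthesised groups."""
    tokens = []
    i = 0
    while i < len(formula):
        if formula[i] == "(":
            # Walk forward until the matching closing parenthesis.
            depth = 0
            j = i
            while j < len(formula):
                if formula[j] == "(":
                    depth += 1
                elif formula[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j == len(formula):
                raise ValueError("unbalanced parentheses in %r" % formula)
            # Attach the optional multiplier that follows the group.
            end = COUNT.match(formula, j + 1).end()
            tokens.append(formula[i:end])
            i = end
        else:
            m = ELEMENT.match(formula, i)
            if m is None:
                raise ValueError("unexpected character at index %d" % i)
            tokens.append(m.group())
            i = m.end()
    return tokens


print(split_top_level("Co3(Fe(CN)6)2"))     # ['Co3', '(Fe(CN)6)2']
print(split_top_level("(A(B)2)3(C(D)4)5"))  # ['(A(B)2)3', '(C(D)4)5']
```

    Unlike the regex-only approaches, this handles any nesting depth and reports unbalanced input instead of silently skipping it.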

  • 2020-12-03 16:47

    When you encounter a parenthesized group, you don't want to parse what's inside, right?

    If there are no nested parenthesized groups, you can simply use:

    [A-Z][a-z]*\d*|\([^)]+\)\d*
    

    \d is a shortcut for [0-9], and [^)] means any character except a closing parenthesis.
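
    For illustration, a short Python sketch of this pattern on formulas without nesting. The sample formulas and the name NO_NESTING are my own assumptions.

```python
import re

# Pattern from this answer: the group is kept whole, and its
# multiplier is optional because of \d*.
NO_NESTING = re.compile(r"[A-Z][a-z]*\d*|\([^)]+\)\d*")

print(NO_NESTING.findall("Mg(OH)2"))    # ['Mg', '(OH)2']
print(NO_NESTING.findall("(NH4)2SO4"))  # ['(NH4)2', 'S', 'O4']
```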
