Parsing a chemical formula

离开以前 2020-12-13 19:52

I'm trying to write a method for an app that takes a chemical formula like "CH3COOH" and returns some sort of collection of the element symbols it contains.

CH3COOH would retu

5 answers
  •  一向 (original poster)
    2020-12-13 20:45

    I have developed a couple of series of articles on how to parse molecular formulas, including more complex formulas like C6H2(NO2)3CH3 .

    The most recent is my presentation "PLY and PyParsing" at PyCon 2010, where I compare those two Python parsing systems using a molecular formula evaluator as my sample problem. There's even a video of my presentation.

    The presentation was based on a three-part series of articles I did developing a molecular formula parser using ANTLR. In part 3 I compare the ANTLR solution to a hand-written regular expression parser and solutions in PLY and PyParsing.

    The regexp and PLY solutions were first developed in a two-part series on two ways of writing parsers in Python.

    The regexp solution and the base ANTLR/PLY/PyParsing solutions use a regular expression like [A-Z][a-z]?\d* to match terms in the formula. This is what @David M suggested.

    Here it is worked out in Python:

    import re
    
    # element_name is: capital letter followed by optional lower-case
    # count is: empty string (so the count is 1), or a set of digits
    element_pat = re.compile(r"([A-Z][a-z]?)(\d*)")
    
    all_elements = []
    for (element_name, count) in element_pat.findall("CH3COOH"):
        if count == "":
            count = 1
        else:
            count = int(count)
        all_elements.extend([element_name] * count)
    
    print(all_elements)
    

    When I run this (it's hard-coded to use acetic acid, CH3COOH) I get

    ['C', 'H', 'H', 'H', 'C', 'O', 'O', 'H']
    

    Do note that this short bit of code assumes the molecular formula is correct. If you give it something like "##$%^O2#$$#" then it will silently skip the characters it doesn't recognize and give ['O', 'O']. If you don't want that then you'll have to make it a bit more robust.
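
One way to make it stricter, as a rough sketch: walk the string from left to right and require a term to match at every position, instead of letting findall skip over anything it doesn't recognize. The function name `parse_formula_strict` and the error message are my own choices for illustration, and the check is purely syntactic (it does not verify that a symbol such as "Xx" is a real element).

```python
import re

# One term: an element symbol (capital letter plus optional lower-case
# letter) followed by an optional run of digits.
term_pat = re.compile(r"([A-Z][a-z]?)(\d*)")

def parse_formula_strict(formula):
    """Return the element symbols in `formula`; raise on unrecognized text.

    Hypothetical helper for illustration only.
    """
    symbols = []
    pos = 0
    while pos < len(formula):
        # match() is anchored at `pos`, so stray characters cannot be skipped
        m = term_pat.match(formula, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (formula[pos], pos)
            )
        name, count = m.groups()
        symbols.extend([name] * (int(count) if count else 1))
        pos = m.end()
    return symbols

print(parse_formula_strict("CH3COOH"))
```

With this version, "##$%^O2#$$#" raises a ValueError at position 0 rather than quietly returning ['O', 'O'].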

    If you want to support more complicated formulas, like C6H2(NO2)3CH3, then you'll need to know a bit about tree data structures, specifically (as @Roman points out), abstract syntax trees (most often called ASTs). That's too complicated to get into here, so see my talk and essays for more details.
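
To give a flavour of what handling parentheses involves, here is a minimal sketch that does not build an AST at all. It swaps in a simpler technique, a stack of counters with one level per open parenthesis, and returns atom counts rather than a list of symbols. The name `count_atoms` and the token pattern are assumptions made for this example, not the code from the talk or the essays.

```python
import re
from collections import Counter

# A token is one of: element symbol with optional count, "(",
# or ")" with an optional group multiplier.
token_pat = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")

def count_atoms(formula):
    """Return a Counter mapping element symbol -> number of atoms.

    Hypothetical helper for illustration only.
    """
    stack = [Counter()]  # stack[-1] collects the innermost open group
    pos = 0
    while pos < len(formula):
        m = token_pat.match(formula, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (formula[pos], pos)
            )
        element, ecount, open_paren, close_paren, gcount = m.groups()
        if element:
            stack[-1][element] += int(ecount) if ecount else 1
        elif open_paren:
            stack.append(Counter())
        else:  # close_paren
            if len(stack) == 1:
                raise ValueError("unbalanced ')' at position %d" % pos)
            group = stack.pop()
            multiplier = int(gcount) if gcount else 1
            # fold the finished group into the enclosing level
            for name, n in group.items():
                stack[-1][name] += n * multiplier
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

print(dict(count_atoms("C6H2(NO2)3CH3")))
# {'C': 7, 'H': 5, 'N': 3, 'O': 6}
```

This only totals the atoms. If you need to keep the structure of the formula (which group an atom belongs to), that is where a tree representation becomes useful.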
