Algorithms or Patterns for reading text

猫巷女王i 2020-12-18 04:10

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email

3 answers
  •  挽巷 (original poster)
    2020-12-18 04:25

    I think this problem is well suited to a proper parser generator. Regular expressions are hard to test and debug when they go wrong. That said, I would pick a parser generator that is simple enough to use as if it were part of the language.

    For this type of task I would go with pyparsing: it has the power of a full recursive-descent parser, but without a difficult grammar to define, and it comes with very good helper functions. The code is easy to read too.

    from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums
    
    aaa ="""    This is example text that could be many lines long...
                 another line
    
        Location 1
        Product 1     Product 2     Product 3
        $20.99        $21.99        $33.79
    
        stuff in here you want to ignore
    
        Location 2
        Product 1     Product 2     Product 3
        $24.99        $22.88        $35.59 """
    
    # In place of "Location" you could use any other kind of match, such as a regex.
    result = (SkipTo("Location").suppress()
              + OneOrMore(Word(alphas) + Word(nums))  # label/number pairs, e.g. "Product 1"
              + OneOrMore(Word(nums + "$.")))         # prices, e.g. "$20.99"
    
    all_results = OneOrMore(Group(result))
    
    parsed = all_results.parseString(aaa)
    
    for block in parsed:
        print(block)
    

    This prints one list of tokens per location block:

    ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
    ['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
    

    You can group things however you want, but for simplicity I have just returned flat lists. Whitespace is ignored by default, which makes things a lot simpler.
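
    Since the prices are meant to end up in a database, the flat lists still need to be turned into rows at some point. Here is a minimal sketch of one way to do that in plain Python, with no pyparsing needed for this step. The function name `rows_from_block` is made up for illustration, and the code assumes that every price token starts with "$" and that all other tokens come in label/number pairs, as in the output above.

```python
from decimal import Decimal


def rows_from_block(block):
    """Turn one flat token list into (location, product, price) tuples.

    Assumes price tokens start with "$" and every other token belongs to a
    label/number pair such as "Location", "1" or "Product", "2".
    """
    prices = [Decimal(tok.lstrip("$")) for tok in block if tok.startswith("$")]
    words = [tok for tok in block if not tok.startswith("$")]

    # Re-join the pairs: ["Location", "1", "Product", "1"] -> ["Location 1", "Product 1"]
    labels = [" ".join(pair) for pair in zip(words[0::2], words[1::2])]
    location, products = labels[0], labels[1:]

    # Refuse to guess if the columns do not line up.
    if len(products) != len(prices):
        raise ValueError("product/price count mismatch in %r" % (block,))

    return [(location, product, price) for product, price in zip(products, prices)]


block = ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3',
         '$20.99', '$21.99', '$33.79']
for row in rows_from_block(block):
    print(row)
```

    Each tuple could then be passed to whatever insert statement the database layer uses. Decimal is used rather than float so the prices are stored exactly as they appear in the email.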

    I do not know if there are equivalents in other languages.
