Algorithms or Patterns for reading text

猫巷女王i 2020-12-18 04:10

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email

3 answers

挽巷 (original poster)

2020-12-18 04:25

I think this problem is a good fit for a proper parser generator. Regular expressions are hard to test and debug when they go wrong. That said, I would pick a parsing tool that is simple to use, one that feels as if it were part of the language.

For this type of task I would go with pyparsing. It builds recursive-descent parsers (not LR parsers) directly from Python expressions, so you get the power of a real parser without a difficult grammar to define, and it ships with very good helper functions. The code is easy to read too.

from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums

aaa ="""    This is example text that could be many lines long...
             another line

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

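# Skip everything up to the next "Location" marker. In place of the literal
# "Location" you could use any kind of matcher, such as a regular expression.
result = (
    SkipTo("Location").suppress()
    + OneOrMore(Word(alphas) + Word(nums))  # "Location 1", "Product 1", ...
    + OneOrMore(Word(nums + "$."))          # "$20.99", "$21.99", ...
)

# One group per location block.
all_results = OneOrMore(Group(result))

parsed = all_results.parseString(aaa)

for block in parsed:
    print(block)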

This returns a list of lists.

['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
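
Since the prices are meant to end up in a database, the flat lists above can be reshaped into rows with plain Python. This is only a minimal sketch, assuming every block has the shape shown above (word/number label pairs followed by `$` prices); `block_to_rows` is a hypothetical helper name, not part of pyparsing.

```python
def block_to_rows(block):
    """Turn one flat parsed block into (location, product, price) tuples."""
    # Prices are the tokens that start with "$".
    prices = [tok for tok in block if tok.startswith("$")]
    labels = block[: len(block) - len(prices)]
    # Labels arrive as word/number pairs: "Location", "1", "Product", "1", ...
    names = [" ".join(pair) for pair in zip(labels[0::2], labels[1::2])]
    location, products = names[0], names[1:]
    if len(products) != len(prices):
        raise ValueError("product and price counts differ")
    return [(location, product, price) for product, price in zip(products, prices)]


block = ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3',
         '$20.99', '$21.99', '$33.79']
rows = block_to_rows(block)
for row in rows:
    print(row)
```

The length check is there so that a malformed email fails loudly instead of silently pairing the wrong price with a product.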

You can group things however you want, but for simplicity I have just returned lists. Whitespace is skipped by default, which makes things a lot simpler.
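
To illustrate the grouping mentioned here, this is a rough sketch of the same grammar with nested groups and results names. The names `location`, `products` and `prices`, and the shortened sample text, are my own choices for illustration. On newer pyparsing releases `parse_string` is the preferred spelling of `parseString`.

```python
from pyparsing import Group, OneOrMore, SkipTo, Suppress, Word, alphas, nums

text = """    intro text to ignore

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

# Drop the "Location" keyword and keep only its number, under a name.
location = Suppress("Location") + Word(nums)("location")
# Each product becomes its own small group, e.g. ['Product', '1'].
product = Group(Word(alphas) + Word(nums))
# A price starts with "$" and continues with digits and dots.
price = Word("$", nums + ".")

block = Group(
    SkipTo("Location").suppress()
    + location
    + Group(OneOrMore(product))("products")
    + Group(OneOrMore(price))("prices")
)

parsed = OneOrMore(block).parseString(text)

# Pair each product with the price in the same column position.
records = [
    {
        "location": b["location"],
        "prices": {" ".join(p): amount for p, amount in zip(b["products"], b["prices"])},
    }
    for b in parsed
]

for record in records:
    print(record)
```

With names attached, the calling code no longer depends on token positions within each block.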

I do not know if there are equivalents in other languages.
