Algorithms or Patterns for reading text

猫巷女王i 2020-12-18 04:10

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email

3 Answers
  • 2020-12-18 04:25

    I think this problem is a good fit for a proper parser generator. Regular expressions are hard to test and debug when they go wrong. I would pick a parser generator that is simple enough to use as if it were part of the language.

    For this type of task I would go with pyparsing: it has the power of a full recursive-descent parser without a difficult grammar to define, and it comes with very good helper functions. The code is easy to read too.

    from pyparsing import Group, OneOrMore, SkipTo, Word, alphas, nums
    
    aaa ="""    This is example text that could be many lines long...
                 another line
    
        Location 1
        Product 1     Product 2     Product 3
        $20.99        $21.99        $33.79
    
        stuff in here you want to ignore
    
        Location 2
        Product 1     Product 2     Product 3
        $24.99        $22.88        $35.59 """
    
    # "Location" could be replaced by any kind of matcher, such as a Regex.
    result = (
        SkipTo("Location").suppress()
        + OneOrMore(Word(alphas) + Word(nums))
        + OneOrMore(Word(nums + "$."))
    )
    
    all_results = OneOrMore(Group(result))
    
    parsed = all_results.parseString(aaa)
    
    for block in parsed:
        print(block)
    

    This returns a list of lists.

    ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
    ['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']
    

    You can group things however you want, but for simplicity I have just returned flat lists. Whitespace is ignored by default, which makes things a lot simpler.

    I do not know if there are equivalents in other languages.

  • 2020-12-18 04:25

    You have given two sample patterns for the text files.
    I think these can be handled with scripting,
    using something like AWK, sed, or grep driven from bash.


    The pattern in the first sample:

    1. Each section starts with the keyword Location [Number]
      • the second line of the section has columns with the product names
      • the third line of the section has columns with the prices for those products

    There can be a variable number of products per section.
    There can be a variable number of sections per file.
    Products and prices are always on their designated lines within a section.
    Whitespace separation determines which price column belongs to which product column.
    The number of products in a section matches the number of prices in that section.
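
    The rules above can be sketched in Python instead of AWK/sed. This is only a rough illustration: the function name parse_sections, the sample text, and the rule that columns are separated by two or more spaces are my assumptions, not something taken from the real e-mails.

```python
import re

# Sample text modelled on the first pattern (assumed, not a real e-mail).
SAMPLE = """\
This is example text that could be many lines long...

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59
"""

# A section header is a line holding only "Location <number>".
LOCATION_RE = re.compile(r"^\s*(Location\s+\d+)\s*$")
# Assumption: columns are separated by two or more whitespace characters,
# so a single space inside "Product 1" does not split the name.
COLUMN_SPLIT_RE = re.compile(r"\s{2,}")


def parse_sections(text):
    """Return a list of (location, product, price) tuples."""
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        match = LOCATION_RE.match(line)
        if not match or i + 2 >= len(lines):
            continue
        # Products and prices sit on the two lines after the header.
        products = COLUMN_SPLIT_RE.split(lines[i + 1].strip())
        prices = COLUMN_SPLIT_RE.split(lines[i + 2].strip())
        if len(products) != len(prices):
            raise ValueError("product/price count mismatch after line %d" % (i + 1))
        location = " ".join(match.group(1).split())
        rows.extend((location, prod, price) for prod, price in zip(products, prices))
    return rows


if __name__ == "__main__":
    for row in parse_sections(SAMPLE):
        print(row)
```

    Each tuple is already in a shape that could be handed to a database insert. Raising on a count mismatch is a deliberate choice here, so a changed e-mail layout fails loudly instead of pairing the wrong price with a product.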


    The collected data would probably then be loaded into a database.

  • 2020-12-18 04:38

    The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.

    Trying to write the parse engine more generally than that would, I think, be skirting the edge of overprogramming it.
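
    As a rough sketch of what I mean, here are three expressions driving a small line-by-line loop for one e-mail format. The patterns, the function name parse_email, and the "Product N" / "$NN.NN" shapes are assumptions for illustration; each real format would need its own set.

```python
import re

# Hypothetical patterns for ONE e-mail format.
LOCATION = re.compile(r"^\s*Location\s+(\d+)\s*$")  # section header
PRODUCT = re.compile(r"Product\s+\d+")              # product column names
PRICE = re.compile(r"\$\d+\.\d{2}")                 # prices such as $20.99


def parse_email(text):
    """Map each location number to a {product: price} dict."""
    result = {}
    location = None
    products = []
    for line in text.splitlines():
        header = LOCATION.match(line)
        if header:
            # Start a new section and forget any pending product names.
            location = header.group(1)
            products = []
            result[location] = {}
            continue
        if location is None:
            continue  # still in the preamble before the first section
        found_products = PRODUCT.findall(line)
        if found_products:
            products = found_products
            continue
        found_prices = PRICE.findall(line)
        # Only pair prices with products when the counts line up.
        if found_prices and len(found_prices) == len(products):
            result[location].update(zip(products, found_prices))
            products = []
    return result
```

    Lines that match none of the expressions are simply skipped, which is what lets the free-form text between sections be ignored.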
