Algorithms or Patterns for reading text

My company has a client that tracks prices for products from different companies at different locations. This information goes into a database.

These companies email

3 Answers

挽巷

2020-12-18 04:25

I think this problem is a good fit for a proper parser generator. Regular expressions are hard to test and debug when they go wrong. However, I would pick a parser generator that is simple to use, as if it were part of the language.

For this type of task I would go with pyparsing: it gives you the power of a real parser (it builds recursive-descent parsers) without a difficult grammar to define, and it comes with very good helper functions. The code is easy to read, too.

from pyparsing import *

aaa ="""    This is example text that could be many lines long...
             another line

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """

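# "Location" could be replaced by any other kind of match, such as a Regex.
# Parentheses replace the backslash continuations, which broke around the comment.
result = (SkipTo("Location").suppress()
          + OneOrMore(Word(alphas) + Word(nums))   # "Location 1", "Product 1", ...
          + OneOrMore(Word(nums + "$.")))          # "$20.99", "$21.99", ...

all_results = OneOrMore(Group(result))

parsed = all_results.parseString(aaa)

for block in parsed:
    print(block)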

This returns a list of lists.

['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3', '$20.99', '$21.99', '$33.79']
['Location', '2', 'Product', '1', 'Product', '2', 'Product', '3', '$24.99', '$22.88', '$35.59']

You can group things however you want, but for simplicity I have just returned lists. Whitespace is ignored by default, which makes things a lot simpler.
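
If you would rather have each block as a location plus a product-to-price mapping, a little plain Python after the parse is probably enough. This is only a sketch: it assumes the tokens arrive in the order shown above (word/number pairs first, then prices starting with "$"), and `to_record` is a made-up helper name, not something pyparsing provides.

```python
def to_record(tokens):
    """Turn one flat token list into (location, {product: price})."""
    # Assumption: price tokens are exactly the ones that start with "$".
    prices = [float(t.lstrip("$")) for t in tokens if t.startswith("$")]
    words = [t for t in tokens if not t.startswith("$")]

    # Re-join the alternating word/number tokens: "Location" + "1", ...
    names = [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]
    location, products = names[0], names[1:]

    if len(products) != len(prices):
        raise ValueError("product/price count mismatch in " + location)
    return location, dict(zip(products, prices))


block = ['Location', '1', 'Product', '1', 'Product', '2', 'Product', '3',
         '$20.99', '$21.99', '$33.79']
location, price_by_product = to_record(block)
print(location, price_by_product)
```

Product names with more than one word, or prices in another currency, would need a different rule for splitting the tokens.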

I do not know if there are equivalents in other languages.


北恋

2020-12-18 04:25
You have given two pattern samples for text files.
I think these can be handled with scripting,
for example AWK, sed, or grep combined with bash scripting.

The pattern in the first sample:
1. A section starts with the keyword Location [Number]
  - the second line of the section has columns with the product names
  - the third line of the section has columns with the prices for those products
There can be a variable number of products per section.
There can be a variable number of sections per file.
Products and prices are always on their designated lines of a section.
Whitespace separation identifies the (product, price) column association.
The number of products in a section matches the number of prices in that section.

The collected data would probably be assimilated in a database.
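
The rules above can be sketched as a small line-oriented scanner. I have written it in Python rather than AWK to keep it self-contained; it assumes product columns are separated by two or more spaces and that the sample text looks like the one in the question. `parse_sections` and `SAMPLE` are names I made up for illustration.

```python
import re

SAMPLE = """    This is example text that could be many lines long...

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """


def parse_sections(text):
    """Yield (location, [(product, price), ...]) for every section found."""
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        header = re.match(r"\s*(Location\s+\d+)\s*$", lines[i])
        if header and i + 2 < len(lines):
            # Line 2 of a section: product names, split on runs of 2+ spaces
            # (assumption) so that "Product 1" stays in one piece.
            products = re.split(r"\s{2,}", lines[i + 1].strip())
            # Line 3 of a section: prices, split on any whitespace.
            prices = lines[i + 2].split()
            if len(products) != len(prices):
                raise ValueError("column count mismatch in " + header.group(1))
            yield header.group(1), list(zip(products, prices))
            i += 3
        else:
            i += 1  # anything outside a section is ignored


sections = list(parse_sections(SAMPLE))
for location, pairs in sections:
    print(location, pairs)
```

Prices are left as strings here; converting them (and inserting rows into the database) would be a separate step.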

孤街浪徒

2020-12-18 04:38

The one thing I know I would use here is regular expressions. Three or four expressions could drive the parse logic for each e-mail format.

Trying to write the parse engine more generally than that would, I think, be skirting the edge of overprogramming it.
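
To give an idea of what "three or four expressions" might look like for one e-mail format, here is a rough Python sketch. The three patterns and the `parse_email` name are my own assumptions based on the sample layout in this thread (a "Location N" line, "Product N" names, prices like "$20.99"); a real format would need its own set.

```python
import re

# One small expression per thing we care about (all three are assumptions).
LOCATION = re.compile(r"^[ \t]*Location[ \t]+(\d+)[ \t]*$", re.MULTILINE)
PRODUCT = re.compile(r"Product\s+\d+")
PRICE = re.compile(r"\$(\d+\.\d{2})")

SAMPLE = """    Some introduction you want to ignore

    Location 1
    Product 1     Product 2     Product 3
    $20.99        $21.99        $33.79

    stuff in here you want to ignore

    Location 2
    Product 1     Product 2     Product 3
    $24.99        $22.88        $35.59 """


def parse_email(text):
    """Return {location: {product: price}} for one e-mail body."""
    results = {}
    headers = list(LOCATION.finditer(text))
    # Each section runs from one "Location" line to the next (or to the end).
    for current, following in zip(headers, headers[1:] + [None]):
        end = following.start() if following else len(text)
        body = text[current.end():end]
        products = PRODUCT.findall(body)
        prices = [float(p) for p in PRICE.findall(body)]
        results["Location " + current.group(1)] = dict(zip(products, prices))
    return results


parsed = parse_email(SAMPLE)
print(parsed)
```

Note that this version would pick up any dollar amount that appears in the "ignore" text of a section, so it only holds up while the filler text contains no prices.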
