Regular expression on a stream instead of a string?

深忆病人 2020-12-19 08:25

Suppose you want to run a regular expression search and extract over a pipe, but the pattern may span multiple lines. How do you do it? Maybe a regular expression library work for

3 answers
  • 2020-12-19 09:02

    I solved a similar problem for searching a stream using classic pattern matching. You may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out the included kmp_example.py below for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open source library :-)

    #!/usr/bin/env python
    
    # Copyright 2014-2015 @gitagon. For alternative licenses contact the author.
    # 
    # This file is part of streamsearch-py.
    # streamsearch-py is free software: you can redistribute it and/or modify
    # it under the terms of the GNU Affero General Public License as published by
    # the Free Software Foundation, either version 3 of the License, or
    # (at your option) any later version.
    # 
    # streamsearch-py is distributed in the hope that it will be useful,
    # but WITHOUT ANY WARRANTY; without even the implied warranty of
    # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    # GNU Affero General Public License for more details.
    # You should have received a copy of the GNU Affero General Public License
    # along with streamsearch-py.  If not, see <http://www.gnu.org/licenses/>.
    
    
    from streamsearch.matcher_kmp import MatcherKMP
    from streamsearch.buffer_reader import BufferReader
    
    class StringReader():
        """for providing an example read() from string required by BufferReader"""
        def __init__(self, string):
            self.s = string
            self.i = 0
    
        def read(self, buf, cnt):
            if self.i >= len(self.s): return -1
            r = self.s[self.i]
            buf[0] = r
            result = 1
            print "read @%s" % self.i, chr(r), "->", result
            self.i+=1
            return result
    
    def main():
    
        w = bytearray("abbab")
        print "pattern of length %i:" % len(w), w
        s = bytearray("aabbaabbabababbbc")
        print "text:", s
        m = MatcherKMP(w)
        r = StringReader(s)
        b = BufferReader(r.read, 200)
        m.find(b)
        print "found:%s, pos=%s " % (m.found(), m.get_index())
    
    
    if __name__ == '__main__':
        main()
    

    The output is:

    pattern of length 5: abbab
    text: aabbaabbabababbbc
    read @0 a -> 1
    read @1 a -> 1
    read @2 b -> 1
    read @3 b -> 1
    read @4 a -> 1
    read @5 a -> 1
    read @6 b -> 1
    read @7 b -> 1
    read @8 a -> 1
    read @9 b -> 1
    found:True, pos=5 
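
For readers without streamsearch-py at hand, the idea of Knuth-Morris-Pratt matching over a stream can be sketched in standalone Python 3. This is not the library's implementation: the names `build_failure_table` and `kmp_find_in_stream` are made up for illustration, and the stream is assumed to be any binary file-like object with a `read(n)` method. Because KMP never backs up in the input, only the failure table and a counter need to be kept between chunks.

```python
import io


def build_failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it (the classic KMP table)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_in_stream(stream, pattern, chunk_size=4096):
    """Return the offset of the first occurrence of `pattern` (bytes)
    in `stream`, or -1 if the stream ends without a match."""
    if not pattern:
        return 0
    fail = build_failure_table(pattern)
    matched = 0  # how many pattern bytes are matched so far
    offset = 0   # how many stream bytes have been consumed
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1  # end of stream
        for byte in chunk:
            # On a mismatch, fall back to the longest usable prefix.
            while matched and byte != pattern[matched]:
                matched = fail[matched - 1]
            if byte == pattern[matched]:
                matched += 1
            offset += 1
            if matched == len(pattern):
                return offset - len(pattern)


# Same pattern and text as the example above, read 3 bytes at a time.
print(kmp_find_in_stream(io.BytesIO(b"aabbaabbabababbbc"), b"abbab", 3))  # 5
```

This only handles a fixed byte string. A regular expression needs a buffer, since a regex engine may have to revisit earlier input.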
    
  • 2020-12-19 09:12

    I do not believe it is possible to apply a regular expression directly to a stream, because without the entire piece of data you can't make a positive match. This means that you would only have a probable match.

    However, as @James Henstridge stated, you could use buffers to overcome this.

  • 2020-12-19 09:18

    If you are after a general solution, your algorithm would need to look something like:

    1. Read a chunk of the stream into a buffer.
    2. Search for the regexp in the buffer.
    3. If the pattern matches, do whatever you want with the match, discard the start of the buffer up to match.end() and go to step 2.
    4. If the pattern does not match, extend the buffer with more data from the stream and go to step 2. Stop when the stream is exhausted.

    This could end up using a lot of memory if no matches are found, but it is difficult to do better in the general case (consider trying to match .*x as a multi-line regexp in a large file where the only x is the last character).
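
The steps above can be sketched roughly as follows in Python 3 with the standard `re` module. The function name `iter_stream_matches` and the `chunk_size` parameter are made up for illustration, and the stream is assumed to be a text file-like object with `read(n)`. One addition that is not part of the steps: a match that touches the end of the buffer is held back until more data arrives, because a greedy pattern might still extend it. That is only a heuristic and does not cover every pattern. The sketch also assumes the pattern never matches the empty string.

```python
import io
import re


def iter_stream_matches(stream, pattern, chunk_size=4096):
    """Yield match objects for a compiled `pattern` found in a text stream.

    Assumes `pattern` cannot match the empty string.
    """
    buf = ""
    eof = False
    while True:
        m = pattern.search(buf)  # step 2: search the buffer
        # Accept the match only if it cannot grow any further: either the
        # stream is exhausted or the match ends before the buffer does.
        if m and m.group() and (eof or m.end() < len(buf)):
            yield m
            buf = buf[m.end():]  # step 3: discard up to match.end()
            continue
        if eof:
            return
        chunk = stream.read(chunk_size)  # steps 1 and 4: extend the buffer
        if chunk:
            buf += chunk
        else:
            eof = True


text = "BEGIN\na\nEND junk BEGIN\nb\nc\nEND tail"
block = re.compile(r"BEGIN\n(.*?)\nEND", re.DOTALL)
for match in iter_stream_matches(io.StringIO(text), block, chunk_size=5):
    print(repr(match.group(1)))
```

For a pipe, `sys.stdin` could be passed as the stream. Memory use is unbounded when nothing matches, as noted above.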

    If you know more about the regexp, you might have other cases where you can discard part of the buffer.
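
One such case, as a sketch: if you know that no match can be longer than `max_len` characters, then after a failed search only the last `max_len - 1` characters of the buffer can still be part of a future match, so everything before them can be dropped. The function name `iter_bounded_matches` and its parameters are hypothetical. The sketch is Python 3 and assumes a text stream, a pattern that never matches the empty string, and no anchors or lookbehind that depend on the discarded text.

```python
import io
import re


def iter_bounded_matches(stream, pattern, max_len, chunk_size=4096):
    """Yield match objects from a text stream, keeping the buffer small.

    Assumes every match is at most `max_len` characters long and that
    `pattern` cannot match the empty string.
    """
    buf = ""
    eof = False
    while True:
        m = pattern.search(buf)
        # Hold back a match that touches the end of the buffer while more
        # input may still arrive, since it might extend.
        if m and m.group() and (eof or m.end() < len(buf)):
            yield m
            buf = buf[m.end():]
            continue
        if eof:
            return
        if m is None:
            # No match anywhere in the buffer: a future match must include
            # at least one unread character, so it can start no earlier
            # than max_len - 1 characters before the end of the buffer.
            buf = buf[max(0, len(buf) - (max_len - 1)):]
        chunk = stream.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


text = "x" * 10000 + "ab\ncd" + "y" * 10000 + "ab\ncd"
needle = re.compile(r"ab\ncd")
hits = list(iter_bounded_matches(io.StringIO(text), needle, max_len=5, chunk_size=7))
print(len(hits))  # 2
```

With this trimming the buffer stays within roughly `max_len + chunk_size` characters while nothing matches, instead of growing with the stream.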
