Python split string without splitting escaped character

谎友^ 2020-12-08 20:56

Is there a way to split a string without splitting on an escaped character? For example, I have a string and want to split on ':' but not on '\:'.

http\\://ww         


        
10 answers
  • 2020-12-08 21:34

    I know this is an old question, but I recently needed a function like this and couldn't find one that met my requirements.

    Rules:

    • The escape character only has an effect in front of another escape character or the delimiter. E.g. if the delimiter is / and the escape is \, then \a\b\c/abc becomes ['\a\b\c', 'abc'].
    • A doubled escape character collapses to a single one (\\ becomes \).

    So, for the record, and in case someone is looking for something similar, here is my proposed function:

    def str_escape_split(str_to_escape, delimiter=',', escape='\\'):
        """Split a string on a delimiter, honouring an escape character.
    
        Args:
            str_to_escape (str): The text to be split.
            delimiter (str, optional): Single-character delimiter. Defaults to ','.
            escape (str, optional): Single-character escape. Defaults to a backslash.
    
        Yields:
            str: The next field, with escaped delimiters and escapes unescaped.
        """
        if len(delimiter) != 1 or len(escape) != 1:
            raise ValueError("Both delimiter and escape must be single characters")
        token = ''
        escaped = False  # True when the previous character was an unconsumed escape
        for c in str_to_escape:
            if c == escape:
                if escaped:
                    # Escaped escape: emit one literal escape character
                    token += escape
                    escaped = False
                else:
                    escaped = True
                continue
            if c == delimiter:
                if not escaped:
                    yield token
                    token = ''
                else:
                    # Escaped delimiter: keep it as a literal character
                    token += c
                    escaped = False
            else:
                if escaped:
                    # The escape preceded an ordinary character, so keep it
                    token += escape
                    escaped = False
                token += c
        if escaped:
            # A trailing escape has nothing to escape; keep it literally
            token += escape
        yield token
    

    For the sake of sanity, I wrote some tests:

    # Each entry is:
    # ('string_to_split', [expected_result_list])
    tests_slash_escape = [
        ('r/casa\\/teste/g', ['r', 'casa/teste', 'g']),
        ('r/\\/teste/g', ['r', '/teste', 'g']),
        ('r/(([0-9])\\s+-\\s+([0-9]))/\\g<2>\\g<3>/g',
         ['r', '(([0-9])\\s+-\\s+([0-9]))', '\\g<2>\\g<3>', 'g']),
        ('r/\\s+/ /g', ['r', '\\s+', ' ', 'g']),
        ('r/\\.$//g', ['r', '\\.$', '', 'g']),
        ('u///g', ['u', '', '', 'g']),
        ('s/(/[/g', ['s', '(', '[', 'g']),
        ('s/)/]/g', ['s', ')', ']', 'g']),
        ('r/(\\.)\\1+/\\1/g', ['r', '(\\.)\\1+', '\\1', 'g']),
        ('r/(?<=\\d) +(?=\\d)/./', ['r', '(?<=\\d) +(?=\\d)', '.', '']),
        ('r/\\\\/\\\\\\/teste/g', ['r', '\\', '\\/teste', 'g'])
    ]
    
    tests_bar_escape = [
        ('r/||/|||/teste/g', ['r', '|', '|/teste', 'g'])
    ]
    
    def test(test_array, escape):
        """Run str_escape_split on each test case and print the outcome.
    
        Args:
            test_array (list): Pairs of (input string, expected list of fields).
            escape (str): The escape character to use.
        """
        for t in test_array:
            resg = str_escape_split(t[0], '/', escape)
            res = list(resg)
            if res == t[1]:
                print(f"Test {t[0]}: {res} - Pass!")
            else:
                print(f"Test {t[0]}: {t[1]} != {res} - Failed! ")
    
    
    def test_all():
        test(tests_slash_escape, '\\')
        test(tests_bar_escape, '|')
    
    
    if __name__ == "__main__":
        test_all()
    
  • 2020-12-08 21:38

    As Ignacio says, yes, but not trivially in one go. The issue is that you need lookbehind to determine whether you're at an escaped delimiter, and the basic str.split doesn't provide that.
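
The re module does offer lookbehind, so for the simple case where the escape character never needs escaping itself, something like the following might be enough. This is only a rough sketch: split_unescaped is a made-up name, and it assumes that a delimiter counts as escaped whenever the character right before it is the escape character, which means a doubled escape in front of a delimiter still blocks the split.

```python
import re


def split_unescaped(s, delim, escape='\\'):
    """Split s on delim unless delim is directly preceded by escape.

    Hypothetical helper for illustration only. It does not understand
    escaped escape characters, and it leaves the escapes in the output.
    """
    # Negative lookbehind: match delim only when escape is not just before it
    pattern = '(?<!' + re.escape(escape) + ')' + re.escape(delim)
    return re.split(pattern, s)


s = r'http\://www.example.url:ftp\://www.example.url'
print(split_unescaped(s, ':'))
```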

    If this isn't inside a tight loop so performance isn't a significant issue, you can do it by first splitting on the escaped delimiters, then performing the split, and then merging. Ugly demo code follows:

    # Bear in mind this is not rigorously tested!
    def escaped_split(s, delim):
        # split by escaped, then by not-escaped
        escaped_delim = '\\' + delim
        sections = [p.split(delim) for p in s.split(escaped_delim)]
        # Start with the "real" splits of the first section
        ret = sections[0]
        for parts in sections[1:]:
            # An escaped delimiter sat between the last item collected so far
            # and the first item of this section, so join them back together
            ret[-1] = escaped_delim.join([ret[-1], parts[0]])
            # The remaining items of the section are ordinary splits
            ret.extend(parts[1:])
        return ret
    
    s = r'http\://www.example.url:ftp\://www.example.url'
    print(escaped_split(s, ':'))
    # >>> ['http\\://www.example.url', 'ftp\\://www.example.url']
    

    Alternately, it might be easier to follow the logic if you just split the string by hand.

    def escaped_split(s, delim):
        ret = []
        current = []
        itr = iter(s)
        for ch in itr:
            if ch == '\\':
                try:
                    # skip the next character; it has been escaped!
                    current.append('\\')
                    current.append(next(itr))
                except StopIteration:
                    pass
            elif ch == delim:
                # split! (add current to the list and reset it)
                ret.append(''.join(current))
                current = []
            else:
                current.append(ch)
        ret.append(''.join(current))
        return ret
    

    Note that this second version behaves slightly differently when it encounters a double escape followed by a delimiter. It allows escaped escape characters, so escaped_split(r'a\\:b', ':') returns ['a\\\\', 'b']: the first \ escapes the second one, leaving the : to be interpreted as a real delimiter. That's something to watch out for.
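
Both versions above leave the backslashes in the returned pieces. If the unescaped text is what you're after, a small post-processing step could strip them. The helper below is a hypothetical sketch, not part of the original answer; it assumes that every escape character followed by another character should be dropped, whatever that character is.

```python
import re


def unescape(field, escape='\\'):
    """Drop each escape character and keep the character it protects.

    Hypothetical helper for illustration. A lone escape at the very end
    of the field has nothing to protect and is left in place.
    """
    # Match the escape plus the single character after it (newlines included)
    pattern = re.escape(escape) + '(.)'
    return re.sub(pattern, r'\1', field, flags=re.DOTALL)


pieces = [r'http\://www.example.url', r'ftp\://www.example.url']
print([unescape(p) for p in pieces])
```

Applied to the output of either escaped_split, this would turn 'http\://www.example.url' back into 'http://www.example.url'.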

  • 2020-12-08 21:38

    I have created this method, which is inspired by Henry Keiter's answer, but has the following advantages:

    • Variable escape character and delimiter
    • Does not remove the escape character if it is not actually escaping something

    This is the code:

    def _split_string(self, string: str, delimiter: str, escape: str) -> [str]:
        result = []
        current_element = []
        iterator = iter(string)
        for character in iterator:
            if character == escape:
                try:
                    next_character = next(iterator)
                    if next_character != delimiter and next_character != escape:
                        # The escape character is dropped when it escapes the delimiter or the escape
                        # character itself. Here it precedes some other character, so it is not escaping
                        # anything and is copied through unchanged.
                        current_element.append(escape)
                    current_element.append(next_character)
                except StopIteration:
                    current_element.append(escape)
            elif character == delimiter:
                # split! (add current to the list and reset it)
                result.append(''.join(current_element))
                current_element = []
            else:
                current_element.append(character)
        result.append(''.join(current_element))
        return result
    

    This is test code indicating the behavior:

    def test_split_string(self):
        # Verify normal behavior
        self.assertListEqual(['A', 'B'], list(self.sut._split_string('A+B', '+', '?')))
    
        # Verify that escape character escapes the delimiter
        self.assertListEqual(['A+B'], list(self.sut._split_string('A?+B', '+', '?')))
    
        # Verify that the escape character escapes the escape character
        self.assertListEqual(['A?', 'B'], list(self.sut._split_string('A??+B', '+', '?')))
    
        # Verify that the escape character is just copied if it doesn't escape the delimiter or escape character
        self.assertListEqual(['A?+B'], list(self.sut._split_string('A?+B', '\'', '?')))
    
  • 2020-12-08 21:43

    I think simple C-like parsing would be much simpler and more robust.

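    def escaped_split(s, ch):
        if len(ch) > 1:
            raise ValueError('Expected split character. Found string!')
        out = []
        part = ''
        escape = False  # True when the previous character was an unescaped backslash
        for c in s:
            if not escape and c == ch:
                # Unescaped delimiter: finish the current part
                out.append(part)
                part = ''
            else:
                # Backslashes are kept in the output
                part += c
                escape = not escape and c == '\\'
        # Note: a trailing empty part is dropped, unlike str.split
        if part:
            out.append(part)
        return out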
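
If hand-rolling the loop isn't a requirement, the standard library's csv reader runs a similar character-by-character state machine and accepts an escape character. The sketch below is an alternative rather than a drop-in replacement: csv_escaped_split is a made-up name, it assumes the input is a single line, and unlike the loop above it removes the backslashes and keeps a trailing empty field.

```python
import csv


def csv_escaped_split(s, delim, escape='\\'):
    """Split one line on delim, treating escape as the escape character.

    Hypothetical helper for illustration. QUOTE_NONE turns off quote
    handling so that quote characters are treated as ordinary text.
    """
    reader = csv.reader([s], delimiter=delim, escapechar=escape,
                        quoting=csv.QUOTE_NONE)
    # The reader yields one list of fields per input line
    return next(reader)


s = r'http\://www.example.url:ftp\://www.example.url'
print(csv_escaped_split(s, ':'))
```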
    