Split a string consisting of letters and numbers into parts

Posted by 旧街凉风 on 2021-02-07 08:16:05

Question


I have a string consisting of alternating strings of lower-cased letters and numbers (integers or floats), which is of arbitrary length, and I wish to split it into parts, each of maximal possible size, such that a part will consist of either a string or a (string representing a) number.

I don't need to regard special forms of numbers, such as exponents, hexadecimal, etc.; just simple floating point or an integer.

A few examples:

>>> split("")
()
>>> split("p")
('p',)
>>> split("2")
('2',)
>>> split("a2b3")
('a', '2', 'b', '3')
>>> split("a2.1b3")
('a', '2.1', 'b', '3')
>>> split("a.1b3")
('a', '.1', 'b', '3')

However, the following calls should raise some error:

>>> split(3)
>>> split("a0.10.2")
>>> split("ab.c")

My first attempt used re.split. However, it is quite naive, and it does not keep the delimiters when I split on the letters:

>>> re.split("[a-z]", "a.1b3")
['', '.1', '3']
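
As a side note, re.split does keep the delimiters when the pattern is wrapped in a capturing group. A minimal sketch of that behaviour follows; the helper name split_keep_letters is hypothetical, and it still does not validate the numbers, so "0.10.2" passes through unchanged:

```python
import re

def split_keep_letters(s):
    # With a capturing group, re.split includes the matched letter runs.
    parts = re.split(r"([a-z]+)", s)
    # Drop the empty strings that appear at the edges of the result.
    return [p for p in parts if p]

print(split_keep_letters("a.1b3"))
```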

My second attempt was using itertools.groupby. The problem is that it does not care about the form of the number, so, for example:

>>> islowalpha = lambda s: str.isalpha(s) and str.islower(s)
>>> [''.join(g) for _, g in itertools.groupby("a0.10.2b", islowalpha)]  # should raise
['a', '0.10.2', 'b']

Note: I don't care about the form of the output, as long as it is iterable.

Note: I've read this, but I could not adapt the solution to my problem. The main difference is that I need to allow only acceptable numbers, and not a simple list of digits and points.


Answer 1:


import re

def split_gen(x):
    for f, s in re.findall(r'([\d.]+)|([^\d.]+)', x):
        if f:
            float(f)
            yield f
        else:
            yield s

def split(x):
    '''
    >>> split("")
    ()
    >>> split("p")
    ('p',)
    >>> split("2")
    ('2',)
    >>> split("a2b3")
    ('a', '2', 'b', '3')
    >>> split("a2.1b3")
    ('a', '2.1', 'b', '3')
    >>> split("a.1b3")
    ('a', '.1', 'b', '3')
    >>> split(3)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    >>> split("a0.10.2")
    Traceback (most recent call last):
    ...
    ValueError: could not convert string to float: '0.10.2'
    >>> split("ab.c")
    Traceback (most recent call last):
    ...
    ValueError: could not convert string to float: '.'
    '''
    return tuple(split_gen(x))

if __name__ == '__main__':
    import doctest
    doctest.testmod()



Answer 2:


A bit of play with re.sub and itertools.cycle:

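import itertools
import re

def split(s):
    res = []

    def replace(matchobj):
        # Record the matched prefix and strip it from the string.
        res.append(matchobj.group(0))
        return ''

    letter = re.compile(r'^([a-z]+)')
    # Allow any number of digits after a leading point (".12", not just ".1").
    number = re.compile(r'^(\.\d+|\d+\.\d+|\d+)')

    # Start with whichever pattern matches the beginning of the string.
    if letter.match(s):
        c = itertools.cycle([letter, number])
    else:
        c = itertools.cycle([number, letter])

    for op in c:
        mods = op.sub(replace, s)
        if len(s) == len(mods):
            # Nothing was consumed: the input is malformed (or empty).
            return None
        elif not mods:
            return res
        s = mods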

The basic idea - create two alternating re patterns and try to match the input string with them.

A demo with some of your examples:

>>> split("2")
['2']
>>> split("a2b3")
['a', '2', 'b', '3']
>>> split("a.1b3")
['a', '.1', 'b', '3']
>>> split("a0.10.2")
>>> split("ab.c")
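
The same alternating idea can also be sketched without re.sub, by matching at an explicit position and raising an error instead of returning None. This is only a sketch under stated assumptions: the names split_alternating, LETTERS and NUMBER are hypothetical, and a trailing point such as "2." is treated as invalid here, which the question does not specify.

```python
import re

LETTERS = re.compile(r"[a-z]+")
# Assumption: a number is "1.5", ".5" or "15"; "2." is rejected.
NUMBER = re.compile(r"\d+\.\d+|\.\d+|\d+")

def split_alternating(s):
    if not isinstance(s, str):
        raise TypeError("expected a string")
    parts = []
    pos = 0
    # Start with whichever pattern matches at position 0, then alternate.
    current, other = (LETTERS, NUMBER) if LETTERS.match(s) else (NUMBER, LETTERS)
    while pos < len(s):
        m = current.match(s, pos)  # match() is anchored at pos
        if m is None:
            raise ValueError(f"unexpected text at index {pos}: {s[pos:]!r}")
        parts.append(m.group())
        pos = m.end()
        current, other = other, current
    return tuple(parts)
```

Because the patterns must strictly alternate, "a0.10.2" fails after "0.10" has been consumed: the next expected part is a run of letters, and ".2" is not one.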



Answer 3:


The problem is that the premise of your question is ambiguous. How can you distinguish floats within an arbitrary string? There are many ways to interpret the input. For example,
0.10.2 can mean 0.1, 0.2 or 0, .10, .2.
What if the number is 27.6734.98? You first need to specify what kind of numbers are allowed and what format they will have. For example: every number has only one digit after the decimal point.
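
To make the ambiguity concrete, here is a small sketch that contrasts a greedy tokenizer, which silently commits to one reading, with a strict check that asks whether the whole run is a single number. The pattern and the helper names greedy_reading and is_single_number are assumptions chosen for illustration, not part of the question.

```python
import re

# Hypothetical number pattern: optional integer part plus a fractional part,
# or a plain integer.
NUMBER = re.compile(r"\d*\.\d+|\d+")

def greedy_reading(run):
    # Scans left to right and takes the longest match each time,
    # which is only one of several possible readings.
    return NUMBER.findall(run)

def is_single_number(run):
    # True only if the entire run is exactly one number.
    return NUMBER.fullmatch(run) is not None

for run in ("0.10.2", "27.6734.98"):
    print(run, greedy_reading(run), is_single_number(run))
```

The greedy reading splits "0.10.2" into "0.10" and ".2", whereas the strict check reports that the run is not one number, which is the behaviour the question asks for.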




Answer 4:


import re

string = 'a.2b3c4.5d'

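# One alternative per token type: a run of letters, ".5", "4.5", or "3".
REG_STR = r'([a-zA-Z]+)|(\.\d+)|(\d+\.\d+)|(\d+)'
# finditer always returns an iterator, so no extra truthiness filter is needed.
matches = [m.group() for m in re.finditer(REG_STR, string)]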
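
Note that a finditer-based tokenizer like this skips anything it cannot match, so inputs such as "a0.10.2" or "ab.c" do not raise. One possible way to add the checks from the question is to validate the whole string against an alternating grammar first and only then extract the tokens. This is a sketch under stated assumptions: the names split_validated, GRAMMAR and TOKEN are hypothetical, only lower-case letters are accepted, and "2." is treated as invalid.

```python
import re

LET = r"[a-z]+"
NUM = r"(?:\d+\.\d+|\.\d+|\d+)"
# Optional leading number, then letter/number pairs, then optional trailing letters.
GRAMMAR = re.compile(rf"(?:{NUM})?(?:{LET}{NUM})*(?:{LET})?")
TOKEN = re.compile(rf"{LET}|{NUM}")

def split_validated(s):
    if not isinstance(s, str):
        raise TypeError("expected a string")
    # Reject the input unless the whole string alternates letters and numbers.
    if GRAMMAR.fullmatch(s) is None:
        raise ValueError(f"not an alternating letters/numbers string: {s!r}")
    # The string is known to be well formed, so the tokens tile it exactly.
    return tuple(TOKEN.findall(s))

print(split_validated("a.2b3c4.5d"))
```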


Source: https://stackoverflow.com/questions/22863430/split-a-string-consisting-of-letters-and-numbers-into-parts
