How to retrieve Python format code from number represented as string?

Question

I have numeric data stored in ASCII text files, i.e. values of different parameters with one column per parameter. The format may differ between columns but does not change within a column. I load that data into Python, process it and write it back to ASCII files. The catch is that the format of the numbers should not change: the number of decimal places should stay the same, exponential notation should stay exponential notation, and so on. So what I need is a function that returns a format code for each string that represents a number (which I can then store alongside the numbers during processing). Note: parameter types won't change during processing, i.e. integers stay integers, floats stay floats etc. (otherwise, keeping the format code wouldn't make much sense).

My idea would be to use regex to analyse the string, to determine if it is an int, float, float in exponential notation etc.:

import re
string = '3.142'
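# a match means the string is classified as a plain float (no exponent);
# [+-]? allows at most one sign ('[+|-]*' would also accept '|' and repeated signs)
match = re.fullmatch(r'[+-]?[0-9]+[.][0-9]*', string.strip())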

Following this general classification, I'd parse the string to determine e.g. decimal places. For example

string = '3.142' # I know from above that it is a float and not exp notation...
lst = string.strip().split('.')
if not lst[1]: # trailing zeros are hidden
    result = '{:+g}' if '+' in lst[0] else '{:g}'
else:
    result = '{0:+.' if '+' in lst[0] else '{0:.'
    result += str(len(lst[1])) + 'f}'

print(result) # gives... '{0:.3f}'

Q: This seems like a rather clumsy approach. Does anybody have a better solution?

Answer 1:

My answer to my own question, after thinking about the issue for some time: it is kind of an impossible inversion, because the string does not carry all the information that is needed.

Example: suppose you read the string '-5.5'. You cannot tell whether the number has one decimal digit of precision or whether trailing zeros are just hidden. Another (non-numeric) issue is that you don't know whether it is a 'signed' value, i.e. whether a positive number would be written as '+5.5'. Want more? Take '1.2E+1': this could have been the integer 12. Although unlikely, you can't be sure.
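The ambiguity is easy to see by going the other way: several different format codes reproduce exactly the same string, so the string alone cannot tell you which one was used. A minimal sketch (the list of candidate codes is only an illustrative selection, not an exhaustive one):

```python
# Several format codes all render the value -5.5 as the same text.
value = float('-5.5')
candidates = ['{:.1f}', '{:+.1f}', '{:g}', '{:+g}']
rendered = {fmt: fmt.format(value) for fmt in candidates}
for fmt, text in rendered.items():
    print(fmt, '->', text)

# Exponential notation hides the original type as well:
# the int 12 and the float 12.0 produce identical output.
as_int = '{:.1E}'.format(12)
as_float = '{:.1E}'.format(12.0)
print(as_int, as_float, as_int == as_float)
```

Every candidate prints -5.5, and the last line shows that the two exponential strings are equal, so neither the sign flag, the precision nor the type can be recovered from the text with certainty.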

Besides that, there are some minor limitations on the Python side. For example, as far as I know, '{:E}'.format() always generates a signed, zero-padded exponent with at least two digits, i.e. '...E+01', although you might want '...E+1'. Another issue with number formatting is hidden leading and trailing zeros, which I asked about in a separate question. Removing leading/trailing zeros does not seem to be covered by the normal string formatting options; you need additional helpers like .lstrip("0").
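Both limitations can be worked around by post-processing the formatted string. The following is only a sketch: short_exponent and strip_leading_zero are hypothetical helper names, not part of Python's formatting machinery, and they assume '.' as the decimal separator.

```python
import re

def short_exponent(text):
    """Hypothetical helper: remove zero padding from the exponent, keeping one digit."""
    return re.sub(r'([eE][+-])0+(?=[0-9])', r'\1', text)

def strip_leading_zero(text):
    """Hypothetical helper: turn '0.7' into '.7' and '-0.7' into '-.7'."""
    return re.sub(r'^([+-]?)0(?=\.[0-9])', r'\1', text)

padded = '{:.1E}'.format(12.0)   # zero-padded, two-digit exponent
print(padded, '->', short_exponent(padded))
print(strip_leading_zero('{:.1f}'.format(0.7)))
```

A regex is used instead of a bare .lstrip("0") so that a leading sign does not get in the way and so that a number like '10.7' is left alone. Whether a given column needs either helper is again something the format code itself cannot express, so it would have to be stored as an extra flag.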

Here is what I came up with. It does at least a decent job of returning format codes to go from string to number and back to string. It uses a little bit of regex for a general classification and then simple .split() calls etc.

import re
class NumStr():
    def analyse_format(self, s, dec_sep='.'):
        """
        INPUT: 
            s, string, representing a number
        INPUT, optional: 
            dec_sep, string, decimal separator
        WHAT IT DOES:
            1) analyse the string to achieve a general classification
                (decimal, no decimal, exp notation)
            2) pass the string and the general class to an appropriate
                parsing function.
        RETURNS: 
            the result of the parsing function:
                tuple with
                    format code to be used in '{}.format()'
                    suited Python type for the number, int or float.
        """
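        # 1. format definitions. key = general classification.
        sep = re.escape(dec_sep)  # match the separator literally, whatever it is
        # each pattern allows at most one sign ([+-]?), also in the exponent
        redct = {'dec': '[+-]?[0-9]+' + sep + '[0-9]*|[+-]?[0-9]*' + sep + '[0-9]+',
                 'no_dec': '[+-]?[0-9]+',
                 'exp_dec': '[+-]?[0-9]+' + sep + '[0-9]*[eE][+-]?[0-9]+',
                 'exp_no_dec': '[+-]?[0-9]+[eE][+-]?[0-9]+'}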
        # 2. analyse the format to find the general classification.
        gen_class, s = [], s.strip()
        for k, v in redct.items():
            test = re.fullmatch(v, s)
            if test:
                gen_class.append(k)
        if not gen_class:
            raise ValueError("unknown format --> {!r}".format(s))
        elif len(gen_class) > 1:
            raise ValueError("ambiguous result --> {!r} {}".format(s, gen_class))
        # 3. based on the general classification, call string parsing function
        method_name = 'parse_' + str(gen_class[0])
        method = getattr(self, method_name, lambda *args: "Undefined Format!")
        return method(s, dec_sep)

    def parse_dec(self, s, dec_sep):
        lst = s.split(dec_sep)
        # '#' keeps the decimal point when there are no decimals, e.g. '45.'
        result = '{:#.0f}' if len(lst[1]) == 0 else '{:.'+str(len(lst[1]))+'f}'
        result = result.replace(':', ':+') if '+' in lst[0] else result
        return (result, float)

    def parse_no_dec(self, s, *dec_sep):
        result = '{:+d}' if '+' in s else '{:d}'
        return (result, int)

    def parse_exp_dec(self, s, dec_sep):
        lst_dec = s.split(dec_sep)
        lst_E = lst_dec[1].upper().split('E')
        result = '{:.'+str(len(lst_E[0]))+'E}'
        result = result.replace(':', ':+') if '+' in lst_dec[0] else result
        return (result, float)

    def parse_exp_no_dec(self, s, *dec_sep):
        lst_E = s.upper().split('E')
        # precision 0: no decimals in the mantissa, as in the input (e.g. '3E5')
        result = '{:+.0E}' if '+' in lst_E[0] else '{:.0E}'
        return (result, float)

and for testing:

valid = ['45', '45.', '3E5', '4E+5', '3E-3', '2.345E+7', '-7',
         '-45.3', '-3.4E3', ' 12 ', '8.8E1', '+5.3', '+4.',
         '+10', '+2.3E121', '+4e-3','-204E-9668','.7','+.7']
invalid = ['tesT', 'Test45', '7,7E2', '204-100', '.']
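One way to exercise lists like these is a round-trip check: infer a format code, convert the string to a number, format it again and compare. The sketch below is self-contained and therefore uses a compact stand-in, infer_format, built on a single regex with named groups. It is a hypothetical alternative, not the NumStr class above, and it assumes '.' as the decimal separator.

```python
import re

# One pattern with named groups: sign, integer part, optional fraction, optional exponent.
_NUMBER = re.compile(
    r'(?P<sign>[+-]?)(?P<int>[0-9]*)'
    r'(?:(?P<point>\.)(?P<frac>[0-9]*))?'
    r'(?:(?P<e>[eE])[+-]?[0-9]+)?'
)

def infer_format(s):
    """Hypothetical helper: return (format code, type) for a numeric string."""
    m = _NUMBER.fullmatch(s.strip())
    # reject non-matches and strings without any digit in the mantissa ('.', '+')
    if m is None or not (m['int'] or m['frac']):
        raise ValueError('unknown format --> {!r}'.format(s))
    sign = '+' if m['sign'] == '+' else ''
    if m['point'] is None and m['e'] is None:
        return '{:' + sign + 'd}', int
    frac = m['frac'] or ''
    # '#' forces a decimal point even with zero decimals, e.g. '45.'
    alt = '#' if m['point'] and not frac else ''
    kind = m['e'] if m['e'] else 'f'   # 'e', 'E' or 'f'
    return '{:' + sign + alt + '.' + str(len(frac)) + kind + '}', float

valid = ['45', '45.', '3E5', '4E+5', '3E-3', '2.345E+7', '-7',
         '-45.3', '-3.4E3', ' 12 ', '8.8E1', '+5.3', '+4.',
         '+10', '+2.3E121', '+4e-3', '-204E-9668', '.7', '+.7']
invalid = ['tesT', 'Test45', '7,7E2', '204-100', '.']

for s in valid:
    fmt, typ = infer_format(s)
    # input, inferred code, and the re-formatted number side by side
    print(repr(s), fmt, repr(fmt.format(typ(s))))

for s in invalid:
    try:
        infer_format(s)
    except ValueError as err:
        print('rejected:', err)
```

Plain integers and fixed-point values such as '45.', '-45.3' or '+4.' come back unchanged. Strings in exponential notation and strings without a leading zero ('.7') do not, because of the exponent padding and the leading zero that the format codes always produce.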

If you have any ideas for improvement, I'm happy to include them! I guess people already came across this issue.

Source: https://stackoverflow.com/questions/57125678/how-to-retrieve-python-format-code-from-number-represented-as-string

Tags

python

string

numbers

format