Question
I have a list of strings, and I want to split each string on a floating-point number. If the string contains no floating-point number, I want to split it on an integer instead. It should split only once and return everything before and after the number, separated by commas.
Input list:
['Naproxen  500  Active ingredient  Ph Eur',
 'Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur',
 'Povidone K90  11.0   Binder 56 Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous  2.62  Glidant  Ph Eur',
 'Water purified 49 Solvent  Ph Eur',
 'Magnesium stearate  1.38  Lubricant  Ph Eur']
Expected output:
['Naproxen',  '500',  'Active ingredient  Ph Eur',
 'Croscarmellose sodium',  '22.0 mg',  'Disintegrant  Ph Eur',
 'Povidone K90',  '11.0',  'Binder  Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous',  '2.62',  'Glidant  Ph Eur',
 'Water purified', '49',  'Solvent  Ph Eur',
 'Magnesium stearate',  '1.38',  'Lubricant  Ph Eur']
My code:
for i in newresult:
        regex_float_part = re.split(r'\s+(\d+\.\d+)\s+', i, 1)
#        print(regex_float_part)
#        regex_float_part_n = [item for sublist in regex_float_part for item in sublist]
        if regex_float_part:
            all_extract.append(regex_float_part)
        else:
#            regex_integer = r'\s+(\d+(?:\\d+)?)\s+'
            regex_integer_part = re.split(r'\s+(\d+(?:\\d+)?)\s+', i, 1)
#            regex_integer_part_n = [item for sublist in regex_integer_part for item in sublist]
            all_extract.append(regex_integer_part)
The problem is with this input string:
'Water purified 49 Solvent  Ph Eur',
It does not come out as expected, which is:
'Water purified', '49',  'Solvent  Ph Eur'
That is, the code never goes into the else branch. One observation is that the split calls produce a list of lists, i.e. regex_float_part and regex_integer_part end up as lists of lists. Can anyone please help me fix this for the string my code does not handle?
Answer 1:
Your regex is almost correct, but you have to take into consideration that the . and the digits after it might not be there. This can be achieved like this:
\s+(\d+(?:\.\d+)?)\s+
The difference is that the \.\d+ goes into a non-capturing group (?:xxxx), which is made optional by putting a question mark after the group: (?:xxxx)?
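A minimal sketch of this suggestion, applied to sample strings from the question. The helper name split_on_number is made up for illustration. The last lines also show why the else branch in the question never runs: when the pattern does not match, re.split returns a one-item list holding the original string, and a non-empty list is always truthy.

```python
import re

# Optional fractional part, so "49" and "1.38" are both matched.
NUMBER = re.compile(r'\s+(\d+(?:\.\d+)?)\s+')

def split_on_number(text):  # hypothetical helper name
    """Split once on the first whitespace-delimited number."""
    return NUMBER.split(text, maxsplit=1)

print(split_on_number('Water purified 49 Solvent  Ph Eur'))
print(split_on_number('Magnesium stearate  1.38  Lubricant  Ph Eur'))

# Float-only pattern from the question: no match here, yet the result
# is a one-item list, which is still truthy.
no_match = re.split(r'\s+(\d+\.\d+)\s+', 'Water purified 49 Solvent  Ph Eur', maxsplit=1)
print(no_match, bool(no_match))
```

Note that this pattern splits on the first number it finds, integer or float, so it does not prefer a float that appears later in the string. A unit such as "mg" also stays in the trailing part rather than with the number.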
Answer 2:
I suggest using
res = re.match(r'^(?:(?!.*\d\.\d)(.*?)\s*\b(\d+(?:\s*mg)?)\b\s*(.*)|((?:(?!\d+\.\d).)*?)\s*\b(\d+\.\d+(?:\s*mg)?)\b\s*(.*))$', i)
if res:
    all_extract.append(list(filter(None, res.groups())))
Full Python demo without commented code:
import re
def show():
    newresult = ['Naproxen  500  Active ingredient  Ph Eur','Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur','Povidone K90  11.0   Binder 56 Ph Eur','Water, purifieda','Silica, colloidal anhydrous  2.62  Glidant  Ph Eur','Water purified 49 Solvent  Ph Eur','Magnesium stearate  1.38  Lubricant  Ph Eur']
    all_extract = []
    for i in newresult:
        # Alternative 1: no float anywhere in the string, so split on the first integer.
        # Alternative 2: split on the first float. An optional "mg" stays with the number.
        res = re.match(r'^(?:(?!.*\d\.\d)(.*?)\s*\b(\d+(?:\s*mg)?)\b\s*(.*)|((?:(?!\d+\.\d).)*?)\s*\b(\d+\.\d+(?:\s*mg)?)\b\s*(.*))$', i)
        if res:
            # Drop the groups that are None or empty (the alternative that did not match).
            all_extract.append(list(filter(None, res.groups())))
        else:
            # No standalone number matched: fall back to re.split, which
            # returns [i] unchanged when it finds nothing to split on.
            regex_integer_part = re.split(r'\s+(\d+(?:\.\d+)?)\s+', i, 1)
            all_extract.append(regex_integer_part)
    return all_extract
print(show())
yields
[['Naproxen', '500', 'Active ingredient  Ph Eur'], ['Croscarmellose sodium', '22.0 mg', 'Disintegrant  Ph Eur'], ['Povidone K90', '11.0', 'Binder 56 Ph Eur'], ['Water, purifieda'], ['Silica, colloidal anhydrous', '2.62', 'Glidant  Ph Eur'], ['Water purified', '49', 'Solvent  Ph Eur'], ['Magnesium stearate', '1.38', 'Lubricant  Ph Eur']]
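For comparison, the float-first, integer-second logic from the question can also be sketched as two plain re.split calls, checking the length of the result instead of its truthiness. This is a simpler alternative to the single regex above, not the same pattern: the helper name split_float_first is hypothetical, and it does not keep a unit such as "mg" with the number.

```python
import re

FLOAT = re.compile(r'\s+(\d+\.\d+)\s+')
INTEGER = re.compile(r'\s+(\d+)\s+')

def split_float_first(text):  # hypothetical helper name
    """Split once on the first float, or on the first integer if there is no float."""
    parts = FLOAT.split(text, maxsplit=1)
    if len(parts) == 1:  # nothing was split, so no float was found
        parts = INTEGER.split(text, maxsplit=1)
    return parts

rows = [
    'Povidone K90  11.0   Binder 56 Ph Eur',
    'Water purified 49 Solvent  Ph Eur',
    'Water, purifieda',
]
all_extract = []
for row in rows:
    # extend (rather than append) produces one flat list, as in the
    # expected output of the question.
    all_extract.extend(split_float_first(row))
print(all_extract)
```

Using append instead of extend gives one sub-list per input string, as in the output shown above.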
Source: https://stackoverflow.com/questions/60057011/get-all-values-before-a-decimal-number-an-integer-from-a-list-of-strings-in-py