Get all values before a decimal number / an integer from a list of strings in Python

Submitted by 给你一囗甜甜゛ on 2020-03-04 05:05:10

Question


I have a list of strings and I want to split each string on a floating point number. If there is no floating point number in the string, I want to split it on a number. It should only split once and return everything before and after it separated by commas.

Input string:

['Naproxen  500  Active ingredient  Ph Eur',
 'Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur',
 'Povidone K90  11.0   Binder 56 Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous  2.62  Glidant  Ph Eur',
 'Water purified 49 Solvent  Ph Eur',
 'Magnesium stearate  1.38  Lubricant  Ph Eur']

Expected output:

['Naproxen',  '500',  'Active ingredient  Ph Eur',
 'Croscarmellose sodium',  '22.0 mg',  'Disintegrant  Ph Eur',
 'Povidone K90',  '11.0',  'Binder  Ph Eur',
 'Water, purifieda',
 'Silica, colloidal anhydrous',  '2.62',  'Glidant  Ph Eur',
 'Water purified', '49',  'Solvent  Ph Eur',
 'Magnesium stearate',  '1.38',  'Lubricant  Ph Eur']

My code:

for i in newresult:
        regex_float_part = re.split(r'\s+(\d+\.\d+)\s+', i, 1)
#        print(regex_float_part)
#        regex_float_part_n = [item for sublist in regex_float_part for item in sublist]
        if regex_float_part:
            all_extract.append(regex_float_part)
        else:
#            regex_integer = r'\s+(\d+(?:\\d+)?)\s+'
            regex_integer_part = re.split(r'\s+(\d+(?:\\d+)?)\s+', i, 1)
#            regex_integer_part_n = [item for sublist in regex_integer_part for item in sublist]


            all_extract.append(regex_integer_part)

The issue is with this input string:

'Water purified 49 Solvent  Ph Eur',

This does not produce the expected result, which is:

'Water purified', '49',  'Solvent  Ph Eur'

That is, the code never goes into the else part. One observation is that my regex's split function is creating a list of lists, i.e. regex_float_part and regex_integer_part are lists of lists. Can anyone please help me solve this for the string my code does not handle?


Answer 1:


Your regex is almost correct, but you have to take into consideration that the . and the digits after it might not be there. This can be achieved like this:

\s+(\d+(?:\.\d+)?)\s+

The difference is that the \.\d+ part goes into a non-capturing group (?:xxxx), and the question mark after the group, (?:xxxx)?, makes that group optional, so the pattern matches both integers and decimals.
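As a rough sketch of how this pattern could replace the two-step logic in the question, a single `re.split` call with `maxsplit=1` is enough. The name `split_once` is made up for illustration. Two caveats under this approach: the split happens on the first whitespace-delimited number, whether it is an integer or a decimal, and a unit such as `mg` stays in the trailing part rather than with the number.

```python
import re

# Integer or decimal number surrounded by whitespace. The number is captured,
# so re.split keeps it in the result list.
NUMBER = re.compile(r'\s+(\d+(?:\.\d+)?)\s+')

def split_once(text):
    """Split text around the first whitespace-delimited number, if any."""
    return NUMBER.split(text, maxsplit=1)

rows = [
    'Naproxen  500  Active ingredient  Ph Eur',
    'Water, purifieda',
    'Water purified 49 Solvent  Ph Eur',
    'Magnesium stearate  1.38  Lubricant  Ph Eur',
]
for row in rows:
    print(split_once(row))
```

A string without any number comes back as a one-element list containing the unchanged string.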




Answer 2:


I suggest using

res = re.match(r'^(?:(?!.*\d\.\d)(.*?)\s*\b(\d+(?:\s*mg)?)\b\s*(.*)|((?:(?!\d+\.\d).)*?)\s*\b(\d+\.\d+(?:\s*mg)?)\b\s*(.*))$', i)
if res:
    all_extract.append(list(filter(None, res.groups())))

See the regex demo.
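Because the pattern is dense, it may help to read it spread out with `re.VERBOSE`. The sketch below is intended to be the same pattern, only reformatted and commented; the names `PATTERN` and `extract` are illustrative and not part of the answer.

```python
import re

PATTERN = re.compile(r"""
    ^(?:
        (?!.*\d\.\d)                    # branch 1: only if no decimal number occurs anywhere
        (.*?)                           # text before the first integer (lazy)
        \s*\b(\d+(?:\s*mg)?)\b\s*       # the integer, optionally followed by mg
        (.*)                            # everything after it
      |
        ((?:(?!\d+\.\d).)*?)            # branch 2: text up to the first decimal number
        \s*\b(\d+\.\d+(?:\s*mg)?)\b\s*  # the decimal, optionally followed by mg
        (.*)                            # everything after it
    )$
""", re.VERBOSE)

def extract(text):
    """Return the non-empty captured parts, or None when text has no number."""
    match = PATTERN.match(text)
    if match is None:
        return None
    # Only one branch participates, so the other branch's groups are None.
    return list(filter(None, match.groups()))

print(extract('Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur'))
```

Only one of the two branches can match a given string, which is why filtering out the `None` groups leaves at most three parts.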

Full Python demo without commented code:

import re

def show():
    newresult = ['Naproxen  500  Active ingredient  Ph Eur','Croscarmellose sodium  22.0 mg Disintegrant  Ph Eur','Povidone K90  11.0   Binder 56 Ph Eur','Water, purifieda','Silica, colloidal anhydrous  2.62  Glidant  Ph Eur','Water purified 49 Solvent  Ph Eur','Magnesium stearate  1.38  Lubricant  Ph Eur']
    all_extract = []
    for i in newresult:
        res = re.match(r'^(?:(?!.*\d\.\d)(.*?)\s*\b(\d+(?:\s*mg)?)\b\s*(.*)|((?:(?!\d+\.\d).)*?)\s*\b(\d+\.\d+(?:\s*mg)?)\b\s*(.*))$', i)
        if res:
            all_extract.append(list(filter(None, res.groups())))
        else:
            print("ONLY INTEGER")
            regex_integer_part = re.split(r'\s+(\d+(?:\.\d+)?)\s+', i, 1)
            all_extract.append(regex_integer_part)
    return all_extract

print(show())

prints ONLY INTEGER once (for 'Water, purifieda', which contains no number at all and therefore takes the fallback branch) and then yields

[['Naproxen', '500', 'Active ingredient  Ph Eur'], ['Croscarmellose sodium', '22.0 mg', 'Disintegrant  Ph Eur'], ['Povidone K90', '11.0', 'Binder 56 Ph Eur'], ['Water, purifieda'], ['Silica, colloidal anhydrous', '2.62', 'Glidant  Ph Eur'], ['Water purified', '49', 'Solvent  Ph Eur'], ['Magnesium stearate', '1.38', 'Lubricant  Ph Eur']]
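For comparison, here is a minimal sketch that keeps the two-step idea from the question (decimal first, then integer) instead of one combined pattern. The likely reason the original else branch never ran is that `re.split` always returns a list, and when nothing matches that list still contains the unsplit string, so `if regex_float_part:` is always true. Checking the length avoids that. The names `FLOAT`, `INTEGER` and `split_on_number` are hypothetical, and unlike the pattern above this version does not keep a trailing `mg` with the number.

```python
import re

FLOAT = re.compile(r'\s+(\d+\.\d+)\s+')   # decimal number between whitespace
INTEGER = re.compile(r'\s+(\d+)\s+')      # integer between whitespace

def split_on_number(text):
    """Split once on the first decimal number, else on the first integer."""
    parts = FLOAT.split(text, maxsplit=1)
    if len(parts) == 1:  # no decimal found: re.split returned [text]
        parts = INTEGER.split(text, maxsplit=1)
    return parts

print(split_on_number('Water purified 49 Solvent  Ph Eur'))
print(split_on_number('Povidone K90  11.0   Binder 56 Ph Eur'))
```

With this ordering, a decimal later in the string wins over an integer that appears earlier, which matches the preference stated in the question.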



Source: https://stackoverflow.com/questions/60057011/get-all-values-before-a-decimal-number-an-integer-from-a-list-of-strings-in-py
