How to match text in a cell to regex and keep only the text which matches regex?

被撕碎了的回忆 2020-12-22 05:00

What I am trying to do: There is a large Excel sheet with a lot of haphazard customer information. I want to sort the email address and other data in a set format in a new exce

1 answer
  • 2020-12-22 05:13

    This code should work (I could only test the regex part though):

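    import re
    import openpyxl

    def sort_email_from_xl():
        sheet = sheet_select()   # Opens the worksheet (helper from the question)
        # Group 1 captures the address; the lazy .*? on both sides allows surrounding text
        emailRegex = re.compile(r".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")
        customeremails = []
        # Untested part: max_row and cell are not defined in this snippet and
        # have to be taken from the worksheet when looping over the rows
        for row in range(0, max_row):
            match = emailRegex.match(cell.text)
            if match:
                mail = match.groups()[0]   # keep only the matching part
                cell.text = mail
                customeremails.append(mail)
        print(customeremails)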
    

    There were many problems with your code. First about the regex:

    • the regex was not allowing text around your email address, added that with .*? at start and end
    • you don't need the re.VERBOSE part as you'd only need it if you want to add inline comments to your regex, see doc
    • you allowed email addresses with many @ in between
    • you matched the TLD separately, that's unneeded
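To make the first two points concrete, here is a minimal sketch. The sample string and the names `EMAIL`, `bare`, `padded` and `verbose` are made up for illustration; the character classes are the ones from the regex above.

```python
import re

EMAIL = r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}"

bare = re.compile(EMAIL)                         # no allowance for surrounding text
padded = re.compile(r".*?(" + EMAIL + r").*?")   # lazy padding on both sides

sample = "Contact: jane.doe@example.com (billing)"   # made-up cell text

# match() only tries at position 0, so the bare pattern fails on leading text
print(bare.match(sample))               # None
print(padded.match(sample).group(1))    # jane.doe@example.com

# re.VERBOSE is only needed when the pattern is laid out with whitespace and comments
verbose = re.compile(r"""
    [a-zA-Z0-9._%+\-]+   # local part
    @
    [a-zA-Z0-9.\-]+      # domain
    \.[a-zA-Z]{2,4}      # TLD
""", re.VERBOSE)
print(verbose.search(sample).group(0))  # jane.doe@example.com
```

An alternative to padding the pattern with .*? is to call search() instead of match(), as the last line does.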

    Now, the email regex works for basic usage, but I'd definitely recommend taking a proven email regex from other answers on Stack Overflow.

    Then: with emailRegex.match(cell.text) you can check whether cell.text matches your regex, and with emailRegex.match(cell.text).groups()[0] you extract only the matching part. You also had one return statement too many.
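The check-then-extract step can be tried on plain strings, without any spreadsheet. This is only a sketch: `extract_email` is a hypothetical helper name and the sample strings are invented.

```python
import re

emailRegex = re.compile(r".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")

def extract_email(text):
    """Return the first email-like substring of text, or None if there is none."""
    match = emailRegex.match(text)
    # groups()[0] is the content of the single capture group, i.e. the address
    return match.groups()[0] if match else None

# Made-up cell contents
cells = [
    "Jane <jane@example.com>, tel. 555-0100",
    "no address here",
    "bob@mail.example.org",
]
cleaned = [extract_email(c) for c in cells]
print(cleaned)   # ['jane@example.com', None, 'bob@mail.example.org']
```

Storing the match object once avoids running the regex twice per cell.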

    For some reason the above code is giving me a NameError: name 'max_row' is not defined

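    You need to correct the loop over the rows: max_row is not a standalone name but an attribute of the worksheet, so the row count and each cell have to be read from the sheet, e.g. as documented here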
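One way the row loop could look with openpyxl, as a rough sketch rather than a tested fix for the original workbook. It assumes the text sits in the first column, takes the worksheet as an argument instead of calling sheet_select(), and reads cell.value (openpyxl cells expose their content as value). The function name and the in-memory sample rows are made up.

```python
import re
from openpyxl import Workbook

emailRegex = re.compile(r".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")

def sort_email_from_sheet(sheet, column=1):
    """Replace each matching cell in `column` with just the email; return the emails."""
    customeremails = []
    # iter_rows yields one tuple of cells per row, from row 1 to sheet.max_row
    for (cell,) in sheet.iter_rows(min_col=column, max_col=column):
        if not isinstance(cell.value, str):
            continue   # skip empty and numeric cells
        match = emailRegex.match(cell.value)
        if match:
            cell.value = match.groups()[0]
            customeremails.append(cell.value)
    return customeremails

# In-memory workbook with invented rows, standing in for the real file
wb = Workbook()
ws = wb.active
for text in ["Jane <jane@example.com>", "no address", "bob@example.org, Berlin"]:
    ws.append([text])

emails = sort_email_from_sheet(ws)
print(emails)   # ['jane@example.com', 'bob@example.org']
```

Cells without a match are left unchanged here; whether they should be cleared instead depends on what the new sheet is supposed to contain.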
