What I am trying to do: There is a large excel sheet with a lot haphazard customer information. I want to sort the email address and other data in a set format in a new exce
This code should work (I could only test the regex part though):
import sys, os, openpyxl
def sort_email_from_xl():
sheet = sheet_select() #Opens the worksheet
emailRegex = re.compile(".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")
customeremails = []
for row in range(0, max_row):
if emailRegex.match(cell.text):
mail = emailRegex.match(cell.text).groups()[0]
cell.text = mail
customeremails.append(mail)
print(customeremails)
There were many problems with your code. First about the regex:
.*?
at start and endre.VERBOSE
part as you'd only need it if you want to add inline comments to your regex, see doc@
in betweenNow, the email regex works for basic usage, but I'd definitively recommend to take a proven email regex from other answers on Stackoverflow.
Then: with emailRegex.match(cell.text)
you can check if the cell.text
matches your regex and with emailRegex.match(cell.text).groups()[0]
you extract only the matching part. You had one return
statement too much as well.
For some reason the above code is giving me a NameError: name 'max_row' is not defined
You need to correct the looping through the rows e.g. like documented here