How to match text in a cell to regex and keep only the text which matches regex?

后端未结

关注

 1  930

What I am trying to do: There is a large excel sheet with a lot haphazard customer information. I want to sort the email address and other data in a set format in a new exce

相关标签:

1条回答

失恋的感觉

2020-12-22 05:13
This code should work (I could only test the regex part though):
```
import sys, os, openpyxl
def sort_email_from_xl():
    sheet = sheet_select()   #Opens the worksheet
    emailRegex = re.compile(".*?([a-zA-Z0-9\._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,4}).*?")
    customeremails = []
    for row in range(0, max_row):
        if emailRegex.match(cell.text):
            mail = emailRegex.match(cell.text).groups()[0]
            cell.text = mail
            customeremails.append(mail)
    print(customeremails)
```
There were many problems with your code. First about the regex:
- the regex was not allowing text around your email address, added that with .*? at start and end
- you don't need the re.VERBOSE part as you'd only need it if you want to add inline comments to your regex, see doc
- you allowed email addresses with many @ in between
- you matched the TLD separately, that's unneeded
Now, the email regex works for basic usage, but I'd definitively recommend to take a proven email regex from other answers on Stackoverflow.

Then: with emailRegex.match(cell.text) you can check if the cell.text matches your regex and with emailRegex.match(cell.text).groups()[0] you extract only the matching part. You had one return statement too much as well.

For some reason the above code is giving me a NameError: name 'max_row' is not defined

You need to correct the looping through the rows e.g. like documented here
0 讨论(0)
发布评论:

提交评论
- 加载中...