Question
I have two sets of data that I crawled from an HTML table using regular expressions.
data:
<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>
</div>
<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>
</div>
regex:
matchname = re.search(r'<div class="name"><td>(.*?)</td>', match3).group(1)
matchhp = re.search(r'<div class="hp"><td>(.*?)</td>', match3).group(1)
matchemail = re.search(r'<div class="email"><td>(.*?)</td>', match3).group(1)
Using these regexes I can extract:
random
123456
random@mail.com
After saving this set of data into my database, I want to save the next set. How do I get the next set of data? I tried using findall and then inserting into my db, but everything ended up in one line. I need the data to go into the db set by set.
I'm new to Python. Please comment on which part is unclear and I will try to edit.
Answer 1:
You should not parse HTML with regex; it gets messy quickly. Use BeautifulSoup (bs4) instead:
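from bs4 import BeautifulSoup

# match3 is the HTML string from the question
soup = BeautifulSoup(match3, "html.parser")
names = []
allTds = soup.find_all("td")
# Every third <td> starts a new record: name, hp, email
for i, item in enumerate(allTds[::3]):
    names.append((item.text, allTds[(i*3)+1].text, allTds[(i*3)+2].text))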
To answer the question as asked, here is an ugly regex that you should not use. Regex is a particularly poor tool for parsing HTML. (Please don't use this.)
for thisMatch in re.findall(r"<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>", match3, re.DOTALL):
    print(thisMatch[0], thisMatch[1], thisMatch[2])
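If regex really must be used, a slightly less fragile variant of the same idea is to anchor the pattern on the class names from the question and walk the matches with re.finditer, which yields one match per record. This is only a sketch: the variable names html, record_pattern and records are made up for illustration, and the pattern assumes the three fields always appear in the order name, hp, email.

```python
import re

# Sample markup copied from the question.
html = """
<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>
</div>
<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>
</div>
"""

# One pattern spanning a whole record; re.DOTALL lets ".*?" cross newlines.
# Assumes the fields always come in the order name, hp, email.
record_pattern = re.compile(
    r'<div class="name"><td>(.*?)</td>.*?'
    r'<div class="hp"><td>(.*?)</td>.*?'
    r'<div class="email"><td>(.*?)</td>',
    re.DOTALL,
)

# finditer resumes after the previous match, so each iteration is the "next set".
records = [m.groups() for m in record_pattern.finditer(html)]

for name, hp, email in records:
    print(name, hp, email)
```

Each tuple in records is one complete set, so it can be saved on its own before moving on to the next. The parser-based approach is still the safer choice if the markup varies at all.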
Answer 2:
As @Racialz pointed out, you should use an HTML parser instead of regular expressions.
Let's use BeautifulSoup, as @Racialz did, but build a more robust solution: find all info elements and look up each field inside them, producing a list of dictionaries:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<div>
<div class = "info">
<div class="name"><td>random</td></div>
<div class="hp"><td>123456</td></div>
<div class="email"><td>random@mail.com</td></div>
</div>
<div class = "info">
<div class="name"><td>random123</td></div>
<div class="hp"><td>654321</td></div>
<div class="email"><td>random123@mail.com</td></div>
</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
fields = ["name", "hp", "email"]
result = [
    {field: info.find(class_=field).get_text() for field in fields}
    for info in soup.find_all(class_="info")
]
pprint(result)
Prints:
[{'email': 'random@mail.com', 'hp': '123456', 'name': 'random'},
{'email': 'random123@mail.com', 'hp': '654321', 'name': 'random123'}]
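With one dictionary per info block, saving the data set by set comes down to inserting one row per dictionary. The question does not say which database is in use, so the sketch below assumes SQLite through the standard sqlite3 module. The in-memory connection and the contacts table name are hypothetical, and result is written out by hand to match the output printed above.

```python
import sqlite3

# Records in the shape printed above (one dict per "info" block).
result = [
    {"name": "random", "hp": "123456", "email": "random@mail.com"},
    {"name": "random123", "hp": "654321", "email": "random123@mail.com"},
]

# Assumption: an in-memory SQLite database with a hypothetical "contacts" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, hp TEXT, email TEXT)")

# Named placeholders pull values from each dict by key; one row per record.
conn.executemany(
    "INSERT INTO contacts (name, hp, email) VALUES (:name, :hp, :email)",
    result,
)
conn.commit()

# Read the rows back in insertion order to confirm each set has its own row.
rows = conn.execute(
    "SELECT name, hp, email FROM contacts ORDER BY rowid"
).fetchall()
print(rows)
conn.close()
```

Other database drivers use different placeholder syntax, so the INSERT statement would need adjusting for anything other than sqlite3.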
Source: https://stackoverflow.com/questions/37336875/how-do-i-loop-a-re-search-for-the-next-data