How do I loop a re.search to get the next set of data?

Question

I have two sets of data that I scraped from an HTML table using regular expressions.

data:

 <div class = "info"> 
   <div class="name"><td>random</td></div>
   <div class="hp"><td>123456</td></div>
   <div class="email"><td>random@mail.com</td></div> 
 </div>

 <div class = "info"> 
   <div class="name"><td>random123</td></div>
   <div class="hp"><td>654321</td></div>
   <div class="email"><td>random123@mail.com</td></div> 
 </div>

regex:

matchname = re.search(r'<div class="name"><td>(.*?)</td>', match3).group(1)
matchhp = re.search(r'<div class="hp"><td>(.*?)</td>', match3).group(1)
matchemail = re.search(r'<div class="email"><td>(.*?)</td>', match3).group(1)

Using these regexes, I can extract:

random

123456

random@mail.com

After saving this set of data to my database, I want to save the next set. How do I get the next set of data? I tried using findall and then inserting into my database, but everything ended up in one line. I need the data stored in the database set by set.

I'm new to Python. Please comment on any part that is unclear and I will try to edit the question.

Answer 1:

You should not parse HTML with regex; it quickly turns into a mess. Use BeautifulSoup (BS4) instead. Doing it the right way:

from bs4 import BeautifulSoup

soup = BeautifulSoup(match3, "html.parser")
names = []
allTds = soup.find_all("td")
# Every third <td> starts a new record: name, hp, email.
for i, item in enumerate(allTds[::3]):
    #            name       hp                     email
    names.append((item.text, allTds[(i*3)+1].text, allTds[(i*3)+2].text))

For the sake of answering the question as asked, here is a horrible, ugly regex that you should never use, especially because this is HTML: don't ever use regex to parse HTML. (Please don't use this.)

for thisMatch in re.findall(r"<td>(.+?)</td>.+?<td>(.+?)</td>.+?<td>(.+?)</td>", match3, re.DOTALL):
    print(thisMatch[0], thisMatch[1], thisMatch[2])
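
If a regex really has to be used, one way to get the data set by set is to iterate over each info block with re.finditer and then run the question's three re.search patterns inside that block only. The following is a minimal sketch under that assumption, not a robust parser: the names block_pattern and records are illustrative, match3 is filled with the sample data from the question, and the block pattern assumes each info block runs up to the next info block or the end of the string.

```python
import re

# Sample data copied from the question.
match3 = """
 <div class = "info">
   <div class="name"><td>random</td></div>
   <div class="hp"><td>123456</td></div>
   <div class="email"><td>random@mail.com</td></div>
 </div>

 <div class = "info">
   <div class="name"><td>random123</td></div>
   <div class="hp"><td>654321</td></div>
   <div class="email"><td>random123@mail.com</td></div>
 </div>
"""

# Assumption: an info block extends from its opening tag to the next
# info block, or to the end of the string for the last one.
block_pattern = re.compile(
    r'<div class\s*=\s*"info">(.*?)(?=<div class\s*=\s*"info">|\Z)',
    re.DOTALL,
)

records = []
for block in block_pattern.finditer(match3):
    text = block.group(1)
    # The question's patterns, applied to one block at a time.
    # re.search returns None when a field is missing, so real code
    # should check the result before calling .group(1).
    name = re.search(r'<div class="name"><td>(.*?)</td>', text).group(1)
    hp = re.search(r'<div class="hp"><td>(.*?)</td>', text).group(1)
    email = re.search(r'<div class="email"><td>(.*?)</td>', text).group(1)
    records.append((name, hp, email))

for record in records:
    print(record)
```

Each tuple in records is one set of data, so it can be written to the database as one row. The same caveat applies as to the regex above: this breaks as soon as the markup changes.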

Answer 2:

As @Racialz pointed out, you should look into using HTML parsers instead of regular expressions.

Let's use BeautifulSoup, as @Racialz did, but build a more robust solution: find all the info elements and locate each field inside them, producing a list of dictionaries as output:

from pprint import pprint

from bs4 import BeautifulSoup

data = """
<div>
    <div class = "info">
       <div class="name"><td>random</td></div>
       <div class="hp"><td>123456</td></div>
       <div class="email"><td>random@mail.com</td></div>
    </div>

    <div class = "info">
       <div class="name"><td>random123</td></div>
       <div class="hp"><td>654321</td></div>
       <div class="email"><td>random123@mail.com</td></div>
    </div>
</div>
 """
soup = BeautifulSoup(data, "html.parser")

fields = ["name", "hp", "email"]

result = [
    {field: info.find(class_=field).get_text() for field in fields}
    for info in soup.find_all(class_="info")
]

pprint(result)

Prints:

[{'email': 'random@mail.com', 'hp': '123456', 'name': 'random'},
 {'email': 'random123@mail.com', 'hp': '654321', 'name': 'random123'}]
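
The question also asks how to store the data set by set. The question does not say which database is used, so the following is only a sketch that assumes SQLite via Python's standard sqlite3 module, an in-memory database, and a hypothetical contacts table. The result list is copied from the printed output above.

```python
import sqlite3

# The list of dictionaries from the printed output above.
result = [
    {"email": "random@mail.com", "hp": "123456", "name": "random"},
    {"email": "random123@mail.com", "hp": "654321", "name": "random123"},
]

# Assumption: an in-memory SQLite database and a hypothetical "contacts" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, hp TEXT, email TEXT)")

# Named placeholders take their values from each dictionary's keys,
# so executemany inserts one row per set of data.
conn.executemany(
    "INSERT INTO contacts (name, hp, email) VALUES (:name, :hp, :email)",
    result,
)
conn.commit()

rows = conn.execute(
    "SELECT name, hp, email FROM contacts ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

Because each dictionary becomes its own row, the two sets stay separate in the table instead of ending up in one line.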

Source: https://stackoverflow.com/questions/37336875/how-do-i-loop-a-re-search-for-the-next-data

Tags

python

html

regex

html-parsing