Question
I am using Beautiful Soup (bs4) with Python. I currently have this structure:
<div class="class1">
<a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
<p class="specialties"><a href="/location/abcd">ab cd</a></p>
<p class="doc-clinic-name">
<a class="light_grey link" href="/clinic/fff">f ff</a>
</p>
</div>
<div class="class2">
<p class="locality">
<a class="link grey" href="/location/doctors/ccc">c cc</a>
</p>
<p class="fees">INR 999</p>
<div class="timings">
<p><span class="strong">MON-SAT</span><br/><span>11:00AM-1:00PM</span> <span>6:00PM-8:00PM</span></p>
<div class="clear"></div>
</div>
So far, the code I have is this:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('abc.com').read())
for post in soup.find("div", "class1"):
print post
for x in soup.find("div", "class2"):
print x
So basically, post and x contain the class1 and class2 divs. The information I want to extract is:
DR.XXXXXX abcd fff ccc INR 999 MON-SAT 11:00AM-1:00PM
How do I drill down inside the post and x variables to get the required info? Thanks.
EDIT
I have added spaces in the HTML. Is it possible to produce a CSV in the following format without losing the spaces?
DR. XX XXXX,ab cd,f ff,c cc,INR 999,MON-SAT 11:00AM-1:00PM
Answer 1:
>>> ' '.join(soup.find("div", "class1").getText().split())
u'Dr. XXXXXX abcd fff'
>>> ' '.join(soup.find("div", "class2").getText().split())
u'ccc INR 999 MON-SAT11:00AM-1:00PM 6:00PM-8:00PM'
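Note that in the output above, MON-SAT and 11:00AM-1:00PM run together, because the <br/> between them contributes no text. A minimal sketch of one way around this, assuming Python 3 and the built-in "html.parser" (the inline HTML is copied from the question; the variable names are only illustrative): passing a separator to get_text() puts a space between adjacent text nodes.

```python
from bs4 import BeautifulSoup

# Timings paragraph copied from the question's HTML.
html = ('<p><span class="strong">MON-SAT</span><br/>'
        '<span>11:00AM-1:00PM</span> <span>6:00PM-8:00PM</span></p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Approach from the answer above: collapse whitespace after extraction.
# The <br/> has no text, so the first two spans end up glued together.
glued = " ".join(p.get_text().split())

# Alternative: join the text nodes with a space and strip each of them.
spaced = p.get_text(" ", strip=True)

print(glued)
print(spaced)
```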
Answer 2:
First off, your indentation looks wrong. Secondly, I don't think you need a for loop when just using find, as it should just return the first match.
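To illustrate the point about find, here is a small sketch, assuming Python 3 and the built-in "html.parser", with a shortened copy of the question's HTML. It shows that find gives back a single Tag, that looping over that Tag walks its children rather than a list of matches, and that find_all is the call that returns every match.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the first div from the question.
html = """
<div class="class1">
<a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
<p class="specialties"><a href="/location/abcd">ab cd</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag (or None when nothing matches).
div = soup.find("div", "class1")

# Iterating over a Tag yields its direct children, including the
# whitespace text nodes between the child tags.
children = list(div)
child_tags = [c.name for c in children if isinstance(c, Tag)]

# find_all returns a list-like ResultSet with every match.
all_divs = soup.find_all("div", "class1")

print(child_tags)
print(len(all_divs))
```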
If you just want the text of the links, you could try:
for link in soup.find("div", {"class": "class1"}).findAll("a"):
    print link.text
Or, if you want the link target (the href) itself:
for link in soup.find("div", {"class": "class1"}).findAll("a"):
    print link.get("href")
You should also note the method used to search for a class: passing a dict to the find method. (Edit: I suspect there are other ways of doing this. This is just the way I learnt to do it!)
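On the "other ways" mentioned above: as far as I know, reasonably recent bs4 releases accept at least the following equivalent forms. This is a sketch assuming Python 3, the built-in "html.parser", and a one-line copy of the question's first div; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = ('<div class="class1"><a class="name" href="/doctor/dr-xxxxxxxxx">'
        '<h2>Dr. XX XXXX</h2></a></div>')
soup = BeautifulSoup(html, "html.parser")

# Dict of attributes, as used in this answer.
by_dict = soup.find("div", {"class": "class1"})

# class_ keyword (trailing underscore because "class" is reserved in Python).
by_keyword = soup.find("div", class_="class1")

# A bare string as the second argument is treated as a CSS class.
by_position = soup.find("div", "class1")

# CSS selector syntax.
by_selector = soup.select_one("div.class1")

print(by_dict == by_keyword == by_position == by_selector)
```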
You can therefore be as specific as you need to be, e.g.:
doctorlink = soup.find("div", {"class": "class1"}).find("a", {"class": "name"})
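Putting the pieces together for the CSV asked about in the question's EDIT, here is one possible sketch. It assumes Python 3 (the question itself uses Python 2), the built-in "html.parser", and an inline copy of the question's HTML with a closing </div> added for class2. The field order and the choice to keep only the first time slot follow the asker's desired output, and the variable names are hypothetical. The csv module writes each field as-is, so the spaces inside the values are preserved.

```python
import csv
import io

from bs4 import BeautifulSoup

# HTML from the question (closing </div> for class2 added as an assumption).
html = """
<div class="class1">
<a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
<p class="specialties"><a href="/location/abcd">ab cd</a></p>
<p class="doc-clinic-name">
<a class="light_grey link" href="/clinic/fff">f ff</a>
</p>
</div>
<div class="class2">
<p class="locality">
<a class="link grey" href="/location/doctors/ccc">c cc</a>
</p>
<p class="fees">INR 999</p>
<div class="timings">
<p><span class="strong">MON-SAT</span><br/><span>11:00AM-1:00PM</span> <span>6:00PM-8:00PM</span></p>
<div class="clear"></div>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div1 = soup.find("div", {"class": "class1"})
div2 = soup.find("div", {"class": "class2"})

# Day range plus the first time slot only, as in the asker's desired output.
spans = div2.find("div", {"class": "timings"}).find("p").find_all("span")
timings = " ".join(s.get_text(strip=True) for s in spans[:2])

row = [
    div1.find("a", {"class": "name"}).get_text(strip=True),
    div1.find("p", {"class": "specialties"}).get_text(strip=True),
    div1.find("p", {"class": "doc-clinic-name"}).get_text(strip=True),
    div2.find("p", {"class": "locality"}).get_text(strip=True),
    div2.find("p", {"class": "fees"}).get_text(strip=True),
    timings,
]

# Write one CSV line into an in-memory buffer; a real file opened with
# newline="" would work the same way.
buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_line = buf.getvalue().strip()
print(csv_line)
```

For a page listing several doctors, the same lookups could be run inside a loop over find_all("div", {"class": "class1"}), provided each class1 div can be paired with its class2 div.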
Source: https://stackoverflow.com/questions/21581147/extracting-scraping-text-from-a-href-inside-p-inside-div