Question
I'm trying to extract all the links on a page. So far I can get the links, but their anchor text doesn't provide any relevant information; that information is contained in a sibling tag instead.
This is the HTML layout:
<tbody>
<tr>
<td>
<h3>Driver with license E or F</h3>
<div class = "date">..</div>
<br>
<p>...</p>
<div id='print'>
<a href="show_classifieds?..." class="bar">Go To Details</a>
</div>
<br>
</td>
</tr>
<tr>
<td>
<h3>Payroll Administrator</h3>
<div class = "date">..</div>
<br>
<p>...</p>
<div id='print'>
<a href="show_classifieds?..." class="bar">Go To Details</a>
</div>
<br>
</td>
</tr>
<tr>
<td>
<h3>Head of Sales and Marketing</h3>
<div class = "date">..</div>
<br>
<p>...</p>
<div id='print'>
<a href="show_classifieds?..." class="bar">Go To Details</a>
</div>
<br>
</td>
</tr>
</tbody>
When I extract the links, I get the following:
<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>
But:
I'm interested in replacing the text Go To Details with the text in the h3 tag in each case.
These links will be displayed on an external website, so I'd prefer them to be absolute instead of relative.
Hence, in the end I'm hoping for something like this:
<a href="http://www.example.com/show_classifieds?..." class="bar">Driver with license E or F</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Payroll Administrator</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Head of Sales and Marketing</a>
Any help will be greatly appreciated.
Answer 1:
To give you a stable solution, you really need to make sure that all pages follow exactly the same pattern as your example.
Basic Assumption:
The text you want always resides in the h3 tag that is a sibling of the div with id print, which in turn is the parent of the anchor.
from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # here is how you get the text from the 'h3' tag
    header = a.parent.find_previous_sibling('h3').text
    # here is how you set the text of the anchor tag to be the text of the 'h3' tag
    a.string = header
    print(a)
Further Reading: tag.string
You can use urljoin with the site's base URL to construct absolute URLs if you want.
Output (the order of the attributes may differ depending on your Python version):
<a class="bar" href="show_classifieds?...">Driver with license E or F</a>
<a class="bar" href="show_classifieds?...">Payroll Administrator</a>
<a class="bar" href="show_classifieds?...">Head of Sales and Marketing</a>
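The urljoin suggestion can be combined with the loop in a rough sketch like the one below. It assumes Python 3 and the base URL http://www.example.com/ from the question's desired output. The relabel_links helper, the SAMPLE_HTML snippet and its stray "about" link are hypothetical names and data made up for illustration. Anchors with no matching h3 are skipped instead of raising an error, which is one possible way to cope with pages that deviate from the pattern.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, taken from the desired output in the question.
BASE_URL = "http://www.example.com/"

# Hypothetical sample: two rows from the question plus one stray link
# that has no h3 heading next to its parent.
SAMPLE_HTML = """
<div><p><a href="about">About</a></p></div>
<table><tbody>
<tr><td>
<h3>Driver with license E or F</h3>
<div class="date">..</div>
<p>...</p>
<div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
<tr><td>
<h3>Payroll Administrator</h3>
<div class="date">..</div>
<p>...</p>
<div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
</tbody></table>
"""


def relabel_links(html, base_url):
    """Return the anchors that have an h3 heading, relabelled and made absolute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # The heading is expected to be an earlier sibling of the anchor's parent.
        heading = a.parent.find_previous_sibling("h3")
        if heading is None:
            # The page does not follow the assumed pattern here; skip this link.
            continue
        a.string = heading.get_text(strip=True)
        # Resolve the relative href against the base URL.
        a["href"] = urljoin(base_url, a["href"])
        links.append(a)
    return links


if __name__ == "__main__":
    for link in relabel_links(SAMPLE_HTML, BASE_URL):
        print(link)
```

Whether to skip, log, or raise on a link without a heading depends on how much you trust the pages to follow the same layout; skipping is just the simplest choice for a sketch.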
Source: https://stackoverflow.com/questions/20157186/beautifulsoup-replace-anchor-text-with-text-from-another-tag