BeautifulSoup: Replace anchor text with text from another tag

回眸只為那壹抹淺笑 submitted on 2019-12-25 01:55:44

Question


I'm trying to extract all the links on a page. So far I can get the links, but their anchor text doesn't provide any relevant information. That information is contained in another tag nearby.

This is the HTML layout:

<tbody>
<tr>
    <td>
        <h3>Driver with license E or F</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
    </td>
</tr>
<tr>
    <td>
        <h3>Payroll Administrator</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
    </td>
</tr>
<tr>
    <td>
        <h3>Head of Sales and Marketing</h3>
        <div class = "date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
        <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>    
    </td>
</tr>
</tbody>

When I extract the links, I get the following:

<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>

But:

  1. I'd like to replace the text Go To Details with the text of the <h3> tag in each case.

  2. These links will be displayed on an external website, so I'd prefer them to be absolute rather than relative.

In the end, I'm hoping for something like this:

<a href="http://www.example.com/show_classifieds?..." class="bar">Driver with license E or F</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Payroll Administrator</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Head of Sales and Marketing</a>

Any help will be gratefully appreciated.


Answer 1:


To give you a stable solution, you really need to make sure that all pages follow exactly the same pattern as your example.

Basic Assumption:

The text you want always resides in the h3 tag, which is a sibling of the div with id 'print', which in turn is the parent of the anchor tag.

from bs4 import BeautifulSoup

# html holds the page source as a string
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # get the text of the 'h3' tag that precedes the anchor's parent div
    header = a.parent.find_previous_sibling('h3').text
    # replace the anchor's text with the text of the 'h3' tag
    a.string = header
    print(a)
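Since this approach depends on every link having an h3 before its parent div, a slightly more defensive variant might check for a missing h3 instead of raising an AttributeError. The following is only a sketch under that assumption: the sample markup and the retitle_links helper are hypothetical and not part of the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row has no <h3>, so its link is left unchanged.
html = """
<table><tbody>
<tr><td>
  <h3>Payroll Administrator</h3>
  <div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
<tr><td>
  <div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
</tbody></table>
"""


def retitle_links(markup):
    """Return all 'bar' anchors, retitled from the preceding <h3> where one exists."""
    soup = BeautifulSoup(markup, "html.parser")
    links = soup.find_all("a", class_="bar")
    for a in links:
        # find_previous_sibling returns None when no matching sibling exists
        header = a.parent.find_previous_sibling("h3")
        if header is not None:
            a.string = header.get_text(strip=True)
    return links


links = retitle_links(html)
for a in links:
    print(a)
```

Links without a matching h3 keep their original text here. Whether to skip them, log them, or fail loudly depends on how strictly the pages follow the pattern.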

Further Reading: tag.string

You can use urljoin (urllib.parse.urljoin in Python 3) with the site's base URL to construct absolute URLs if you want.

Output:

<a class="bar" href="show_classifieds?...">Driver with license E or F</a>
<a class="bar" href="show_classifieds?...">Payroll Administrator</a>
<a class="bar" href="show_classifieds?...">Head of Sales and Marketing</a>
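To also make the links absolute, as the question asks, the urljoin suggestion could be combined with the loop roughly as follows. This is a minimal sketch for Python 3 (on Python 2 the import would be from urlparse import urljoin). BASE_URL is an assumption taken from the example domain in the question, and the sample markup is cut down to a single row.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the site's base URL, taken from the example in the question.
BASE_URL = "http://www.example.com/"

# One row of the layout from the question, kept short for illustration.
html = """
<table><tbody><tr><td>
  <h3>Payroll Administrator</h3>
  <div id="print">
    <a href="show_classifieds?..." class="bar">Go To Details</a>
  </div>
</td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=True)
for a in links:
    # use the text of the preceding h3 as the anchor text
    a.string = a.parent.find_previous_sibling("h3").get_text(strip=True)
    # resolve the relative href against the base URL
    a["href"] = urljoin(BASE_URL, a["href"])
    print(a)
```

urljoin resolves the relative href against the base URL the same way a browser would, so hrefs that are already absolute pass through unchanged.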


Source: https://stackoverflow.com/questions/20157186/beautifulsoup-replace-anchor-text-with-text-from-another-tag
