BeautifulSoup: extract text from anchor tag

误落风尘 2020-12-01 00:34

I want to extract:

  • the text of the src attribute of the image tag, and
  • the text of the anchor tag that is inside the div with class data
5 answers
  • 2020-12-01 01:16

    All of the answers above helped me build my own, so I upvoted each of them. In the end I put together this answer for the exact problem I was dealing with.

    As the question states, I had to reach some siblings and their children in the DOM structure. This solution iterates over the images in the DOM, builds each image's file name from the product title, and saves the image to a local directory.

    import os
    from urllib.request import urlretrieve
    
    import requests
    from bs4 import BeautifulSoup
    
    def getImages(url):
        # Download the page and parse it
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Expand '~' explicitly; os.path.join does not do it
        output_folder = os.path.expanduser('~/amazon')
        os.makedirs(output_folder, exist_ok=True)
        # Each product image sits in a <div class="image">
        for div in soup.find_all('div', attrs={'class': 'image'}):
            try:
                # getting the data div using find_next
                nextDiv = div.find_next('div', attrs={'class': 'data'})
                # use find_next again on that div to get to the anchor tag
                fileName = nextDiv.find_next('a').text
                imageUrl = div.find('img')['src']
            except (AttributeError, TypeError, KeyError):
                # a tag or attribute was missing, so there is nothing to save
                print('skip')
                continue
            modified_file_name = fileName.strip().replace(' ', '-') + '.jpg'
            outputPath = os.path.join(output_folder, modified_file_name)
            urlretrieve(imageUrl, outputPath)
    
    if __name__ == '__main__':
        url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
        getImages(url)
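
    The find_next navigation used above can be tried without any network access. The following is only a minimal sketch: the markup in SAMPLE_HTML and the helper name pair_images_with_titles are made up for illustration, and the real page layout may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an image div followed by a data div holding the title.
SAMPLE_HTML = """
<div class="product">
  <div class="image"><img src="http://image.example.com/img1.jpg" /></div>
  <div class="data"><a class="title" href="http://www.example.com/eg1">Camera One</a></div>
</div>
<div class="product">
  <div class="image"><img src="http://image.example.com/img2.jpg" /></div>
  <div class="data"><a class="title" href="http://www.example.com/eg2">Camera Two</a></div>
</div>
"""


def pair_images_with_titles(html):
    """Return (image src, anchor text) tuples, one per image div."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        # find_next searches forward in document order from this div
        data_div = div.find_next("div", class_="data")
        anchor = data_div.find("a") if data_div is not None else None
        if img is None or anchor is None:
            continue  # incomplete product entry, nothing to pair
        pairs.append((img.get("src"), anchor.get_text(strip=True)))
    return pairs


pairs = pair_images_with_titles(SAMPLE_HTML)
for src, title in pairs:
    print(src, title)
```

    One caveat: find_next keeps searching through the rest of the document, so an image div with no data div of its own would be paired with the data div of a later product.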
    
  • 2020-12-01 01:17

    This will help:

    from bs4 import BeautifulSoup
    
    data = '''<div class="image">
            <a href="http://www.example.com/eg1">Content1<img  
            src="http://image.example.com/img1.jpg" /></a>
            </div>
            <div class="image">
            <a href="http://www.example.com/eg2">Content2<img  
            src="http://image.example.com/img2.jpg" /> </a>
            </div>'''
    
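    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(data, 'html.parser')
    
    for div in soup.find_all('div', attrs={'class': 'image'}):
        print(div.find('a')['href'])       # link target
        print(div.find('a').contents[0])   # first child of the anchor: its text
        print(div.find('img')['src'])      # image URL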
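
    Why contents[0] rather than .string: in this markup the anchor holds both a text node and an img tag, and .string only returns a value when a tag has a single child. A small sketch of the difference, reusing one anchor from the data above (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup

html = ('<a href="http://www.example.com/eg1">Content1'
        '<img src="http://image.example.com/img1.jpg" /></a>')
anchor = BeautifulSoup(html, "html.parser").find("a")

# Two children (text node + <img>), so .string gives None
string_value = anchor.string
# The first child is the text node itself
first_child = anchor.contents[0]
# get_text() joins all text inside the tag, however many children it has
text_value = anchor.get_text(strip=True)

print(repr(string_value), repr(str(first_child)), repr(text_value))
```

    get_text() is probably the safer choice when the number of children inside the anchor is not known in advance.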
    

    If you are looking into Amazon products then you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.

  • 2020-12-01 01:25

    In my case, this worked:

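    from urllib.request import urlopen
    
    from bs4 import BeautifulSoup
    
    url = "http://blabla.com"
    
    # parse the fetched page and print the text of every anchor
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    for link in soup.find_all('a'):
        # .string is None when the anchor has more than one child node
        print(link.string)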
    

    Hope it helps!

  • 2020-12-01 01:30

    I would suggest going the lxml route and using xpath.

    from lxml import etree
    # data is the variable containing the html
    tree = etree.HTML(data)
    # xpath() returns a list with the text of every matching anchor
    anchor = tree.xpath('//a[@class="title"]/text()')
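
    The same approach can also pull the image src values asked about in the question. This is a rough sketch on made-up markup: SAMPLE_HTML and the div class names in the XPath expressions are assumptions taken from the question's wording, not from the real page.

```python
from lxml import etree

# Hypothetical markup standing in for the real page
SAMPLE_HTML = """
<div class="image"><img src="http://image.example.com/img1.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg1">Camera One</a></div>
<div class="image"><img src="http://image.example.com/img2.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg2">Camera Two</a></div>
"""

tree = etree.HTML(SAMPLE_HTML)

# text() selects the text nodes of the matching anchors
titles = [str(t) for t in tree.xpath('//div[@class="data"]/a[@class="title"]/text()')]
# @src selects the attribute value instead of the element
sources = [str(s) for s in tree.xpath('//div[@class="image"]/img/@src')]

for src, title in zip(sources, titles):
    print(src, title)
```

    Pairing the two lists with zip only lines up correctly if every image div has a matching data div.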
    
  • 2020-12-01 01:33

    >>> import bs4
    >>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
    >>> fragment = bs4.BeautifulSoup(txt)
    >>> fragment
    <a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
    >>> fragment.find('a', {'class': 'title'})
    <a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
    >>> fragment.find('a', {'class': 'title'}).string
    u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'
    