Python: associate URL ids and URL titles in lists

Question

Continuation of this question: Python beautifulsoup how to get the line after 'href'

I have this HTML code

    <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html" class="ss-titre"> 
                            Monte le son         </a>
    <div class="rs-cell-details">
                            <a href="http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html"  class="ss-titre">
                                    "Rubin_Steiner"                 </a>
<a href="http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html" class="ss-titre"> 
                        Fare maohi              </a>

As you can see, "Monte le son" and "Rubin_Steiner" are associated with the same id (101973832), and "Fare maohi" is associated with the id 102103928.

So, currently I have these lists (one url per id, but possibly several titles per id):

url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']      
titles = ['Monte le son', 'Rubin_Steiner', 'Fare maohi']   #2 entries for id 101973832
                                                           #1 entry for id 102103928

An id could have 3 titles, or 1, or none...

How can I associate the id in the address (101973832) with its titles, to get this result:

result = ['"Monte le son Rubin_Steiner 101973832"', 'Fare maohi 102103928']

The result will be displayed in my Gtk interface. Each entry needs to contain the id so that I can find the corresponding url, like this:

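choice = self.liste.get_active_text()     # choice is one entry of result
video_id = choice.split()[-1]             # the id is the last word of the entry
for adress in url:
    if video_id in adress:
        adresse = adress                  # keep the matching url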
I hope my question is not too difficult to understand...

Edit: I get the titles and the urls like this:

url = "http://pluzz.francetv.fr/recherche?recherche=" + mot # mot is a word for my Gtk search
try:
   f = urllib.urlopen(url)
   page = f.read()
   f.close()
except: 
   self.champ.set_text("La recherche a échoué")
   pass    
soup = BeautifulSoup(page)
titres=[]
list_url=[]
for link in soup.findAll('a'):
     lien = link.get('href')
     if lien is None:
         lien = ""
     if "http://pluzz.francetv.fr/videos/" in lien:
         titre = (link.text.strip())
         if "Voir cette  vidéo" in titre:
              titre = ""
         if "Lire la vidéo" in titre:
              titre = ""
         titres.append(titre)
         list_url.append(lien)

Answer 1:

If I understand you correctly, and all your urls and titles are in lists like your example, you can pull the id out of the url with a regular expression and append it to the titles:

import re

In [111]: titles = ['Monte le son', 'Rubin_Steiner']

In [112]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html']

In [113]: get_id = re.findall(r'\d+', url[0]) # find consecutive digits

In [114]: results = [x for x in titles] + get_id

In [115]: results
Out[115]: ['Monte le son', 'Rubin_Steiner', '101973832']
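
Note that `\d+` returns every run of digits in the url, so a slug that itself contains a number would produce extra matches. A minimal sketch of a stricter pattern, assuming (from the two example urls only) that the id always sits between a comma and the trailing `.html`; `extract_id` is a hypothetical helper name and the second url below is made up for illustration:

```python
import re

# Assumption: the id is the digit run between a comma and the trailing ".html".
ID_PATTERN = re.compile(r',(\d+)\.html$')

def extract_id(url):
    """Return the video id as a string, or None if the url does not match."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

# Url taken from the question.
print(extract_id('http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html'))
# Hypothetical url whose slug contains digits; only the trailing id is returned.
print(extract_id('http://pluzz.francetv.fr/videos/top_50_,102103928.html'))
```

If the site uses other url shapes, the pattern would need adjusting.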

As I said in my comments, group corresponding titles in sublists as you add them to your titles list. With a flat list it is impossible to tell which title belongs to which url without some way of indexing the groupings. I have grouped them in sublists here to show you how it works.

In [3]: url = ['http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html',   'http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html']

In [4]: titles = [['Monte le son', 'Rubin_Steiner'], ['Fare maohi']]   # need to sub list to match to url position

In [5]: get_ids = [re.findall(r'\d+', x) for x in url] # get all ids, position in list will match sub list position in titles

In [6]: results= [t + i for t, i in zip(titles, get_ids)] # this is why sub lists are useful, each position of the sub lists correspond.

In [7]: results

Out[7]: [['Monte le son', 'Rubin_Steiner', '101973832'], ['Fare maohi', '102103928']]

In [11]: final_results=[ " ".join(y) for y in  results ] # join strings in each sublist

In [12]: final_results

Out[12]: ['Monte le son Rubin_Steiner 101973832', 'Fare maohi 102103928']
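
The sublists above were typed by hand. One possible way to build them while scraping is to key the titles on the id as each link is visited. The sketch below assumes you collect `(href, title)` pairs like the ones appended to `list_url` and `titres` in the question's loop, and that the id sits between a comma and `.html` as in the example urls. `group_by_id` and `find_url` are hypothetical helper names, not part of any library.

```python
import re
from collections import OrderedDict

ID_PATTERN = re.compile(r',(\d+)\.html$')  # assumption: id sits just before ".html"

def group_by_id(pairs):
    """Group (href, title) pairs by video id, keeping first-seen order."""
    groups = OrderedDict()  # id -> {'url': href, 'titles': [...]}
    for href, title in pairs:
        match = ID_PATTERN.search(href)
        if match is None:
            continue  # not a video link
        entry = groups.setdefault(match.group(1), {'url': href, 'titles': []})
        if title:  # skip titles that the scraping loop blanked out
            entry['titles'].append(title)
    return groups

def find_url(choice, groups):
    """Map a display string back to its url via the trailing id."""
    return groups[choice.split()[-1]]['url']

pairs = [
    ('http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'Monte le son'),
    ('http://pluzz.francetv.fr/videos/monte_le_son_live_,101973832.html', 'Rubin_Steiner'),
    ('http://pluzz.francetv.fr/videos/fare_maohi_,102103928.html', 'Fare maohi'),
]
groups = group_by_id(pairs)

# One display string per id: titles first, id last.
result = [' '.join(entry['titles'] + [video_id]) for video_id, entry in groups.items()]
print(result)
print(find_url(result[1], groups))
```

Because the id is the last word of each display string, the text selected in the Gtk widget can be mapped straight back to its url without scanning the url list.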

Source: https://stackoverflow.com/questions/23674562/python-associate-urlss-ids-and-urls-titles-in-lists

Tags

python

list

beautifulsoup