Question
I've written a Python script to extract data from HTML elements in a table. The sample below is a single tr tag. My goal is to get the data, including the href links, from the elements with class fn. My current attempt extracts the text from every class fn element but not the links. How can I change the script below to get the links as well? Thanks in advance for any solution.
This is what I've tried so far:
from bs4 import BeautifulSoup
content="""
<tr>
<td align="center">1964</td>
<td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
<span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
<span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
<td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
<td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
<span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
<td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
<td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
<td align="center">—</td>
</tr>
"""
soup = BeautifulSoup(content, "lxml")
for items in soup.select('tr'):
    item_name = [item.text for item in items.select(".fn a")]
    print(item_name)
Current output:
['Charles Hard Townes', 'Nikolay Basov', 'Alexander Prokhorov', 'Dorothy Hodgkin', 'Konrad Emil Bloch', 'Feodor Felix Konrad Lynen', 'Jean-Paul Sartre', 'Martin Luther King, Jr.']
To reiterate: the expected output is all the data from class fn, including the href links.
Answer 1:
This modified code returns the href together with the text:
from bs4 import BeautifulSoup
content="""
<tr>
<td align="center">1964</td>
<td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
<span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
<span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
<td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
<td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
<span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
<td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
<td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
<td align="center">—</td>
</tr>
"""
soup = BeautifulSoup(content, "lxml")
for items in soup.select('tr'):
    item_name = [[item.text, item.get('href')] for item in items.select(".fn a")]
    print(item_name)
Output:
[['Charles Hard Townes', '/wiki/Charles_Hard_Townes'], ['Nikolay Basov', '/wiki/Nikolay_Basov'], ['Alexander Prokhorov', '/wiki/Alexander_Prokhorov'], ['Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'], ['Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'], ['Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'], ['Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'], ['Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.']]
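The hrefs returned above are site-relative paths. If absolute URLs are needed, a minimal sketch using the standard library's urljoin could look like the following. The base URL is an assumption (the paths look like Wikipedia links, but the question does not name the site), and absolutize is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Assumed base URL; the question never states which site the table came from.
BASE_URL = "https://en.wikipedia.org"


def absolutize(pairs, base=BASE_URL):
    """Return [name, absolute_url] pairs built from [name, relative_href] pairs."""
    return [[name, urljoin(base, href)] for name, href in pairs]


# Two of the pairs produced by the answer above.
pairs = [
    ["Charles Hard Townes", "/wiki/Charles_Hard_Townes"],
    ["Dorothy Hodgkin", "/wiki/Dorothy_Hodgkin"],
]
for name, url in absolutize(pairs):
    print(name, url)
```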
Answer 2:
You can use either bs4 or regular expressions:
bs4:
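from bs4 import BeautifulSoup as soup
# content is the HTML string from the question
s = soup(content, 'lxml')
# Read text and href from the same tag so the two values always stay paired
new_data = [(i.text, i['href']) for i in s.find_all('a', href=True)]
Output (this matches every link in the row, including the [D] footnote link, not only the links inside class fn):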
[(u'Charles Hard Townes', '/wiki/Charles_Hard_Townes'), (u'Nikolay Basov', '/wiki/Nikolay_Basov'), (u'Alexander Prokhorov', '/wiki/Alexander_Prokhorov'), (u'Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin'), (u'Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch'), (u'Feodor Felix Konrad Lynen', '/wiki/Feodor_Felix_Konrad_Lynen'), (u'Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre'), (u'[D]', '#endnote_Note1D'), (u'Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')]
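If only the links inside class fn are wanted, a CSS selector can narrow the match so the footnote link is left out. The sketch below assumes a shortened stand-in for the question's content string (the name sample is hypothetical) and uses the built-in html.parser backend rather than lxml, only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's `content` string.
sample = """
<tr>
<td><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference"><a href="#endnote_Note1D">[D]</a></sup></td>
<td><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
</tr>
"""

s = BeautifulSoup(sample, "html.parser")
# ".fn a[href]" keeps only anchors that sit inside class fn and carry an href,
# so the [D] footnote link inside <sup> is not selected.
new_data = [(a.get_text(), a["href"]) for a in s.select(".fn a[href]")]
print(new_data)
```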
Regex:
import re
# Capture href and title from the same <a> tag so the pairs cannot drift apart;
# links without a title (such as the [D] footnote) are skipped.
final_data = re.findall(r'<a href="([^"]*)"[^>]*title="([^"]*)"', content)
Output:
[('/wiki/Charles_Hard_Townes', 'Charles Hard Townes'), ('/wiki/Nikolay_Basov', 'Nikolay Basov'), ('/wiki/Alexander_Prokhorov', 'Alexander Prokhorov'), ('/wiki/Dorothy_Hodgkin', 'Dorothy Hodgkin'), ('/wiki/Konrad_Emil_Bloch', 'Konrad Emil Bloch'), ('/wiki/Feodor_Felix_Konrad_Lynen', 'Feodor Felix Konrad Lynen'), ('/wiki/Jean-Paul_Sartre', 'Jean-Paul Sartre'), ('/wiki/Martin_Luther_King,_Jr.', 'Martin Luther King, Jr.')]
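Regular expressions depend on the exact attribute order in the markup. As a hedged alternative that needs no third-party package, the sketch below uses the standard library's html.parser to collect links nested inside a span with class fn. The class name FnLinkParser and the sample string are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class FnLinkParser(HTMLParser):
    """Collect (text, href) for <a> tags nested inside <span class="fn">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._span_stack = []  # one flag per open <span>: True if it has class "fn"
        self._href = None      # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            self._span_stack.append("fn" in classes)
        elif tag == "a" and any(self._span_stack) and "href" in attrs:
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._span_stack:
            self._span_stack.pop()
        elif tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


# One cell taken from the question's HTML, including the [D] footnote link.
sample = (
    '<td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard">'
    '<span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">'
    'Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D">'
    '<a href="#endnote_Note1D">[D]</a></sup></td>'
)
parser = FnLinkParser()
parser.feed(sample)
print(parser.links)
```

The footnote link is ignored here because it sits inside a sup tag rather than inside a class fn span.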
Answer 3:
Slightly simpler: no need to select the table rows separately.
soup = BeautifulSoup(content, "lxml")
links = soup.select('tr .fn a')
for link in links:
    print(link.attrs['href'])
    print(link.text)
Answer 4:
You can try bs4 instead of using regex:
from bs4 import BeautifulSoup
content="""
<tr>
<td align="center">1964</td>
<td><span class="sortkey">Townes, Charles Hard</span><span class="vcard"><span class="fn"><a href="/wiki/Charles_Hard_Townes" class="mw-redirect" title="Charles Hard Townes">Charles Hard Townes</a></span></span>;<br>
<span class="sortkey">Basov, Nikolay</span><span class="vcard"><span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></span>;<br>
<span class="sortkey">Prokhorov, Alexander</span><span class="vcard"><span class="fn"><a href="/wiki/Alexander_Prokhorov" title="Alexander Prokhorov">Alexander Prokhorov</a></span></span></td>
<td><span class="sortkey">Hodgkin, Dorothy</span><span class="vcard"><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></span></td>
<td><span class="sortkey">Bloch, Konrad Emil</span><span class="vcard"><span class="fn"><a href="/wiki/Konrad_Emil_Bloch" title="Konrad Emil Bloch">Konrad Emil Bloch</a></span></span>;<br>
<span class="sortkey">Lynen, Feodor Felix Konrad</span><span class="vcard"><span class="fn"><a href="/wiki/Feodor_Felix_Konrad_Lynen" class="mw-redirect" title="Feodor Felix Konrad Lynen">Feodor Felix Konrad Lynen</a></span></span></td>
<td><span class="sortkey">Sartre, Jean-Paul</span><span class="vcard"><span class="fn"><a href="/wiki/Jean-Paul_Sartre" title="Jean-Paul Sartre">Jean-Paul Sartre</a></span></span><sup class="reference" id="ref_Note1D"><a href="#endnote_Note1D">[D]</a></sup></td>
<td><span class="sortkey">King, Jr., Martin Luther</span><span class="vcard"><span class="fn"><a href="/wiki/Martin_Luther_King,_Jr." class="mw-redirect" title="Martin Luther King, Jr.">Martin Luther King, Jr.</a></span></span></td>
<td align="center">—</td>
</tr>
"""
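soup = BeautifulSoup(content, "lxml")
for i in soup.find_all('td'):
    # find() returns only the first <a> in each cell
    link = i.find('a')
    if link is not None:
        print((link.attrs['title'], link.attrs['href']))
Output (only the first link in each td is printed, so cells with several names are cut short):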
('Charles Hard Townes', '/wiki/Charles_Hard_Townes')
('Dorothy Hodgkin', '/wiki/Dorothy_Hodgkin')
('Konrad Emil Bloch', '/wiki/Konrad_Emil_Bloch')
('Jean-Paul Sartre', '/wiki/Jean-Paul_Sartre')
('Martin Luther King, Jr.', '/wiki/Martin_Luther_King,_Jr.')
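Because find('a') stops at the first match, names such as Nikolay Basov are missing from the output above. A minimal sketch that keeps every class fn link and groups them by table cell could look like this. It assumes a shortened stand-in for the question's HTML (sample and per_cell are hypothetical names) and uses the built-in html.parser backend.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's `content` string.
sample = """
<tr>
<td align="center">1964</td>
<td><span class="fn"><a href="/wiki/Charles_Hard_Townes" title="Charles Hard Townes">Charles Hard Townes</a></span>;<br>
<span class="fn"><a href="/wiki/Nikolay_Basov" title="Nikolay Basov">Nikolay Basov</a></span></td>
<td><span class="fn"><a href="/wiki/Dorothy_Hodgkin" title="Dorothy Hodgkin">Dorothy Hodgkin</a></span></td>
</tr>
"""

soup = BeautifulSoup(sample, "html.parser")
per_cell = []
for td in soup.find_all("td"):
    # select() returns every matching anchor in the cell, not just the first
    links = [(a.get_text(), a.get("href")) for a in td.select(".fn a")]
    if links:  # skip cells such as the year, which hold no .fn links
        per_cell.append(links)

for links in per_cell:
    print(links)
```

Grouping by td keeps people who share a cell together, which a flat list of links would lose.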
Source: https://stackoverflow.com/questions/48051766/unable-to-get-all-the-data-including-links-from-a-tr-tag