Since this parsing job requires extracting both text and attribute
values, it cannot be done entirely "out-of-the-box" by a function such as
pd.read_html. Some of it has to be done by hand.
Using lxml, you could extract the attribute values with XPath:
import lxml.html as LH
import pandas as pd
content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''
table = LH.fromstring(content)
for df in pd.read_html(content):
    df['refname'] = table.xpath('//tr/td/a/@href')
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)
yields
     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01
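The pattern `([^./]+)[.]` captures the first run of characters that contains neither `.` nor `/` and is immediately followed by a dot. For these hrefs that is the file name without its extension. A minimal check with the standard `re` module, used here only for illustration (the `hrefs` list simply repeats the two values from the sample table):

```python
import re

# Same pattern as passed to str.extract above.
pattern = re.compile(r'([^./]+)[.]')

# The two href values from the sample table.
hrefs = ['/j/jones03.shtml', '/s/smith01.shtml']

# search() returns the first match; group(1) is the captured file stem.
refnames = [pattern.search(href).group(1) for href in hrefs]
print(refnames)
```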
Combining pd.read_html with XPath this way may be useful since it requires
only a few extra lines of code to add the refname column.
But both LH.fromstring and pd.read_html parse the HTML, so the work is done twice.
Efficiency could be improved by removing pd.read_html and
parsing the table once with LH.fromstring:
table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')]
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)
yields
    id       name  val  refname
0  265  JonesBlue   29  jones03
1  266      Smith   34  smith01
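One caveat with `table.xpath('//tr/td/a/@href')`: it returns one flat list for the whole table, so the assignment only lines up if every row contains exactly one link. If some rows might have no `<a>` element, a possible variation is to collect the href row by row. The sketch below assumes a hypothetical input in which the second row has no link, and the `href` column name is made up for the example:

```python
import lxml.html as LH
import pandas as pd

# Hypothetical input: the second row has no <a> element.
content = '''
<table>
<tr><td>265</td><td><a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>267</td><td>Brown</td><td>41</td></tr>
</table>'''

table = LH.fromstring(content)
rows = []
for tr in table.xpath('//tr'):
    # Text of every cell in this row.
    cells = [td.text_content().strip() for td in tr.xpath('td')]
    # Relative XPath: only links inside this row; empty list if there are none.
    hrefs = tr.xpath('td/a/@href')
    rows.append(cells + [hrefs[0] if hrefs else None])

df = pd.DataFrame(rows, columns=['id', 'name', 'val', 'href'])
# expand=False returns a Series; rows without an href end up as missing values.
df['refname'] = df['href'].str.extract(r'([^./]+)[.]', expand=False)
print(df)
```

Rows without a link then get a missing value in refname instead of shifting the remaining values up or raising a length-mismatch error.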