HTML table to pandas table: Info inside html tags

轮回少年 2021-01-04 21:02

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:


&l         
3 Answers
  •  情歌与酒
    2021-01-04 21:38

    Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out of the box" by a function such as pd.read_html. Some of it has to be done by hand.

    Using lxml, you could extract the attribute values with XPath:

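    import lxml.html as LH
    import pandas as pd
    from io import StringIO

    # Assumed reconstruction: the original markup and href values were lost.
    content = '''
    <table>
    <tr><td>265</td> <td> <a href="/j/jones03.shtml">Jones</a>Blue</td> <td>29</td></tr>
    <tr><td>266</td> <td> <a href="/s/smith01.shtml">Smith</a></td> <td>34</td></tr>
    </table>'''
    table = LH.fromstring(content)
    # wrap in StringIO: newer pandas deprecates passing literal HTML
    for df in pd.read_html(StringIO(content)):
        df['refname'] = table.xpath('//tr/td/a/@href')  # one href per row
        df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
        print(df)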

yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01

The above may be useful since it requires only a few extra lines of code to add the refname column.
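
To see what the pattern `([^./]+)[.]` does on its own, here is a minimal sketch using the standard `re` module. The href values are assumed for illustration, chosen only so that they agree with the refname output above. `str.extract` returns the first match of the capture group, which `re.search` mirrors here.

```python
import re

# Hypothetical href values, chosen only to match the refname output above.
hrefs = ["/j/jones03.shtml", "/s/smith01.shtml"]

# [^./]+ matches a run of characters containing neither '.' nor '/';
# the trailing [.] requires that run to be followed by a literal dot,
# so the capture is the file name without its directory or extension.
pattern = re.compile(r"([^./]+)[.]")

refnames = [pattern.search(h).group(1) for h in hrefs]
print(refnames)
```

An href with no dot in it would produce no match, so real data may need a check for `None` (or for NaN when using `str.extract`).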

But both LH.fromstring and pd.read_html parse the HTML, so the markup is parsed twice. Efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from <td> tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)

yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
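
If a third-party parser is not wanted at all, the same single-pass idea can be sketched with only the standard library. This is not part of the answer above, just a rough alternative: `RowCollector`, `sample`, and the href values are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td> and the first href found in each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows as (cell_texts, href)
        self._cells = None   # cells of the row being read, or None outside <tr>
        self._href = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")
        elif tag == "a" and self._in_td and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            self.rows.append((self._cells, self._href))
            self._cells, self._in_td = None, False

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data  # text inside nested tags is appended too


# Assumed sample markup (hypothetical; consistent with the output above).
sample = '''
<table>
<tr><td>265</td> <td> <a href="/j/jones03.shtml">Jones</a>Blue</td> <td>29</td></tr>
<tr><td>266</td> <td> <a href="/s/smith01.shtml">Smith</a></td> <td>34</td></tr>
</table>'''

parser = RowCollector()
parser.feed(sample)
parser.close()

records = [
    {"id": int(cells[0]), "name": cells[1].strip(), "val": int(cells[2]), "href": href}
    for cells, href in parser.rows
]
print(records)
```

The resulting list of dicts can be passed to `pd.DataFrame(records)`, after which the same `str.extract` call can be applied to the `href` column.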
