Parsing an HTML table with pd.read_html where cells contain full tables themselves

Question

I need to parse an HTML table that has other tables nested inside its cells. When called with pd.read_html as below, each nested table is parsed and then "inserted"/"concatenated" into the result as extra rows.

I'd like each nested table to be parsed into its own pd.DataFrame and then inserted as an object, as the value in the corresponding column.

If that is not possible, having the raw HTML of the nested table as a string in the corresponding position would be fine.

Code as tested:

import pandas as pd
df_up = pd.read_html("up_pf00344.test.html", attrs = {'id': 'results'})

Screenshot of output:

Screenshot of table as rendered in html:

Link to file: https://gist.github.com/smsaladi/6adb30efbe70f9fed0306b226e8ad0d8#file-up_pf00344-test-html-L62

Answer 1:

You can't use read_html to read nested tables directly, but you can write your own HTML reader and call read_html only on the cells that contain a table:

from io import StringIO

import bs4
import pandas as pd

with open('up_pf00344.test.html') as f:
    html = f.read()
soup = bs4.BeautifulSoup(html, 'lxml')
results = soup.find(attrs={'id': 'results'})

# use the first visible header row as the dataframe columns
for row in results.thead.find_all('tr'):
    if 'display:none' not in row.get('style', ''):
        df = pd.DataFrame(columns=[col.get_text() for col in row.find_all('th')])
        break  # stop only once a visible header row has been found

# append every visible top-level row to the dataframe
for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue
    df_row = []
    for col in row.find_all('td', recursive=False):
        # cells holding a nested table become DataFrames, all others plain text
        table = col.find_all('table')
        # StringIO: recent pandas versions deprecate passing literal HTML strings
        df_row.append(pd.read_html(StringIO(str(col)))[0] if table else col.get_text())
    df.loc[len(df)] = df_row

Result of df.iloc[0].map(type):

                                                            <class 'str'>
Entry                                                       <class 'str'>
Organism                                                    <class 'str'>
Protein names                                               <class 'str'>
Gene names                                                  <class 'str'>
Length                                                      <class 'str'>
Cross-reference (Pfam)                                      <class 'str'>
Cross-reference (InterPro)                                  <class 'str'>
Taxonomic lineage IDs                                       <class 'str'>
Subcellular location [CC]                                   <class 'str'>
Signal peptide                                              <class 'str'>
Transit peptide                                             <class 'str'>
Topological domain                  <class 'pandas.core.frame.DataFrame'>
Transmembrane                       <class 'pandas.core.frame.DataFrame'>
Intramembrane                       <class 'pandas.core.frame.DataFrame'>
Sequence caution                                            <class 'str'>
Caution                                                     <class 'str'>
Taxonomic lineage (SUPERKINGDOM)                            <class 'str'>
Taxonomic lineage (KINGDOM)                                 <class 'str'>
Taxonomic lineage (PHYLUM)                                  <class 'str'>
Cross-reference (RefSeq)                                    <class 'str'>
Cross-reference (EMBL)                                      <class 'str'>
e                                                           <class 'str'>
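
To try the approach without the original file, here is a minimal, self-contained sketch. The inline HTML, the column names, and the helper cell_value (with its parse_nested flag) are made up for illustration and are not part of the original page or of pandas. It assumes pandas, bs4 and lxml are installed, and it skips the hidden-row handling shown above. Passing parse_nested=False keeps the nested table as a raw HTML string, which is the fallback mentioned in the question.

```python
from io import StringIO

import bs4
import pandas as pd

# Hypothetical miniature of the page: the second cell of the first row
# holds a nested table, the second row has an empty cell instead.
HTML = """
<table id="results">
  <thead><tr><th>Entry</th><th>Transmembrane</th></tr></thead>
  <tbody>
    <tr id="row1">
      <td>P12345</td>
      <td><table><tr><th>start</th><th>end</th></tr><tr><td>10</td><td>30</td></tr></table></td>
    </tr>
    <tr id="row2"><td>Q67890</td><td></td></tr>
  </tbody>
</table>
"""


def cell_value(td, parse_nested=True):
    """Illustrative helper: DataFrame (or raw HTML) for a nested table, else the cell text."""
    nested = td.find("table")
    if nested is None:
        return td.get_text(strip=True)
    if parse_nested:
        # read_html returns a list of DataFrames; the cell holds one table
        return pd.read_html(StringIO(str(nested)))[0]
    return str(nested)  # fallback: keep the nested table as an HTML string


soup = bs4.BeautifulSoup(HTML, "lxml")
results = soup.find(attrs={"id": "results"})
columns = [th.get_text(strip=True) for th in results.thead.find_all("th")]

# recursive=False keeps the rows and cells of nested tables out of the outer loop
rows = [
    [cell_value(td) for td in tr.find_all("td", recursive=False)]
    for tr in results.tbody.find_all("tr", recursive=False)
]
df = pd.DataFrame(rows, columns=columns)

# Same first row, but with the nested table kept as raw HTML
first_row = results.tbody.find("tr")
raw_cells = [cell_value(td, parse_nested=False)
             for td in first_row.find_all("td", recursive=False)]
```

Here df.iloc[0, 1] is a DataFrame of its own, while cells without a nested table stay plain strings; raw_cells[1] is the nested table's markup as a string.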

Bonus: since your table rows each have an id, you could use it as the index of your dataframe by writing df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
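
A small sketch of that id-as-index variant, under stated assumptions: the inline HTML, the ids, and the column names are invented for illustration, and every row is assumed to carry a unique id (a repeated id would overwrite the earlier row, and a row without one would be stored under None).

```python
import bs4
import pandas as pd

# Hypothetical table whose rows carry an id attribute
HTML = """
<table id="results">
  <tbody>
    <tr id="P12345"><td>Homo sapiens</td><td>321</td></tr>
    <tr id="Q67890"><td>Mus musculus</td><td>298</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")
results = soup.find(attrs={"id": "results"})

df = pd.DataFrame(columns=["Organism", "Length"])
for row in results.tbody.find_all("tr", recursive=False):
    cells = [td.get_text(strip=True) for td in row.find_all("td", recursive=False)]
    # label the new row with the <tr> id instead of a running number
    df.loc[row.get("id")] = cells
```

Rows can then be looked up by id, for example df.loc["Q67890", "Length"].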

Source: https://stackoverflow.com/questions/58280302/parsing-an-html-table-with-pd-read-html-where-cells-contain-full-tables-themselv

Tags

python

html

pandas

beautifulsoup

lxml