Parsing an html table with pd.read_html where cells contain full-tables themselves

梦想的初衷 submitted on 2021-01-28 20:07:22

Question


I need to parse a table from HTML that has other tables nested within the larger table. When called as below with pd.read_html, each of these nested tables is parsed and then "inserted"/"concatenated" as rows of the outer table.

I'd like each of these nested tables to be parsed into its own pd.DataFrame and then inserted as an object as the value of the corresponding column.

If this is not possible, having the raw HTML of the nested table as a string in the corresponding position would be fine.

Code as tested:

import pandas as pd
df_up = pd.read_html("up_pf00344.test.html", attrs = {'id': 'results'})

Screenshot of output:

Screenshot of table as rendered in html:

Link to file: https://gist.github.com/smsaladi/6adb30efbe70f9fed0306b226e8ad0d8#file-up_pf00344-test-html-L62


Answer 1:


read_html cannot keep nested tables inside their cells, but you can roll your own HTML reader with BeautifulSoup and call read_html only on the cells that contain a table:

import io

import bs4
import pandas as pd

with open('up_pf00344.test.html') as f:
    html = f.read()
soup = bs4.BeautifulSoup(html, 'lxml')
results = soup.find(attrs={'id': 'results'})

# use the first visible header row as the dataframe's column names
for row in results.thead.find_all('tr'):
    if 'display:none' not in row.get('style', ''):
        df = pd.DataFrame(columns=[col.get_text() for col in row.find_all('th')])
        break  # stop only once a visible header row has been found

# append every visible top-level table row to the dataframe
for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue
    df_row = []
    # recursive=False keeps the cells of nested tables out of the outer row
    for col in row.find_all('td', recursive=False):
        table = col.find_all('table')
        # cells holding a table become a DataFrame; all others stay plain text
        # (StringIO because recent pandas deprecates passing literal HTML strings)
        df_row.append(pd.read_html(io.StringIO(str(col)))[0] if table else col.get_text())
    df.loc[len(df)] = df_row

Result of df.iloc[0].map(type):

                                                            <class 'str'>
Entry                                                       <class 'str'>
Organism                                                    <class 'str'>
Protein names                                               <class 'str'>
Gene names                                                  <class 'str'>
Length                                                      <class 'str'>
Cross-reference (Pfam)                                      <class 'str'>
Cross-reference (InterPro)                                  <class 'str'>
Taxonomic lineage IDs                                       <class 'str'>
Subcellular location [CC]                                   <class 'str'>
Signal peptide                                              <class 'str'>
Transit peptide                                             <class 'str'>
Topological domain                  <class 'pandas.core.frame.DataFrame'>
Transmembrane                       <class 'pandas.core.frame.DataFrame'>
Intramembrane                       <class 'pandas.core.frame.DataFrame'>
Sequence caution                                            <class 'str'>
Caution                                                     <class 'str'>
Taxonomic lineage (SUPERKINGDOM)                            <class 'str'>
Taxonomic lineage (KINGDOM)                                 <class 'str'>
Taxonomic lineage (PHYLUM)                                  <class 'str'>
Cross-reference (RefSeq)                                    <class 'str'>
Cross-reference (EMBL)                                      <class 'str'>
e                                                           <class 'str'>
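
The question also mentions a fallback: keeping the raw HTML of each nested table as a string. A minimal sketch of that variant follows. The inline HTML is a made-up miniature of the real page (the id P12345 and the cell values are placeholders, and only two of the columns are kept), and html.parser is used here only so the sketch does not depend on lxml.

```python
import bs4
import pandas as pd

# Hypothetical miniature of the page: one outer row whose second cell holds a table.
html = """
<table id="results">
  <tbody>
    <tr id="P12345">
      <td>P12345</td>
      <td><table><tr><td>1</td><td>20</td></tr></table></td>
    </tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
results = soup.find(attrs={"id": "results"})

rows = []
for row in results.tbody.find_all("tr", recursive=False):
    cells = []
    for col in row.find_all("td", recursive=False):
        nested = col.find("table")
        # Keep the nested table's markup verbatim; otherwise keep the cell text.
        cells.append(str(nested) if nested else col.get_text(strip=True))
    rows.append(cells)

df = pd.DataFrame(rows, columns=["Entry", "Transmembrane"])
```

Storing strings keeps the dataframe easy to serialize, and a cell could still be parsed later on demand with pd.read_html(io.StringIO(cell))[0].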

Bonus: since your table rows have an id, you could use it as the index of your dataframe by writing df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
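
A small sketch of the row-id variant, under the same assumptions as the code above. The inline HTML, the ids P12345 and Q67890, and the two-column layout are invented for illustration, and pd.read_html needs lxml (or bs4 with html5lib) installed.

```python
from io import StringIO

import bs4
import pandas as pd

# Hypothetical miniature of the page with two outer rows carrying ids.
html = """
<table id="results">
  <thead><tr><th>Entry</th><th>Transmembrane</th></tr></thead>
  <tbody>
    <tr id="P12345">
      <td>P12345</td>
      <td><table><tr><td>1</td><td>20</td></tr></table></td>
    </tr>
    <tr id="Q67890">
      <td>Q67890</td>
      <td></td>
    </tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
results = soup.find(attrs={"id": "results"})

columns = [th.get_text() for th in results.thead.find_all("th")]
df = pd.DataFrame(columns=columns)

for row in results.tbody.find_all("tr", recursive=False):
    df_row = []
    for col in row.find_all("td", recursive=False):
        if col.find("table"):
            # Nested table: parse just this cell into its own DataFrame.
            df_row.append(pd.read_html(StringIO(str(col)))[0])
        else:
            df_row.append(col.get_text())
    # Use the row's id attribute as the index label instead of a running number.
    df.loc[row.get("id")] = df_row
```

With the id as index, a nested table can be looked up directly, for example df.loc["P12345", "Transmembrane"].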



Source: https://stackoverflow.com/questions/58280302/parsing-an-html-table-with-pd-read-html-where-cells-contain-full-tables-themselv
