Question
This is my html:
import pandas as pd
html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''
If I run df = pd.read_html(html_table) and then print(df[0]), I get:
Col1 Col2
0 1a 2a
The second tbody disappears (the 1b/2b row is missing). Why? How can I prevent it?
Answer 1:
The multiple tbody elements in the HTML you have posted are what confuse the pandas parser logic. If you cannot fix the input HTML itself, you have to pre-parse it and "unwrap" all the tbody elements:
import pandas as pd
from bs4 import BeautifulSoup
html_table = '''
<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''
# fix HTML: drop each <tbody> wrapper but keep the rows inside it
soup = BeautifulSoup(html_table, "html.parser")
for body in soup("tbody"):
    body.unwrap()
df = pd.read_html(str(soup), flavor="bs4")
print(df[0])
Prints:
Col1 Col2
0 1a 2a
1 1b 2b
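As an alternative sketch that does not depend on how read_html treats tbody, the rows can be collected with the standard library's html.parser and then handed to the DataFrame constructor. The class name TableRows and its attributes are made up for this illustration, and it assumes a single, non-nested table whose cells contain plain text:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect <th> and <td> texts from one simple, non-nested table."""

    def __init__(self):
        super().__init__()
        self.header = []    # text of every <th> cell
        self.rows = []      # one list of <td> texts per <tr>
        self._current = []  # <td> texts of the row being read
        self._cell = None   # "th" or "td" while inside a cell
        self._text = ""     # text collected for the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []
        elif tag in ("th", "td"):
            self._cell = tag
            self._text = ""

    def handle_data(self, data):
        # Ignore whitespace and text outside of cells.
        if self._cell is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell == tag:
            target = self.header if tag == "th" else self._current
            target.append(self._text.strip())
            self._cell = None
        elif tag == "tr":
            # <tbody> tags are never inspected, so rows from every
            # tbody end up in the same list.
            if self._current:
                self.rows.append(self._current)
            self._current = []


html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''

parser = TableRows()
parser.feed(html_table)
parser.close()
print(parser.header)
print(parser.rows)
```

With the rows in hand, pd.DataFrame(parser.rows, columns=parser.header) should give the same two-row frame as above.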
Answer 2:
Having multiple tbody tags causes the problem when pd.read_html() is called. Multiple tbody tags are legal in HTML5 and can be convenient for styling, but it looks like pd.read_html() does not support them. If you can use a single <tbody>, it works fine:
import pandas as pd
html_table1 = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''
df1 = pd.read_html(html_table1)
print(df1)
Prints:
[ Col1 Col2
0 1a 2a
1 1b 2b]
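If the HTML arrives as a string you cannot edit by hand, one way to reach the single-tbody form shown above is to delete the boundary between adjacent tbody sections before parsing. This is only a minimal sketch: merge_tbodies is a hypothetical helper name, regular expressions are a fragile tool for HTML, and the approach assumes no nested tables and that any attributes on the later tbody tags can be discarded.

```python
import re


def merge_tbodies(html: str) -> str:
    """Join adjacent <tbody> sections by deleting the boundary between them."""
    # A closing </tbody>, optional whitespace, then an opening <tbody ...>.
    boundary = re.compile(r"</tbody\s*>\s*<tbody\b[^>]*>", re.IGNORECASE)
    return boundary.sub("", html)


html_table = '''<table>
<thead>
<tr><th>Col1</th><th>Col2</th>
</thead>
<tbody>
<tr><td>1a</td><td>2a</td></tr>
</tbody>
<tbody>
<tr><td>1b</td><td>2b</td></tr>
</tbody>
</table>'''

merged = merge_tbodies(html_table)
print(merged)
```

The merged string keeps all rows inside one tbody and can then be passed to pd.read_html() in the same way as html_table1.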
Source: https://stackoverflow.com/questions/36314588/how-to-read-an-html-table-with-multiple-tbodies-with-python-pandas-read-html