Use Pandas to Get Multiple Tables From Webpage

Submitted by 柔情痞子 on 2021-02-08 09:57:32

Question


I am using Pandas to parse the data from the following page: http://kenpom.com/index.php?y=2014

To get the data, I am writing:

dfs = pd.read_html(url) 

The data looks great and is parsed correctly, except that it only includes the first 40 rows. The problem seems to be the way the tables are separated on the page, which keeps pandas from getting all the information.

How do you get pandas to get all the data from all the tables on that webpage?


Answer 1:


The HTML of the page you posted has multiple <thead> and <tbody> tags, which confuses pandas.read_html.

Following this SO thread, you can manually unwrap those tags:

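import urllib.request
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()

# Fix the HTML: remove the <thead>/<tbody> wrappers but keep their rows
soup = BeautifulSoup(html_table, "html.parser")
# Note: the id "ratings-table" is specific to this page
for table in soup.find_all(attrs={'id': 'ratings-table'}):
    # Copy the children into a list first, because unwrap() modifies the tree
    for c in list(table.children):
        if c.name in ['tbody', 'thead']:
            c.unwrap()

# Wrap the markup in StringIO; recent pandas versions deprecate passing a raw HTML string
dfs = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(dfs[0])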

which returns 369.
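
To check whether another page has the same structure before applying this fix, you can count the <thead> and <tbody> sections in each table. The following is only a minimal sketch using the standard library's html.parser; the names TableSectionCounter, count_table_sections and sample_html are made up for illustration, and the sample markup is a tiny stand-in, not the real page.

```python
from html.parser import HTMLParser


class TableSectionCounter(HTMLParser):
    """Count <thead> and <tbody> sections inside each <table>."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one dict per table, in document order
        self._stack = []   # indices of currently open tables (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append({"id": dict(attrs).get("id"), "thead": 0, "tbody": 0})
            self._stack.append(len(self.tables) - 1)
        elif tag in ("thead", "tbody") and self._stack:
            # Attribute the section to the innermost open table
            self.tables[self._stack[-1]][tag] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self._stack.pop()


def count_table_sections(html):
    """Return one {'id', 'thead', 'tbody'} dict per table found in html."""
    parser = TableSectionCounter()
    parser.feed(html)
    return parser.tables


# Hypothetical sample: one table split into repeated header/body sections
sample_html = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>1</td><td>A</td></tr></tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody><tr><td>2</td><td>B</td></tr></tbody>
</table>
"""

sections = count_table_sections(sample_html)
print(sections)  # [{'id': 'ratings-table', 'thead': 2, 'tbody': 2}]
```

A table that reports more than one thead or tbody is a candidate for the unwrapping step above. For a live page you would pass the decoded response body to count_table_sections instead of sample_html.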



Source: https://stackoverflow.com/questions/42225204/use-pandas-to-get-multiple-tables-from-webpage
