I have an HTML document, and I want to pull the tables out of this document and return them as arrays. I'm picturing 2 functions, one that finds all the HTML tables in a d
Pandas can extract all of the tables in your HTML into a list of DataFrames with a single call, saving you from having to parse the page yourself (reinventing the wheel). It does rely on a parser library being installed: lxml, or BeautifulSoup4 together with html5lib. A DataFrame is a labeled 2-dimensional data structure with named columns and an index.
I recommend continuing to work with the data in Pandas, since it's a great tool, but you can also convert to other formats if you prefer (list, dictionary, CSV file, etc.).
Example
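"""Extract all tables from an HTML file, printing each and saving it to a CSV file."""
import pandas as pd

# read_html returns a list with one DataFrame per <table> found in the file
df_list = pd.read_html('my_file.html')

for i, df in enumerate(df_list):
    print(df)
    # Files are named 'table 0.csv', 'table 1.csv', ...
    df.to_csv('table {}.csv'.format(i))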
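Getting the HTML content directly from the web instead of from a file would only require a slight modification:
from io import StringIO
import requests

html = requests.get('my_url').text
# Wrap the markup in StringIO; recent pandas versions deprecate passing literal HTML strings
df_list = pd.read_html(StringIO(html))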
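If you do want plain Python structures rather than DataFrames, as the question asks, the conversion mentioned above can be sketched roughly as follows. The small DataFrame here is a made-up stand-in for one element of df_list, and the column names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for one DataFrame returned by pd.read_html()
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [31, 27]})

# Nested list: one inner list per table row (header not included)
rows = df.values.tolist()

# Same data with the column names prepended as the first row
rows_with_header = [df.columns.tolist()] + rows

# List of dictionaries, one per row, keyed by column name
records = df.to_dict('records')

print(rows_with_header)
print(records)
```

Applying the same calls to every element of df_list would give one array per table.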