How to convert an HTML table to an array in Python

故里飘歌 2020-12-24 03:23

I have an HTML document, and I want to pull the tables out of this document and return them as arrays. I'm picturing 2 functions, one that finds all the HTML tables in a d

3 Answers
  • 2020-12-24 04:01

    Use BeautifulSoup (I recommend 3.0.8). Finding all tables is trivial:

    import BeautifulSoup
    
    def get_tables(htmldoc):
        soup = BeautifulSoup.BeautifulSoup(htmldoc)
        return soup.findAll('table')
    

    However, in Python, an array (as provided by the standard-library array module) is 1-dimensional and restricted to elementary item types such as integers and floats. So there's no way to squeeze an HTML table into a Python array.
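
    As a small illustration of that restriction, here is a sketch using only the standard library. The variable names are made up for this example:

```python
from array import array

# An array stores items of a single elementary type, chosen by a type code.
ints = array('i', [1, 2, 3])      # 'i' = signed int
floats = array('d', [1.5, 2.5])   # 'd' = double-precision float

# Table cells are text, which an integer array refuses to hold.
try:
    array('i', ['0,0,0', '0,0,1'])
    cell_text_accepted = True
except TypeError:
    cell_text_accepted = False

print(ints.tolist(), floats.tolist(), cell_text_accepted)
```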

    Maybe you mean a Python list instead? That's also 1-dimensional, but anything can be an item, so you could have a list of lists (one sublist per tr tag, I imagine, containing one item per td tag).

    That would give:

    def makelist(table):
      result = []
      allrows = table.findAll('tr')
      for row in allrows:
        result.append([])
        allcols = row.findAll('td')
        for col in allcols:
          thestrings = [unicode(s) for s in col.findAll(text=True)]
          thetext = ''.join(thestrings)
          result[-1].append(thetext)
      return result
    

    This may not yet be quite what you want (it doesn't skip HTML comments, the items of the sublists are unicode strings rather than byte strings, etc.), but it should be easy to adjust.
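
    The code above targets Python 2 and BeautifulSoup 3. On Python 3 the same idea might look roughly like the sketch below. It assumes the bs4 package (BeautifulSoup 4) is installed and uses the built-in "html.parser" backend; the sample string is hypothetical and only there to make the snippet runnable.

```python
from bs4 import BeautifulSoup  # assumption: bs4 is installed (pip install beautifulsoup4)


def get_tables(htmldoc):
    """Return every <table> element found in the HTML string."""
    soup = BeautifulSoup(htmldoc, "html.parser")
    return soup.find_all("table")


def makelist(table):
    """Convert one <table> element into a list of rows, each a list of cell strings."""
    result = []
    for row in table.find_all("tr"):
        # get_text() concatenates the text nodes inside each cell.
        result.append([cell.get_text() for cell in row.find_all("td")])
    return result


# Hypothetical sample document, for illustration only.
sample = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td><b>d</b></td></tr></table>"
tables = [makelist(t) for t in get_tables(sample)]
print(tables)
```

    Note that only td cells are collected, as in the original; header cells (th) would need to be added to the find_all call if they matter.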

  • 2020-12-24 04:07

    A +1 to the question-asker and another to the god of Python.
    Wanted to try this example using lxml and CSS selectors.
    Yes, this is mostly the same as Alex's example:

    from pprint import pprint
    import lxml.html
    markup = lxml.html.fromstring('''<html><body>\
    <table width="600">
        <tr>
            <td width="50%">0,0,0</td>
            <td width="50%">0,0,1</td>
        </tr>
        <tr>
            <td>0,1,0</td>
            <td>0,1,1</td>
        </tr>
    </table>
    <table>
        <tr>
            <td>1,0,0</td>
            <td>1,<blink>0,</blink>1</td>
            <td>1,0,2</td>
            <td><bold>1</bold>,0,3</td>
        </tr>
    </table>
    </body></html>''')
    
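    # cssselect() relies on the separate cssselect package in recent lxml releases.
    tbl = []
    rows = markup.cssselect("tr")
    for row in rows:
      tbl.append(list())
      for td in row.cssselect("td"):
        # text_content() returns the cell's text with nested tags flattened.
        tbl[-1].append(td.text_content())
    
    pprint(tbl)
    # tbl now holds the rows of both tables, in document order:
    # [['0,0,0', '0,0,1'],
    #  ['0,1,0', '0,1,1'],
    #  ['1,0,0', '1,0,1', '1,0,2', '1,0,3']]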
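
    The loop above puts the rows of every table into one flat list. If one list per table is wanted, a possible variation is sketched below. It assumes lxml is installed and swaps the CSS selectors for XPath, which is not what the example above uses; the sample markup is hypothetical.

```python
import lxml.html  # assumption: lxml is installed

# Hypothetical two-table document, for illustration only.
doc = lxml.html.fromstring(
    "<html><body>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<table><tr><td>c</td></tr><tr><td>d</td></tr></table>"
    "</body></html>"
)

tables = []
for table in doc.xpath("//table"):
    rows = []
    for tr in table.xpath(".//tr"):
        # text_content() flattens nested markup inside the cell to plain text.
        rows.append([str(td.text_content()) for td in tr.xpath("./td")])
    tables.append(rows)

print(tables)
```

    With nested tables, ".//tr" would also pick up the inner table's rows, so this sketch only suits flat layouts.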
    
  • 2020-12-24 04:11

    Pandas can extract all of the tables in your HTML into a list of DataFrames right out of the box, saving you from having to parse the page yourself (reinventing the wheel). A DataFrame is a 2-dimensional, labeled data structure, much like a table with named columns.

    I recommend continuing to work with the data via Pandas since it's a great tool, but you can also convert to other formats if you prefer (list, dictionary, csv file, etc.).

    Example

    """Extract all tables from an html file, printing and saving each to csv file."""
    
    import pandas as pd
    
    df_list = pd.read_html('my_file.html')
    
    for i, df in enumerate(df_list):
        print(df)
        df.to_csv('table {}.csv'.format(i))
    

    Getting the html content directly from the web instead of from a file would only require a slight modification:

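    import io
    import pandas as pd
    import requests
    
    # Wrap the decoded page in StringIO: newer pandas deprecates passing raw HTML strings.
    html = requests.get('my_url').text
    df_list = pd.read_html(io.StringIO(html))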
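
    Since the question asked for arrays rather than DataFrames, each DataFrame can be turned back into plain Python lists. The sketch below assumes pandas and an HTML parser it supports (such as lxml) are installed; the sample table and the variable names are hypothetical.

```python
import io

import pandas as pd  # assumption: pandas plus an HTML parser such as lxml are installed

# Hypothetical sample table, for illustration only.
html = """
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""

df = pd.read_html(io.StringIO(html))[0]

rows = df.values.tolist()                    # list of lists, one sublist per row
records = df.to_dict("records")              # list of dicts keyed by column name
with_header = [df.columns.tolist()] + rows   # header row followed by data rows

print(rows)
print(records)
```

    Be aware that read_html infers types, so numeric-looking cells come back as numbers rather than strings.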
    