HTML table to pandas table: Info inside html tags

轮回少年 2021-01-04 21:02

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:


3 Answers
  • 2021-01-04 21:38

    Since this parsing job requires extracting both text and attribute values, it cannot be done entirely "out-of-the-box" by a function such as pd.read_html. Some of it has to be done by hand.
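
As a side note, newer pandas releases (1.5 and later, to my knowledge) narrow this gap with the `extract_links` option of `pd.read_html`. The following is only a minimal sketch under that version assumption; the flattened table and the variable names `names` and `hrefs` are illustrative, not from the question.

```python
from io import StringIO

import pandas as pd

html = """
<table>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</table>"""

# extract_links="body" turns every body cell into a (text, href) tuple;
# cells without a link carry None as the href.
df = pd.read_html(StringIO(html), extract_links="body")[0]

# Unpack the tuples of the name column (column 1) into text and link.
names = df[1].map(lambda cell: cell[0])
hrefs = df[1].map(lambda cell: cell[1])
print(hrefs.tolist())
```

Every body cell becomes a tuple, so the remaining columns still need unpacking and type conversion; on older pandas versions the manual approaches in this thread remain the way to go.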

    Using lxml, you could extract the attribute values with XPath:

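    from io import StringIO
    
    import lxml.html as LH
    import pandas as pd
    
    content = '''
    <table>
    <tbody>
    <tr>
    <td>265</td>
    <td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
    <td >29</td>
    </tr>
    <tr >
    <td>266</td>
    <td> <a href="/s/smith01.shtml">Smith</a></td>
    <td>34</td>
    </tr>
    </tbody>
    </table>'''
    
    table = LH.fromstring(content)
    # recent pandas versions expect a file-like object rather than a literal HTML string
    for df in pd.read_html(StringIO(content)):
        # one href per row, in document order
        df['refname'] = table.xpath('//tr/td/a/@href')
        # keep the file name without directory or extension
        df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
        print(df)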
    

    yields

         0          1   2  refname
    0  265  JonesBlue  29  jones03
    1  266      Smith  34  smith01
    

    The above may be useful since it requires only a few extra lines of code to add the refname column.

    But both LH.fromstring and pd.read_html parse the HTML, so the work is done twice. Its efficiency could be improved by removing pd.read_html and parsing the table once with LH.fromstring:

    table = LH.fromstring(content)
    # extract the text from `<td>` tags
    data = [[elt.text_content() for elt in tr.xpath('td')] 
            for tr in table.xpath('//tr')]
    df = pd.DataFrame(data, columns=['id', 'name', 'val'])
    for col in ('id', 'val'):
        df[col] = df[col].astype(int)
    # extract the href attribute values
    df['refname'] = table.xpath('//tr/td/a/@href')
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
    print(df)
    

    yields

        id        name  val  refname
    0  265   JonesBlue   29  jones03
    1  266       Smith   34  smith01
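
One thing to watch with the XPath `//tr/td/a/@href`: it returns one value per link, not one per row, so a row without a link would leave the list shorter than the table and the column assignment would fail. A minimal sketch that looks up the link row by row instead, assuming the link always sits in the second cell; `parse_rows` is a hypothetical helper name and the third row is made up for illustration:

```python
import lxml.html as LH
import pandas as pd

content = '''
<table>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
<tr><td>267</td><td>Brown</td><td>41</td></tr>
</table>'''


def parse_rows(html):
    """Return one record per <tr>, with None where a row has no link."""
    table = LH.fromstring(html)
    records = []
    for tr in table.xpath('//tr'):
        cells = [td.text_content().strip() for td in tr.xpath('td')]
        hrefs = tr.xpath('td[2]/a/@href')  # empty list if the cell has no link
        records.append(cells + [hrefs[0] if hrefs else None])
    return pd.DataFrame(records, columns=['id', 'name', 'val', 'href'])


df = parse_rows(content)
# rows without a link end up with a missing refname
df['refname'] = df['href'].str.extract(r'([^./]+)[.]', expand=False)
print(df)
```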
    
  • 2021-01-04 21:49

    You could use regular expressions to modify the text first and remove the html tags:

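    import re
    from io import StringIO
    
    import pandas as pd
    
    tbl = """<table>
    <tbody>
    <tr>
    <td>265</td>
    <td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
    <td>29</td>
    </tr>
    <tr >
    <td>266</td>
    <td> <a href="/s/smith01.shtml">Smith</a></td>
    <td>34</td>
    </tr>
    </tbody>
    </table>"""
    # replace each <a href="...">text</a> with "href text" so the link survives as cell text
    tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
    # recent pandas versions expect a file-like object rather than a literal HTML string
    pd.read_html(StringIO(tbl))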
    

    which gives you

    [     0                           1   2
     0  265  /j/jones03.shtml JonesBlue  29
     1  266      /s/smith01.shtml Smith  34]
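
The link and the name now share one column. A short sketch of how they could be separated again, assuming the href itself never contains a space; the frame is rebuilt by hand here and the column names `href` and `name` are only illustrative:

```python
import pandas as pd

# The frame produced by the regex approach, rebuilt by hand for illustration.
df = pd.DataFrame({
    0: [265, 266],
    1: ["/j/jones03.shtml JonesBlue", "/s/smith01.shtml Smith"],
    2: [29, 34],
})

# Split on the first space only: the left part is the href, the rest is the name.
parts = df[1].str.split(" ", n=1, expand=True)
df["href"] = parts[0]
df["name"] = parts[1]
df = df.drop(columns=1)
print(df)
```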
    
  • 2021-01-04 21:53

    You could simply parse the table manually like this:

    from bs4 import BeautifulSoup
    import pandas as pd
    
    TABLE = """<table>
    <tbody>
    <tr>
    <td>265</td>
    <td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
    <td >29</td>
    </tr>
    <tr >
    <td>266</td>
    <td> <a href="/s/smith01.shtml">Smith</a></td>
    <td>34</td>
    </tr>
    </tbody>
    </table>"""
    
    # BeautifulSoup 4 with the parser bundled with Python
    table = BeautifulSoup(TABLE, "html.parser")
    records = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        record = []
        record.append(tds[0].text)       # text of the first cell
        record.append(tds[1].a["href"])  # link target inside the second cell
        record.append(tds[2].text)       # text of the third cell
        records.append(record)
    
    df = pd.DataFrame(data=records)
    df
    

    which gives you

         0                 1   2
    0  265  /j/jones03.shtml  29
    1  266  /s/smith01.shtml  34
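
The loop above keeps the href but drops the visible name. A sketch that keeps both and also derives the short reference name, assuming BeautifulSoup 4 (`bs4`) is installed; the column names and the use of `PurePosixPath(...).stem` are choices made for this illustration:

```python
from pathlib import PurePosixPath

import pandas as pd
from bs4 import BeautifulSoup

TABLE = """<table>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</table>"""

soup = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    link = tds[1].find("a")  # None if the cell has no link
    href = link["href"] if link else None
    records.append({
        "id": int(tds[0].get_text(strip=True)),
        "name": tds[1].get_text(strip=True),
        "val": int(tds[2].get_text(strip=True)),
        "href": href,
        # "/j/jones03.shtml" -> "jones03"
        "refname": PurePosixPath(href).stem if href else None,
    })

df = pd.DataFrame(records)
print(df)
```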
    