How to parse an HTML table with rowspans in Python?

走了就别回头了 2021-01-07 18:57

The problem

I'm trying to parse an HTML table that contains rowspans. Specifically, I'm trying to parse my college schedule.

I'm running into the p

2 answers
  •  我在风中等你
    2021-01-07 19:56

    It may be easier to use bs4's built-in search methods, such as "findAll" (spelled "find_all" in current bs4 releases), to walk the table row by row.

    You could use the following code:

    from pprint import pprint
    from bs4 import BeautifulSoup
    import requests
    
    r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                     "/c/c00025.htm")
    
    content=r.content
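    # Name the parser explicitly; "html.parser" ships with Python and does not
    # insert <tbody>, so the recursive=False row lookups below still work.
    page = BeautifulSoup(content, "html.parser")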
    table=page.find('table')
    trs=table.findAll("tr", {},recursive=False)
    tr_count=0
    trs.pop(0)
    final_table={}
    
    for tr in trs:
        tds=tr.findAll("td", {},recursive=False)
        if tds:
            td_count=0
            tds.pop(0)
            for td in tds:
                if td.has_attr('rowspan'):                              
                    final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                    if int(td.attrs['rowspan'])==4:
                        final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
                    # dict.has_key() was removed in Python 3; use the "in" operator.
                    if str(tr_count)+"-"+str(td_count+1) in final_table:
                        td_count=td_count+1         
                td_count=td_count+1
            tr_count=tr_count+1
    
    roster=[]
    for i in range(0,10): #iterate over time
        for j in range(0,5): #iterate over day
            item=final_table[str(i)+"-"+str(j)]
            if len(item)!=0:    
                block_eind=i+1          
    
                try:
                    # Same text in the next time slot means the block spans two slots.
                    if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                        block_eind=i+2
                except KeyError:
                    pass
    
                try:
                    lokaal=item.split('\r\n \n\n')[0]
                    leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                    vak=item.split('\n \n\r\n')[1]
                except IndexError:  # cell text did not split into room/teacher/subject
                    lokaal=leraar=vak="---"
    
                dayroster = {
                    "dag": j+1,
                    "blok_start": i+1,
                    "blok_eind": block_eind,
                    "lokaal": lokaal,
                    "leraar": leraar,
                    "vak": vak
                }
    
    
                dayroster_double = {
                    "dag": j+1,
                    "blok_start": i,
                    "blok_eind": block_eind,
                    "lokaal": lokaal,
                    "leraar": leraar,
                    "vak": vak
                }
    
                #use to prevent double dict for same event
                if dayroster_double not in roster:
                    roster.append(dayroster)
    
    print(roster)
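
The code above special-cases a rowspan of 4. As a more general illustration, here is a minimal sketch that copies every spanned cell into each grid position it covers, for any rowspan or colspan. The names `expand_spans` and `cells_from_html` are hypothetical helpers, not part of bs4, and the sketch assumes the table of interest is the first `<table>` on the page.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    The text of a spanned cell is copied into every position it covers.
    """
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions no cell covers are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def cells_from_html(html):
    """Read the first <table> in html into the structure expand_spans expects."""
    from bs4 import BeautifulSoup  # imported lazily; only this helper needs bs4

    table = BeautifulSoup(html, "html.parser").find("table")
    rows = []
    for tr in table.find_all("tr"):
        # Ignore rows that belong to tables nested inside a cell.
        if tr.find_parent("table") is not table:
            continue
        rows.append([
            (td.get_text(" ", strip=True),
             int(td.get("rowspan", 1)),
             int(td.get("colspan", 1)))
            for td in tr.find_all(["td", "th"], recursive=False)
        ])
    return rows
```

Calling `expand_spans(cells_from_html(r.content))` would then give a list of rows in which a lesson spanning several time slots appears once per slot, so the result can be indexed as `grid[time][day]` without tracking offsets by hand.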
    
