HTML table to pandas table: Info inside html tags

轮回少年 2021-01-04 21:02

I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like this:


&l         

        
3 answers

一向 (original poster) 2021-01-04 21:49

You could use a regular expression to rewrite the text first, replacing each link tag with its href and link text, and then let pandas parse the result:

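import io, re
import pandas as pd

# Minimal sample markup, assumed to match the output shown below.
tbl = """
<table>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">JonesBlue</a></td>
<td>29</td>
</tr>
<tr>
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</table>
"""
# Replace each <a ... href="URL">text</a> with "URL text".
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
# Wrap the string in StringIO; recent pandas versions deprecate passing literal HTML.
pd.read_html(io.StringIO(tbl))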



which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
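
If you would rather have the href and the link text in separate columns instead of merged into one cell, a small parser can collect them while walking the table. The following is only a sketch using the standard library's html.parser; the class name LinkTableParser and the sample markup are made up for illustration and assume a simple table with at most one link per cell:

```python
from html.parser import HTMLParser


class LinkTableParser(HTMLParser):
    """Collect table rows, storing a link's href and its text as separate cells."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td> or <th>
        self._text = []         # text fragments of the current cell
        self._href = None       # href of a link in the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell, self._text, self._href = True, [], None
        elif tag == "a" and self._in_cell:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # A cell with a link contributes two values: href, then text.
            if self._href is not None:
                self._row.append(self._href)
            self._row.append("".join(self._text).strip())
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample markup, shaped like the table in the answer above.
sample = """
<table>
<tr><td>265</td><td> <a href="/j/jones03.shtml">JonesBlue</a></td><td>29</td></tr>
<tr><td>266</td><td> <a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</table>
"""

parser = LinkTableParser()
parser.feed(sample)
rows = parser.rows
print(rows)
```

The resulting list of rows can be passed to pd.DataFrame(rows) to get a table. Note that a row whose cell has no link ends up one value shorter, so this only lines up when the link appears in the same column of every row.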