HTML table to pandas table: Info inside html tags

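I have a large table from the web, accessed via requests and parsed with BeautifulSoup. Part of it looks something like the sample table reproduced in the answers below: each row has an id cell, a name cell that wraps a link, and a value cell. I want the table in pandas, including the information held inside the HTML tags (the href of each link).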

3 Answers

情歌与酒
2021-01-04 21:38

Since this parsing job requires the extraction of both text and attribute
values, it cannot be done entirely "out-of-the-box" by a plain call to
pd.read_html. Some of it has to be done by hand.

Using lxml, you could extract the attribute values with XPath:

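from io import StringIO

import lxml.html as LH
import pandas as pd

content = '''
<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>'''

table = LH.fromstring(content)
# read_html returns one DataFrame per <table>; pandas 2.1+ deprecates passing
# a literal HTML string, so wrap it in StringIO
for df in pd.read_html(StringIO(content)):
    # assumes exactly one link per row, so the list lines up with the rows
    df['refname'] = table.xpath('//tr/td/a/@href')
    # keep only the file stem, e.g. "/j/jones03.shtml" -> "jones03"
    df['refname'] = df['refname'].str.extract(r'([^./]+)[.]', expand=False)
    print(df)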


yields

     0          1   2  refname
0  265  JonesBlue  29  jones03
1  266      Smith  34  smith01




The above may be useful since it requires only a few
extra lines of code to add the refname column.

But both LH.fromstring and pd.read_html parse the HTML, so the work is done twice.
Its efficiency could be improved by removing pd.read_html and
parsing the table once with LH.fromstring:

table = LH.fromstring(content)
# extract the text from `<td>` tags
data = [[elt.text_content() for elt in tr.xpath('td')] 
        for tr in table.xpath('//tr')]
df = pd.DataFrame(data, columns=['id', 'name', 'val'])
for col in ('id', 'val'):
    df[col] = df[col].astype(int)
# extract the href attribute values
df['refname'] = table.xpath('//tr/td/a/@href')
df['refname'] = df['refname'].str.extract(r'([^./]+)[.]')
print(df)


yields

    id        name  val  refname
0  265   JonesBlue   29  jones03
1  266       Smith   34  smith01
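
As a side note, newer pandas releases (1.5 and later, as far as I know) give read_html an extract_links argument, which may cover this case without a second parser. The following is only a minimal sketch under that assumption, with lxml installed as the parsing backend; the column names id, name, val, href and refname are my own choice.

```python
from io import StringIO

import pandas as pd

html = """
<table>
<tr><td>265</td><td><a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>266</td><td><a href="/s/smith01.shtml">Smith</a></td><td>34</td></tr>
</table>"""

# With extract_links="body", every body cell comes back as a (text, href)
# tuple; href is None for cells that contain no link.
df = pd.read_html(StringIO(html), extract_links="body")[0]
df.columns = ["id", "name", "val"]

# Pull the href out of the name column first, then reduce every cell to its text.
df["href"] = df["name"].str[1]
for col in ("id", "name", "val"):
    df[col] = df[col].str[0]
df[["id", "val"]] = df[["id", "val"]].astype(int)

# Same file-stem extraction as above: "/j/jones03.shtml" -> "jones03".
df["refname"] = df["href"].str.extract(r"([^./]+)[.]", expand=False)
print(df)
```

Because the cells arrive as tuples, pandas does not infer numeric types for them, hence the explicit astype(int).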

                                                                        
                                                        
            
            
              
                
一向
2021-01-04 21:49

You could use regular expressions to rewrite the text first, replacing each <a> tag with its href and link text:

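import re
from io import StringIO

import pandas as pd

tbl = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td>29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""
# replace each <a href="...">text</a> with "href text" so read_html sees plain text
tbl = re.sub(r'<a.*?href="(.*?)">(.*?)</a>', r'\1 \2', tbl)
# pandas 2.1+ deprecates passing a literal HTML string, so wrap it in StringIO
pd.read_html(StringIO(tbl))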


which gives you

[     0                           1   2
 0  265  /j/jones03.shtml JonesBlue  29
 1  266      /s/smith01.shtml Smith  34]
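
The href and the link text end up in one column this way. If you want them apart, something like the following might do; it is a rough sketch that assumes the hrefs contain no spaces and that every cell in that column had a link (a cell without one would not match and come out as NaN). The DataFrame is typed in by hand as a stand-in for the read_html result above, and the column names are made up.

```python
import pandas as pd

# Stand-in for the frame returned by read_html in the answer above.
df = pd.DataFrame({
    0: [265, 266],
    1: ["/j/jones03.shtml JonesBlue", "/s/smith01.shtml Smith"],
    2: [29, 34],
})

# The first run of whitespace separates the href from the link text.
parts = df[1].str.extract(r"^(?P<href>\S+)\s+(?P<name>.*)$")

# Put the untouched columns and the two new ones side by side.
result = pd.concat([df[[0, 2]], parts], axis=1)
result.columns = ["id", "val", "href", "name"]
print(result)
```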

                                                                        
                                                        
            
            
              
                
借酒劲吻你
2021-01-04 21:53

You could simply parse the table manually like this:

from bs4 import BeautifulSoup
import pandas as pd

TABLE = """<table>
<tbody>
<tr>
<td>265</td>
<td> <a href="/j/jones03.shtml">Jones</a>Blue</td>
<td >29</td>
</tr>
<tr >
<td>266</td>
<td> <a href="/s/smith01.shtml">Smith</a></td>
<td>34</td>
</tr>
</tbody>
</table>"""

table = BeautifulSoup(TABLE, "html.parser")
records = []
for tr in table.find_all("tr"):
    tds = tr.find_all("td")
    record = []
    record.append(tds[0].text)       # id cell text
    record.append(tds[1].a["href"])  # href of the link in the name cell
    record.append(tds[2].text)       # value cell text
    records.append(record)

df = pd.DataFrame(data=records)
print(df)


which gives you

     0                 1   2
0  265  /j/jones03.shtml  29
1  266  /s/smith01.shtml  34
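
Note that the loop above drops the visible name and fails on a row whose second cell has no link. A slightly more defensive variant could look like the sketch below. The row with id 267 and no link is a made-up example added to show that case, and the column names are my own; treat it as an illustration, not a drop-in solution.

```python
from bs4 import BeautifulSoup
import pandas as pd

# The second row is hypothetical: it has no <a> tag in the name cell.
HTML = """<table>
<tr><td>265</td><td> <a href="/j/jones03.shtml">Jones</a>Blue</td><td>29</td></tr>
<tr><td>267</td><td>NoLink</td><td>41</td></tr>
</table>"""

soup = BeautifulSoup(HTML, "html.parser")
records = []
for tr in soup.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) != 3:
        continue  # skip header rows or rows with an unexpected layout
    link = tds[1].find("a")
    records.append({
        "id": int(tds[0].get_text(strip=True)),
        "name": tds[1].get_text(strip=True),  # visible text of the cell
        "href": link.get("href") if link is not None else None,
        "val": int(tds[2].get_text(strip=True)),
    })

df = pd.DataFrame(records)
print(df)
```

Building a list of dicts gives the columns names from the start, and a missing link shows up as a missing value instead of raising an error.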

                                                                        
                                                        
            
            
              
                